Groups > comp.lang.java.programmer > #19856 > unrolled thread
| Started by | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| First post | 2012-11-22 13:36 -0800 |
| Last post | 2012-11-26 02:46 +0000 |
| Articles | 20 on this page of 39 — 10 participants |
A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 13:36 -0800
Re: A proposal to handle file encodings Joerg Meier <joergmmeier@arcor.de> - 2012-11-22 23:36 +0100
Re: A proposal to handle file encodings markspace <-@.> - 2012-11-22 17:20 -0800
Re: A proposal to handle file encodings Arne Vajhøj <arne@vajhoej.dk> - 2012-11-22 20:25 -0500
Re: A proposal to handle file encodings markspace <-@.> - 2012-11-22 19:47 -0800
Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 21:28 -0800
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-24 15:51 +0000
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 10:18 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 18:05 +0000
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-27 19:51 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-29 02:22 +0000
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-12-02 13:02 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-12-02 19:36 +0000
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-12-02 23:52 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-12-02 23:08 +0000
Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 13:13 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 18:07 +0000
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-23 16:33 +0100
Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 09:02 -0800
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-23 19:21 +0100
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:11 +0100
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-24 00:53 +0100
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 09:13 +0100
Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 06:50 -0800
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 10:07 +0100
Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 11:06 -0600
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-27 19:28 +0100
Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 06:42 -0800
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 09:57 +0100
Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 15:09 +0100
Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 15:06 +0100
Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-23 16:43 -0600
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-24 01:02 +0100
Re: A proposal to handle file encodings BGB <cr88192@hotmail.com> - 2012-11-25 14:36 -0600
Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 16:51 -0600
Re: A proposal to handle file encodings BGB <cr88192@hotmail.com> - 2012-11-25 17:54 -0600
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-26 02:03 +0100
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-26 02:20 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-26 02:46 +0000
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-11-22 13:36 -0800 |
| Subject | A proposal to handle file encodings |
| Message-ID | <lb6ta81u9imfdtlpuesoc8slncju0ehsnm@4ax.com> |
The problem with encodings is that they are not attached to or embedded
in a file in any way. You are just supposed to know how a file is
encoded.

Here is my idea to solve the problem.

We invent a new encoding.

Files in this encoding begin with a 0 byte, then an ASCII string giving
the name of a conventional encoding, then another 0 byte.

When you read a file with this encoding, the header is invisible to
your application. When you write a file, a header for a UTF8 file gets
written automatically.

You write your app telling it to read and write this new encoding, e.g.
"labeled".

You can write a utility to import files into your labelled universe by
detecting or guessing or being told the encoding. It gets a header.
Other than that the file is unmodified.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
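To make the proposed layout concrete, here is a minimal sketch in Java of the header described above: a 0 byte, the ASCII name of the real encoding, another 0 byte, then the unmodified text. The class and method names (`LabeledText`, `encode`, `decode`) are invented for illustration and are not part of the proposal, and the sketch works on byte arrays rather than hooking into the charset machinery.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating the "labeled" layout: 0, ASCII name, 0, payload. */
final class LabeledText {

    /** Encodes the text in the given charset and prepends the label header. */
    static byte[] encode(String text, Charset charset) {
        byte[] name = charset.name().getBytes(StandardCharsets.US_ASCII);
        byte[] body = text.getBytes(charset);
        byte[] out = new byte[name.length + 2 + body.length];
        // out[0] and out[name.length + 1] stay 0: they are the two delimiters.
        System.arraycopy(name, 0, out, 1, name.length);
        System.arraycopy(body, 0, out, name.length + 2, body.length);
        return out;
    }

    /** Writing defaults to UTF-8, as the proposal suggests. */
    static byte[] encode(String text) {
        return encode(text, StandardCharsets.UTF_8);
    }

    /** Strips the header and decodes the payload with the charset it names. */
    static String decode(byte[] data) {
        if (data.length < 2 || data[0] != 0) {
            throw new IllegalArgumentException("missing label header");
        }
        int end = 1;
        while (end < data.length && data[end] != 0) {
            end++;
        }
        if (end == data.length) {
            throw new IllegalArgumentException("unterminated label header");
        }
        String name = new String(data, 1, end - 1, StandardCharsets.US_ASCII);
        Charset charset = Charset.forName(name); // throws if the name is unknown here
        return new String(data, end + 1, data.length - end - 1, charset);
    }
}
```

Only the first 0 byte after the name ends the header, so payloads that themselves contain 0 bytes (UTF-16, for instance) are not a problem for this sketch. An encoding name that the reading system does not know makes `Charset.forName` throw, which is one of the practical problems raised in the replies.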
| From | Joerg Meier <joergmmeier@arcor.de> |
|---|---|
| Date | 2012-11-22 23:36 +0100 |
| Message-ID | <w7onpynnam8i$.9pkacjwybidu$.dlg@40tude.net> |
| In reply to | #19856 |
On Thu, 22 Nov 2012 13:36:16 -0800, Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
> Here is my idea to solve the problem.
> We invent a new encoding.
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF8 file gets
> written automatically.
> You write your app telling it to read and write this new encoding e.g.
> "labeled".
> You can write a utility to import files into your labelled universe by
> detecting or guessing or being told the encoding. It gets a header.
> Other than that the file is unmodified.

I can't tell whether you are being serious or riffing on that old
"You have 25 standards" joke.

However, in case you are serious, this ugly and error-prone hack really
belongs with a language capable of realizing OS-level/file-system black
magic like that in a somewhat sensible way. Like C.

Kind regards,
Joerg
--
I don't read my emails, so replies by email will unfortunately go unread.
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-22 17:20 -0800 |
| Message-ID | <k8mj0d$2m2$1@dont-email.me> |
| In reply to | #19856 |
On 11/22/2012 1:36 PM, Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.

http://xkcd.com/927/
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-22 20:25 -0500 |
| Message-ID | <50aed080$0$292$14726298@news.sunsite.dk> |
| In reply to | #19856 |
On 11/22/2012 4:36 PM, Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.
>
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.
>
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF8 file gets
> written automatically.
>
> You write your app telling it to read and write this new encoding e.g.
> "labeled".

It is a bad idea to have meta data in the file body. This meta data
should be where the rest of the meta data is.

But even if it were moved to the file info area, I doubt the idea
is good.

It enforces a limitation that a text file will only have one
encoding; that limitation does not exist today.

There are practical problems:
* different systems support different encodings (sometimes the same
  encoding has different names) - what should a system do with an
  unknown encoding
* there will be a huge number of legacy files without this meta
  data - what should a system do with those

And even if those problems were solved - would it really create
any benefits?

It would take many years to get such an approach approved and
widely implemented. Most likely >10 years. By that time I would
expect UTF-8 to be almost universally used for new text files,
making this proposal obsolete.

> You can write a utility to import files into your labelled universe by
> detecting or guessing or being told the encoding.

Which just repeats the existing problems.

> It gets a header.
> Other than that the file is unmodified.

Solved much more easily by using meta data.

Arne
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-22 19:47 -0800 |
| Message-ID | <k8mrjv$5l6$1@dont-email.me> |
| In reply to | #19859 |
On 11/22/2012 5:25 PM, Arne Vajhøj wrote:
>
> Solved much easier by using meta data.
I think Roedy is talking about the physical encoding of the meta data.
I personally agree with him in this regard: meta data should be encoded
into the physical file.
Consider for example a meta data format that we all use: the Jar file.
Each single Jar file is actually composed of many pieces of information.
Class files, resources, libraries, the manifest file, etc. And yet
it's all encoded into a single physical file. You never lose pieces of
the file just because you made a copy of the file. You never have to
worry about the meta data changing on a new system just because it's *new*.
Contrast that with other schemes. Macintosh, I believe, uses a meta
data format where the data is in one file, and the meta data occupies a
second physical file with a name like .file-name.meta (I don't use Macs
so I'm not 100% sure). So if you use a raw copy command ("cp" from the
Unix command line) you *don't* get the meta data, because you forgot to
copy it.
I hope you can all quickly see how obviously broken that is. Since we
all use Jar files I think you can all reflect on the idea that it's a
good solution. Have you ever had a problem with a Jar file retaining
its meta data? Is it ever desirable to have a Jar file's meta data
revert to nulls just because you FTP'ed the file someplace? I've never
desired that "feature".
It seems obvious to me. Encoding the meta data into a single physical
file is by far the better solution.
No, where I think Roedy goes wrong is to invent a *new* file format. My
solution: use what's there already, just use Jar files.
Proposal: Add a property "Data-Archive" like so:
Manifest-Version: 1.0
Data-Archive: /data
Where the value of the Data-Archive is the path to the primary data
stream (within the Zip/Jar file). You can just add an encoding or
mime-type or any other property to the manifest you like to describe
your data stream and you're set.
Note that this is already being done. Open Office uses Jar files as its
native file format. They just rename the extension as they wish, and
open the file appropriately for a Jar file. They also store a lot more
meta data than just a couple of properties, so they effectively have
their own format, not this simple one.
It might be useful to try to solve some common cases for data and
meta-data. What I've got here is a single data stream and a single
"type" property. It wouldn't be hard to extend this to several streams
and several properties each. I think that would be the only other
useful general case; after that you should just roll your own solution.
BTW if anyone is copying this up to their website (mindprod), please
credit appropriately: Brenden Towey.
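As a rough illustration of the Jar-based idea in the post above, the following Java sketch packs one text stream into a Jar whose manifest carries the `Data-Archive` property and reads it back. It is only a sketch under stated assumptions: the `Data-Encoding` attribute name is made up here (the post only says you can add an encoding property), the entry is stored as `data` because zip entry names normally have no leading slash, and the class `DataJar` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

/** Hypothetical helper: one data stream plus its encoding, stored in a Jar. */
final class DataJar {
    static final String DATA_ARCHIVE = "Data-Archive";   // name taken from the post
    static final String DATA_ENCODING = "Data-Encoding"; // assumed name, not from the post

    /** Builds an in-memory Jar holding the text and a manifest describing it. */
    static byte[] pack(String text, Charset charset) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0"); // required, or nothing is written
        main.putValue(DATA_ARCHIVE, "/data");
        main.putValue(DATA_ENCODING, charset.name());

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(buffer, manifest)) {
            jar.putNextEntry(new JarEntry("data"));
            jar.write(text.getBytes(charset));
            jar.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    /** Reads the manifest, finds the named entry and decodes it. */
    static String unpack(byte[] jarBytes) {
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                throw new IllegalArgumentException("no manifest");
            }
            String path = manifest.getMainAttributes().getValue(DATA_ARCHIVE);
            String encoding = manifest.getMainAttributes().getValue(DATA_ENCODING);
            if (path == null || encoding == null) {
                throw new IllegalArgumentException("manifest lacks data properties");
            }
            String entryName = path.startsWith("/") ? path.substring(1) : path;
            for (JarEntry entry = jar.getNextJarEntry(); entry != null; entry = jar.getNextJarEntry()) {
                if (entry.getName().equals(entryName)) {
                    return new String(jar.readAllBytes(), Charset.forName(encoding));
                }
            }
            throw new IllegalArgumentException("entry not found: " + entryName);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Because the manifest travels inside the archive, copying or FTP'ing the file keeps the encoding label with the data, which is the property the post argues for. The cost is that ordinary text tools can no longer read the file directly.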
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-11-22 21:28 -0800 |
| Message-ID | <jd1ua8h786rv5qrm2ejtt5kge0jeh0c7kr@4ax.com> |
| In reply to | #19863 |
On Thu, 22 Nov 2012 19:47:09 -0800, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :
>Each single Jar file is actually composed of many pieces of information.
> Class files, resources, libraries, the manifest file, etc. And yet
>it's all encoded into a single physical file. You never lose pieces of
>the file just because you made a copy of the file. You never have to
>worry about the meta data changing on a new system just because it's *new*.

Yes, yes! The OS people have proved incompetent at keeping metadata
separately from the file. We need formats where the metadata is part
of the file. With text files the most important piece of metadata is
the encoding.

We do it sometimes: jpg, jar, csv (sometimes), video files.

More generally, the mime type is something you should be able to get
with File.getMime()

Imagine if you could do:
File.getEncoding()
File.getVersion()
File.getCopyrightOwner()
File.getCopyrightDate()

A metadata-compliant file would look just like any other, but with a
header of the form
0 <meta>...</meta> 0

The meta data could be stored as XML. That gives you the ability to add
extra info without having to change the standard. The header is in
7-bit ASCII.

We should be using somewhat more complicated formats for files with
embedded metadata. As an application programmer you want to be able
to have the system parse it for you. You get to pretend it is not
there, but with the ability to query it.

This reminds me a bit of the innovation of ANSI labelled mag tapes
back in the 60s.

The dBase people got this right long ago. You don't go writing files
without a header describing the format of what was in the file.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2012-11-24 15:51 +0000 |
| Message-ID | <k8qqev$ehi$1@localhost.localdomain> |
| In reply to | #19864 |
On Thu, 22 Nov 2012 21:28:55 -0800, Roedy Green wrote:
>
> The dBase people got this right long ago. You don't go writing files
> without a header describing the format of what was in the file.
>
IBM got it pretty much right in the OS/400 operating system. The metadata,
which is held in the filing system catalogue, is transparently and
permanently associated with the file. It's a general mechanism: the system
provides standard metadata for source files, executables etc. and the
developer creates the metadata for, e.g., fixed field data files with
keyed access. The only demerit is that it uses a rather ugly two-level
filing system.

The UNIX/Linux equivalent would be to keep the meta-data in the file's
inode alongside the access permissions and to modify the file copy and
move operations to silently handle the metadata by duplicating that part
of the inode as and when needed.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-11-25 10:18 +0100 |
| Message-ID | <slrnkb3ojp.qr8.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #19896 |
On 2012-11-24 15:51, Martin Gregorie <martin@address-in-sig.invalid> wrote:
> IBM got it pretty much right in the OS/400 operating system. The metadata,
> which is held in the filing system catalogue, is transparently and
> permanently associated with the file. It's a general mechanism: the system
> provides standard metadata for source files, executables etc. and the
> developer creates the metadata for, e.g. fixed field data files with
> keyed access. The only demerit is that it uses a rather ugly two level
> filing system.
>
> The UNIX/Linux equivalent would be to keep the meta-data in the file's
> inode alongside the access permissions

File attributes have existed on ext* filesystems for a very long time.

> and to modify the file copy and move operations

There is no file copy operation on the OS level. The kernel just sees
that a process is creating and writing a new file. It doesn't know
whether this process intends this new file to be an identical copy of
some other file.

rename(2) of course preserves file attributes, because it doesn't change
the file at all (except the ctime entry), only the directories linking
to it.

cp, rsync, tar, etc. have options to copy the attributes along with the
"normal" content. But the problem is that there are a lot of utilities
working on files and they would all have to be modified. And worse,
there isn't any standard for using those attributes, so nobody uses
them, so there is little incentive to modify the utilities.

hp
--
_ | Peter J. Holzer | Curse of electronic word processing:
|_|_) | Sysadmin WSR | you keep polishing at your text until
| | | hjp@hjp.at | the parts of the sentence no longer
__/ | http://www.hjp.at/ | fits together. -- Ralph Babel
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2012-11-25 18:05 +0000 |
| Message-ID | <k8tmlj$6jr$3@localhost.localdomain> |
| In reply to | #19924 |
On Sun, 25 Nov 2012 10:18:49 +0100, Peter J. Holzer wrote:
>
> File attributes have existed on ext* filesystems for a very long time.
>
Yes, but only pretty basic ones. Here we're talking about hypothetically
storing stuff like character encoding or, as I suggested, the record and
key definitions for an indexed sequential file or a DBMS table. As I said,
I know OSen that do this type of thing: it works well and even supports
things like letting compilers access the metadata. This lets special
C preprocessors generate #includes from it, COBOL COPY statements access
it directly and ODBC/JDBC drivers use it at runtime.

> There is no file copy operation on the OS level. The kernel just sees
> that a process is creating and writing a new file. It doesn't know
> whether this process intends this new file to be an identical copy of
> some other file.
>
Of course, but if the metadata is external to the file, as it is in the
'other fork' in an Apple filing system, you still have to make sure that
cp, mv and friends have all been rewritten to handle that. You may well
find that it's easier to pull metadata management into the kernel because
then you've only got one piece of code to maintain rather than tweaks in
umpteen utility programs and libraries.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-11-27 19:51 +0100 |
| Message-ID | <slrnkba2tp.k8a.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #19946 |
On 2012-11-25 18:05, Martin Gregorie <martin@address-in-sig.invalid> wrote:
> On Sun, 25 Nov 2012 10:18:49 +0100, Peter J. Holzer wrote:
>> File attributes have existed on ext* filesystems for a very long time.
>>
> Yes, but only pretty basic ones.

They are arbitrary key/value pairs. You can put any information there;
there is no restriction to "basic" information (whatever that might be).
They are limited to a single block (typically 4kB), though, so MIME
type, character set, keywords, etc. are ok, but a thumbnail image might
be problematic.

> Here we're talking about hypothetically storing stuff like character
> encoding

This one is even somewhat standardized: user.charset is documented on
http://www.freedesktop.org/wiki/CommonExtendedAttributes which probably
means that some GUI programs are actually using it (besides the Apache
module where it originated).

To return to the topic of this group: Is there a Java library for
setting and retrieving xattrs?

>> There is no file copy operation on the OS level. The kernel just sees
>> that a process is creating and writing a new file. It doesn't know
>> whether this process intends this new file to be an identical copy of
>> some other file.
>>
> Of course, but if the metadata is external to the file as it is in the
> 'other fork' in an Apple filing system, you still have to make sure that
> cp, mv and friends have all been rewritten to handle that.

Why "but"? That's exactly what I wrote. The kernel doesn't know what
a process is intending to do with a file, therefore programs like cp,
tar, etc. must be rewritten to handle xattrs explicitly. (And many of
them have been rewritten, of course. Xattrs aren't new.)

> You may well find that its easier to pull metadata management into the
> kernel because then you've only got one piece of code to maintain
> rather than tweaks in umpteen utility programs and libraries.

The problem is that this just doesn't fit into the Unix system call
scheme. There is no "copy" system call. The kernel just sees that a
process opens one file for reading and another file for writing. It
cannot assume that this process wants to copy the metadata from the
first to the second file. Of course Linux could introduce such a system
call, but then those umpteen utility programs and libraries would still
have to be modified to use that new system call.

hp
--
_ | Peter J. Holzer | Curse of electronic word processing:
|_|_) | Sysadmin WSR | you keep polishing at your text until
| | | hjp@hjp.at | the parts of the sentence no longer
__/ | http://www.hjp.at/ | fits together. -- Ralph Babel
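On the question of a Java library for xattrs: Java 7's NIO.2 has `java.nio.file.attribute.UserDefinedFileAttributeView`, which can set and retrieve user-defined attributes where the file store supports them. The sketch below is a minimal, best-effort example and not a definitive recipe. It assumes, as I understand the Linux implementation, that an attribute named `charset` ends up as the `user.charset` xattr mentioned above; the class name `CharsetXattr` and the choice to return `false` or an empty `Optional` on failure are illustrative decisions of mine.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;
import java.util.Optional;

/** Hypothetical helper: store and fetch a file's charset as a user-defined attribute. */
final class CharsetXattr {
    // Assumption: on Linux the NIO provider adds the "user." namespace prefix.
    static final String ATTR = "charset";

    /** Returns the attribute view, or null if the file store does not support it. */
    private static UserDefinedFileAttributeView viewOf(Path file) throws IOException {
        if (!Files.getFileStore(file).supportsFileAttributeView(UserDefinedFileAttributeView.class)) {
            return null;
        }
        return Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
    }

    /** Best effort: returns false if the attribute could not be written. */
    static boolean write(Path file, Charset charset) {
        try {
            UserDefinedFileAttributeView view = viewOf(file);
            if (view == null) {
                return false;
            }
            view.write(ATTR, StandardCharsets.US_ASCII.encode(charset.name()));
            return true;
        } catch (IOException ex) {
            return false; // e.g. xattrs disabled on this mount
        }
    }

    /** Returns the labelled charset, or empty if the file carries no usable label. */
    static Optional<Charset> read(Path file) {
        try {
            UserDefinedFileAttributeView view = viewOf(file);
            if (view == null || !view.list().contains(ATTR)) {
                return Optional.empty();
            }
            ByteBuffer buffer = ByteBuffer.allocate(view.size(ATTR));
            view.read(ATTR, buffer);
            buffer.flip();
            String name = StandardCharsets.US_ASCII.decode(buffer).toString().trim();
            return Optional.of(Charset.forName(name));
        } catch (IOException | IllegalArgumentException ex) {
            return Optional.empty(); // unreadable attribute or unknown charset name
        }
    }
}
```

Whether the write succeeds depends on the file system and mount options, which is why the sketch reports failure instead of throwing.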
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2012-11-29 02:22 +0000 |
| Message-ID | <k96gtp$eia$1@localhost.localdomain> |
| In reply to | #19998 |
On Tue, 27 Nov 2012 19:51:37 +0100, Peter J. Holzer wrote:
> The problem is that this just doesn't fit into the Unix system call
> scheme. There is no "copy" system call. The kernel just sees that a
> process opens one file for reading and another file for writing. It
> cannot assume that this process wants to copy the metadata from the
> first to the second file.
>
Of course.

> Of course Linux could introduce such a system
> call, but then those umpteen utility programs and libraries would still
> have to be modified to use that new system call.
>
I can see two ways of handling it:

(1) introduce a pair of system calls to retrieve and store the metadata
associated with a file. Yes, programs would need modification, but the
amount would be trivial because you'd be looking at one extra line of
code per file involved in the metadata transfer.

(2) alternatively it may be possible to do the job by adding a mode or
two to the file opening operations. If they were defaulted
appropriately, many programs could silently copy the metadata along with
the data and/or automagically apply the appropriate transforms, such as
charset transforms, during the transfer.

Thinking about it a little more, (2) is definitely the best solution
because it would be rather useful to be able to default the metadata
applied to a new file with a similar mechanism to that used for the
permission bits. This sort of thing would be much easier to manage if it
was built in to the filing system.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-12-02 13:02 +0100 |
| Message-ID | <slrnkbmgqj.loc.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #20015 |
On 2012-11-29 02:22, Martin Gregorie <martin@address-in-sig.invalid> wrote:
> On Tue, 27 Nov 2012 19:51:37 +0100, Peter J. Holzer wrote:
>> The problem is that this just doesn't fit into the Unix system call
>> scheme. There is no "copy" system call. The kernel just sees that a
>> process opens one file for reading and another file for writing. It
>> cannot assume that this process wants to copy the metadata from the
>> first to the second file.
>>
> Of course.
>
>> Of course Linux could introduce such a system
>> call, but then those umpteen utility programs and libraries would still
>> have to be modified to use that new system call.
>>
> I can see two ways of handling it:
>
> (1) introduce a pair of system calls to retrieve and store the metadata
> associated with a file,
There are of course already system calls to do that (how else would you
get at the data?). There are four of them (list, get, set, remove),
however, not two, so ...
> and, yes, programs would need modification, but the amount would be
> trivial because you'd be looking at one extra line of code per file
> involved in the metadata transfer.
... it's 3 extra lines, not 1. Not including error handling, of course.
But I don't think that's the problem. The problem is that a) you have to
do it and b) you have to think about how to do it. Plus there is no
consensus that it should be done at all (user_xattr isn't even enabled
by default on ext*). Microsoft and Apple have it easier: If they say
that some information has to be stored in an alternate stream/resource
fork, programmers will do it. Linux has no central authority which can
force programmers to do anything.
> (2) alternatively it may be possible to do the job by adding a mode or
> two to the file opening operations.
You mean an optional 4th parameter to open(2)?
> If they were defaulted appropriately, many programs could silently
> copy the metadata along with the data
I still don't see how that could work. That implies that the kernel
somehow guesses that you want to use the metadata from some file you
opened for reading for the file you are just opening for writing. While
that would be the right behaviour for "cp" or similar programs, I doubt
it would be right for the majority of programs.
It also raises the question of what the kernel should do if the process
doesn't have the necessary privileges to set some xattrs (or if the file
system doesn't support them). Fail? Silently drop them? I don't think
the kernel should make that decision. It's up to the application to
decide what's sensible ("mechanism, not policy" was a guiding principle
in the design of the Unix system call interface).
> and/or automagically apply the appropriate transforms, such as charset
> transforms, during the transfer.
That again makes no sense at the unix system call interface which deals
only with byte streams.
It does however make a lot of sense for higher level interfaces. So
it might be a good idea for java.io.FileReader to check the user.charset
xattr of the file and apply the appropriate encoding.
> Thinking about it a little more, (2) is definitely the best solution
> because it would be rather useful to be able to default the metadata
> applied to a new file with a similar mechanism to that used for the
> permission bits.
umask(2) is actually pretty broken IMHO.
hp
--
_ | Peter J. Holzer | Curse of electronic word processing:
|_|_) | Sysadmin WSR | you keep polishing at your text until
| | | hjp@hjp.at | the parts of the sentence no longer
__/ | http://www.hjp.at/ | fits together. -- Ralph Babel
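The suggestion in this post that a higher-level interface such as java.io.FileReader could consult the user.charset xattr can be approximated in user code today. The following is a small sketch, assuming Java 7's `UserDefinedFileAttributeView` and assuming that the attribute name `charset` maps to `user.charset` on Linux. `XattrAwareReader` and its fallback parameter are hypothetical names, and nothing like this is built into `FileReader` itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;

/** Hypothetical helper: open a text file using the charset named in its xattr. */
final class XattrAwareReader {

    /** Returns the charset from the "charset" attribute, or the fallback if none is usable. */
    static Charset charsetOf(Path file, Charset fallback) throws IOException {
        if (!Files.getFileStore(file).supportsFileAttributeView(UserDefinedFileAttributeView.class)) {
            return fallback;
        }
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
        if (view == null || !view.list().contains("charset")) {
            return fallback;
        }
        ByteBuffer buffer = ByteBuffer.allocate(view.size("charset"));
        view.read("charset", buffer);
        buffer.flip();
        String name = StandardCharsets.US_ASCII.decode(buffer).toString().trim();
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException ex) {
            return fallback; // illegal or unsupported charset name in the label
        }
    }

    /** Opens a reader that decodes with the labelled charset when one is present. */
    static BufferedReader open(Path file, Charset fallback) throws IOException {
        return Files.newBufferedReader(file, charsetOf(file, fallback));
    }
}
```

Keeping the decision in a library function follows the "mechanism, not policy" point made above: the kernel still only sees bytes, and the application chooses what to do when the label is missing or names an unknown encoding.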
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2012-12-02 19:36 +0000 |
| Message-ID | <k9gajc$t6t$1@localhost.localdomain> |
| In reply to | #20039 |
On Sun, 02 Dec 2012 13:02:27 +0100, Peter J. Holzer wrote:
> On 2012-11-29 02:22, Martin Gregorie <martin@address-in-sig.invalid>
> wrote:
>> (2) alternatively it may be possible to do the job by adding a mode or
>> two to the file opening operations.
>
> You mean an optional 4th parameter to open(2)?
>
No, what I said - an extra mode or two. If you didn't want the defaults
you'd OR them with the other modes.
> I still don't see how that could work. That implies that the kernel
> somehow guesses that you want to use the metadata from some file you
> opened for reading for the file you are just opening for writing. While
> that would be the right behaviour for "cp" or similar programs, I doubt
> it would be right for the majority of programs.
>
It obviously wouldn't apply if the other file was stdin/stdout/stderr
and, in fact, many (most) programs that have a file open for reading and
another for writing would probably want to copy the metadata unless it
was a compiler or something else that applies major transformations to
the data it's handling: in these cases you'd expect to specify the metadata
explicitly or to use an OS-predefined metadata set.
> It also raises the question of what the kernel should do if the process
> doesn't have the necessary privileges to set some xattrs (or if the file
> system doesn't support them). Fail?
>
Why would that be treated any differently to access privileges? If the
requested combination of attributes is nonsensical (e.g. trying to
write a binary stream to a file of keyed records) or violates an
OS-defined rule, the file simply wouldn't open.
> That again makes no sense at the unix system call interface which deals
> only with byte streams.
>
But, by definition, if you were using metadata to control the character
encoding (which is where this discussion started) or to define the file
as containing keyed, fixed field records, you would not be trying to
write a byte stream. If you tried something like that I'd expect that
either you'd get a compile time exception or for the file management
subsystem to throw an error at runtime. The compile-time error would be
preferable and is more or less what Java does.
Equally, if you were just diddling with the character encoding, that
should just work unless you were attempting to use an unsupported or
nonsensical conversion. For instance:
- ASCII to one of the Windows code pages would leave 0x00 to 0x7f
unchanged (though the high order bits may need to be modified) and
simply change the metadata to tell consumers of the file what
encoding to use.
- ASCII->EBCDIC and EBCDIC->ASCII would have to recode every byte,
except that there are some characters ('{' and '}') which, IIRC, are not
part of the EBCDIC character set in at least some dialects.
- some transforms would be one way: ASCII to utf-8 is ok, but IIRC the
reverse would fail and ISO 6 bit or Baudot to anything else should work
but the reverse is probably not possible.
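The one-way nature of some of these transforms can be checked from Java with strict encoders and decoders. This is a minimal sketch, assuming the standard `java.nio.charset` API with `CodingErrorAction.REPORT` so that unmappable characters raise an error instead of being silently replaced. The class name `Transcoder` is hypothetical, and the sketch says nothing about EBCDIC or Baudot, which may not be available in every JRE.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper: strict charset conversion that fails instead of guessing. */
final class Transcoder {

    /** Converts bytes from one charset to another, rejecting lossy conversions. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        try {
            // Decode strictly: malformed input is an error, not a '?' replacement.
            CharBuffer chars = from.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            // Encode strictly: characters missing from the target set are an error.
            ByteBuffer encoded = to.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            byte[] output = new byte[encoded.remaining()];
            encoded.get(output);
            return output;
        } catch (CharacterCodingException ex) {
            throw new IllegalArgumentException("cannot convert " + from + " to " + to, ex);
        }
    }
}
```

With this sketch, ASCII to UTF-8 always succeeds and leaves the bytes unchanged. UTF-8 to ASCII succeeds only while the text happens to contain nothing but ASCII characters and throws as soon as it meets something like an accented letter, so the reverse direction fails for some inputs rather than for all of them.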
>> Thinking about it a little more, (2) is definitely the best solution
>> because it would be rather useful to be able to default the metadata
>> applied to a new file with a similar mechanism to that used for the
>> permission bits.
>
> umask(2) is actually pretty broken IMHO.
>
IME it has few surprises unless you're moving files between users with
different umasks.
I don't know if you've used OSen that support the sort of extreme metadata
I'm talking about. I have and it can be rather convenient. Here's a
couple of nice examples:
- use the metadata to set the backup frequency for a file, the number
of generations of the backup to be kept, and the number of parallel
backups to be done.
- (for a print file) use metadata to specify the printer capabilities
needed to print the file and the type of paper required. This could be
used by the program to match its output to the available paper size
(think A4 vs US Letter) as well as making sure that the output is
sent to a printer with the right paper and capabilities to output it.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-12-02 23:52 +0100 |
| Message-ID | <slrnkbnmua.24m.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #20044 |
On 2012-12-02 19:36, Martin Gregorie <martin@address-in-sig.invalid> wrote:
> On Sun, 02 Dec 2012 13:02:27 +0100, Peter J. Holzer wrote:
>> That again makes no sense at the unix system call interface which deals
>> only with byte streams.
>>
> But, by definition, if you were using metadata to control the character
> encoding (which is where this discussion started) or to define the file
> as containing keyed, fixed field records, you would not be trying to
> write a byte stream.
We were obviously talking past each other. I was only talking about
mechanisms like xattr, alternate streams or resource forks, not about
revamping the whole unix file model.
hp
--
   _  | Peter J. Holzer    | The curse of electronic word processing:
|_|_) | Sysadmin WSR       | you keep polishing your text until the
| |   | hjp@hjp.at         | parts of the sentence no longer fit
__/   | http://www.hjp.at/ | together. -- Ralph Babel
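For the xattr-style mechanism mentioned above, Java 7 exposes extended attributes through UserDefinedFileAttributeView. The following is a rough sketch only: the attribute name "charset" and the class EncodingTag are assumptions made for illustration, not an established convention from this thread, and the calls fail on file systems that do not support user-defined attributes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;
import java.util.Optional;

final class EncodingTag {
    // Hypothetical attribute name chosen for this sketch.
    static final String ATTR = "charset";

    private static UserDefinedFileAttributeView view(Path file) {
        // Returns null when the file system has no user-defined attributes.
        return Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
    }

    /** Records the charset name as out-of-band metadata; the file body is untouched. */
    static void tag(Path file, Charset cs) throws IOException {
        UserDefinedFileAttributeView v = view(file);
        if (v == null) {
            throw new UnsupportedOperationException("no user-defined attributes here");
        }
        v.write(ATTR, StandardCharsets.US_ASCII.encode(cs.name()));
    }

    /** Reads the tag back, or returns empty if the file carries none. */
    static Optional<Charset> read(Path file) throws IOException {
        UserDefinedFileAttributeView v = view(file);
        if (v == null || !v.list().contains(ATTR)) {
            return Optional.empty();
        }
        ByteBuffer buf = ByteBuffer.allocate(v.size(ATTR));
        v.read(ATTR, buf);
        buf.flip();
        return Optional.of(Charset.forName(StandardCharsets.US_ASCII.decode(buf).toString()));
    }
}
```

This illustrates the point made in the post: the tag lives beside the byte stream and every program has to opt in to reading it. Nothing at the system call level applies it automatically, and tools that copy only the file body will lose it.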
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2012-12-02 23:08 +0000 |
| Message-ID | <k9gn1q$1e9$1@localhost.localdomain> |
| In reply to | #20047 |
On Sun, 02 Dec 2012 23:52:58 +0100, Peter J. Holzer wrote:
> We were obviously talking past each other. I was only talking about
> mechanisms like xattr, alternate streams or resource forks, not about
> revamping the whole unix file model.
>
I think that's likely.
I've obviously not seen the guts of OS X resource forks, but I doubt
their implementation differs a lot from what I was talking about: if they
are not part of the OS's file handling system[*] they'd require all that
messy stuff to be implemented in system programs and user-level libraries
that you've described.
I'm not advocating that Linux becomes OS/400 lite, just pointing out that
metadata can be used in many ways, and that once you introduce the
mechanism to transparently handle one attribute, such as character
encodings, that there's quite a lot more that it could be used for.
[*] I'm deliberately not saying Kernel because a lot of file handling
stuff has already moved out of the kernel. A certain resemblance to
Mach is creeping into Linux, though so far it is not nearly so
fine-grained.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
| From | Sven Köhler <remove-sven.koehler@gmail.com> |
|---|---|
| Date | 2012-11-25 13:13 +0100 |
| Message-ID | <ahega5F7sv0U1@mid.dfncis.de> |
| In reply to | #19859 |
On 23.11.2012 02:25, Arne Vajhøj wrote:
> It is a bad idea to have meta data in the file body. This meta data
> should be where the rest of meta data are.
Now which OS actually supports this idea?
Are you saying that XML is bad, because it contains metadata (i.e. the
encoding/charset) inside the file body?
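For what it is worth, the in-body XML label is easy to get at from Java. A small sketch, assuming the StAX parser bundled with the JDK; the class name XmlDeclaredEncoding and its helper are invented for the example. XMLStreamReader.getCharacterEncodingScheme() reports the encoding named in the XML declaration, or null if the document declares none.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XmlDeclaredEncoding {

    /** Returns the encoding named in the XML declaration, or null if there is none. */
    static String declaredEncoding(byte[] xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            return reader.getCharacterEncodingScheme();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><r/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(doc));
    }
}
```

The parser is handed raw bytes, not a Reader, so it has to work out the encoding from the document itself. That is the in-band approach being questioned here.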
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2012-11-25 18:07 +0000 |
| Message-ID | <k8tmpd$6jr$4@localhost.localdomain> |
| In reply to | #19930 |
On Sun, 25 Nov 2012 13:13:25 +0100, Sven Köhler wrote:
> Am 23.11.2012 02:25, schrieb Arne Vajhøj:
>> It is a bad idea to have meta data in the file body. This meta data
>> should be where the rest of meta data are.
>
> Now which OS actually supports this idea?
>
IBM's OS/400, the late lamented ICL VME/B and (probably) Apple's OS X and
iOS
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
| From | Jan Burse <janburse@fastmail.fm> |
|---|---|
| Date | 2012-11-23 16:33 +0100 |
| Message-ID | <k8o50f$1q6$1@news.albasani.net> |
| In reply to | #19856 |
Hi,
If your files are HTML, then you can note the encoding
in the header, via a meta tag:
    <html>
    <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    </head>
    <body>
    </body>
    </html>
http://de.wikipedia.org/wiki/Meta-Element#.C3.84quivalente_zu_HTTP-Kopfdaten
If your files are XML, then you can note the encoding
in the XML declaration:
    <?xml version="1.0" encoding="ISO-8859-1"?>
http://de.wikipedia.org/wiki/XML-Deklaration
If your file is plain text, you can insert a BOM, which
allows a couple of encodings to be detected automatically.
Skip the BOM when reading. The BOM is:
    \uFEFF
http://de.wikipedia.org/wiki/Byte_Order_Mark
Would this not cover your requirements?
Bye
Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.
>
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.
>
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF8 file gets
> written automatically.
>
> You write your app telling it to read and write this new encoding e.g.
> "labeled".
>
> You can write a utility to import files into your labelled universe by
> detecting or guessing or being told the encoding. It gets a header.
> Other than that the file is unmodified.
>
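The quoted "labeled" proposal is small enough to sketch in Java. This is one possible reading of it, not Roedy's implementation: the class LabeledFile, its method names and the 64-byte cap on the label are assumptions made for the example. The header is a 0 byte, the ASCII name of the real encoding, and another 0 byte. Writing always produces UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LabeledFile {
    private static final int MAX_LABEL = 64; // arbitrary cap chosen for this sketch

    /** Writes the header for UTF-8, then returns a Writer for the body. */
    static Writer openWriter(OutputStream out) throws IOException {
        out.write(0);
        out.write("UTF-8".getBytes(StandardCharsets.US_ASCII));
        out.write(0);
        return new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }

    /** Consumes the header and returns a Reader that decodes the rest accordingly. */
    static Reader openReader(InputStream in) throws IOException {
        if (in.read() != 0) {
            throw new IOException("missing leading 0 byte");
        }
        ByteArrayOutputStream name = new ByteArrayOutputStream();
        int b;
        // Collect label bytes until the terminating 0 byte or end of stream.
        while ((b = in.read()) > 0 && name.size() < MAX_LABEL) {
            name.write(b);
        }
        if (b != 0) {
            throw new IOException("unterminated or oversized encoding label");
        }
        Charset cs = Charset.forName(name.toString("US-ASCII"));
        return new InputStreamReader(in, cs);
    }
}
```

The application only ever sees the Reader or Writer, so the header stays invisible as the proposal asks. Any program that does not know the convention sees a file starting with a NUL byte, which is the usual objection to in-band labels.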
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-11-23 09:02 -0800 |
| Message-ID | <9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com> |
| In reply to | #19865 |
On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
wrote, quoted or indirectly quoted someone who said :
>
>Would this not cover your requirements?
The problem is primarily raw text files with no indication of the
encoding.
The HTML encoding is incompetent. You can't read it without knowing
the encoding. It is just a confirmation. Thankfully the encoding comes
in the HTTP header -- a case where meta information is available.
I feel angry about this. What asshole dreamed up the idea of
exchanging files in various encodings without any labelling of the
encoding? That there is no universal way of identifying the format of
a file is astounding. Parents who thought this way would send their
kids out into the world not knowing their names, addresses, or
genders.
It sounds like something one of those people who live on beer and
pizza, with a roomful of old pizza boxes lying around would have come
up with. I wish Martha Stewart had gone into programming.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
| From | Jan Burse <janburse@fastmail.fm> |
|---|---|
| Date | 2012-11-23 19:21 +0100 |
| Message-ID | <k8oers$p98$1@news.albasani.net> |
| In reply to | #19867 |
Roedy Green wrote:
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.
For example, when you edit an HTML file locally, you don't
have this HTTP header information. Also, where does the HTTP
header get the charset information in the first place?
Scenario 1:
- HTTP returns only mimetype=text/html without
the charset option.
- The browser then reads the HTML doc meta tag, and
adjusts the charset.
Scenario 2:
- HTTP returns mimetype=text/html; charset=<encoding>
fetched from the HTML file meta tag.
- The browser does not read the HTML doc meta tag, and
follows the charset found in the mimetype.
In both scenarios 1 + 2, the meta tag is used. I don't
know whether there is a scenario 3, and if there is,
where would it take the encoding from?
Bye
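The two scenarios can be sketched in Java roughly as follows. This is an illustration under assumptions, not how any particular browser works: the class HtmlCharset, the 1024-byte scan window and the simple regular expression are all invented for the example. The scan relies on the meta tag being written in ASCII-compatible bytes, so it would not find a label in UTF-16 text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlCharset {
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Extracts a usable charset from "...; charset=X", or returns null. */
    static Charset fromParam(String s) {
        if (s == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(s);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) { // malformed or unsupported name
            return null;
        }
    }

    /** Scenario 2 first (HTTP header), then scenario 1 (meta tag), then a fallback. */
    static Charset choose(String contentType, byte[] body, Charset fallback) {
        Charset cs = fromParam(contentType);
        if (cs != null) {
            return cs;
        }
        // ISO-8859-1 maps every byte to a character, so this decode cannot fail;
        // it is only used to look for the ASCII text of the meta tag.
        String head = new String(body, 0, Math.min(body.length, 1024),
                StandardCharsets.ISO_8859_1);
        int meta = head.toLowerCase(Locale.ROOT).indexOf("<meta");
        cs = meta < 0 ? null : fromParam(head.substring(meta));
        return cs != null ? cs : fallback;
    }
}
```

The fallback parameter is where a scenario 3 would have to plug in: a BOM check, a user setting or a guess. For a locally edited file there is no header, so contentType is simply null and the meta tag decides.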