
A proposal to handle file encodings

Started by	Roedy Green <see_website@mindprod.com.invalid>
First post	2012-11-22 13:36 -0800
Last post	2012-11-26 02:46 +0000
Articles	20 on this page of 39 — 10 participants


  A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 13:36 -0800
    Re: A proposal to handle file encodings Joerg Meier <joergmmeier@arcor.de> - 2012-11-22 23:36 +0100
    Re: A proposal to handle file encodings markspace <-@.> - 2012-11-22 17:20 -0800
    Re: A proposal to handle file encodings Arne Vajhøj <arne@vajhoej.dk> - 2012-11-22 20:25 -0500
      Re: A proposal to handle file encodings markspace <-@.> - 2012-11-22 19:47 -0800
        Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 21:28 -0800
          Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-24 15:51 +0000
            Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 10:18 +0100
              Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 18:05 +0000
                Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-27 19:51 +0100
                  Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-29 02:22 +0000
                    Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-12-02 13:02 +0100
                      Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-12-02 19:36 +0000
                        Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-12-02 23:52 +0100
                          Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-12-02 23:08 +0000
      Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 13:13 +0100
        Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 18:07 +0000
    Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-23 16:33 +0100
      Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 09:02 -0800
        Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-23 19:21 +0100
          Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:11 +0100
            Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-24 00:53 +0100
              Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 09:13 +0100
              Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 06:50 -0800
                Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 10:07 +0100
                  Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 11:06 -0600
                    Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-27 19:28 +0100
            Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 06:42 -0800
              Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 09:57 +0100
            Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 15:09 +0100
          Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 15:06 +0100
        Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-23 16:43 -0600
          Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-24 01:02 +0100
        Re: A proposal to handle file encodings BGB <cr88192@hotmail.com> - 2012-11-25 14:36 -0600
          Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 16:51 -0600
            Re: A proposal to handle file encodings BGB <cr88192@hotmail.com> - 2012-11-25 17:54 -0600
            Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-26 02:03 +0100
              Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-26 02:20 +0100
                Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-26 02:46 +0000


#19856 — A proposal to handle file encodings

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2012-11-22 13:36 -0800
Subject	A proposal to handle file encodings
Message-ID	<lb6ta81u9imfdtlpuesoc8slncju0ehsnm@4ax.com>

The problem with encodings is they are not attached in any way or
embedded in any way in a file. You are just supposed to know how a
file is encoded.

Here is my idea to solve the problem.

We invent a new encoding.  

Files in this encoding begin with a 0 byte, then an ASCII string
giving the name of a conventional encoding then another 0 byte.

When you read a file with this encoding, the header is invisible to
your application. When you write a file, a header for a UTF8 file gets
written automatically.

You write your app telling it to read and write this new encoding e.g.
"labeled".

You can write a utility to import files into your labelled universe by
detecting or guessing or being told the encoding.  It gets a header.
Other than that the file is unmodified.
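
A minimal Java sketch of how the header described above could be written and consumed. The class and method names (`LabeledText`, `openReader`, `openWriter`) and the 64-byte limit on the name are assumptions made for illustration; the post only specifies the layout: a 0 byte, an ASCII encoding name, another 0 byte, then the unmodified content.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for the "labeled" layout: 0x00, ASCII charset name, 0x00, content. */
final class LabeledText {
    private static final int MAX_NAME_LENGTH = 64; // arbitrary sanity limit (assumption)

    /** Consumes the header and returns a Reader that decodes the rest with the named charset. */
    static Reader openReader(InputStream in) throws IOException {
        if (in.read() != 0) {
            throw new IOException("missing leading 0 byte: not a labeled file");
        }
        ByteArrayOutputStream name = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) > 0) {            // stops at the closing 0 byte or at EOF (-1)
            if (b > 0x7F || name.size() >= MAX_NAME_LENGTH) {
                throw new IOException("malformed encoding name in header");
            }
            name.write(b);
        }
        if (b < 0) {
            throw new IOException("header not terminated by a 0 byte");
        }
        // Charset.forName throws an unchecked exception for names this JVM does not know.
        Charset cs = Charset.forName(new String(name.toByteArray(), StandardCharsets.US_ASCII));
        return new InputStreamReader(in, cs);
    }

    /** Writes a UTF-8 header and returns a Writer for the content that follows it. */
    static Writer openWriter(OutputStream out) throws IOException {
        out.write(0);
        out.write("UTF-8".getBytes(StandardCharsets.US_ASCII));
        out.write(0);
        return new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }
}
```

The application only ever sees the Reader or Writer, so the header stays invisible to it. Any program that does not know the convention will see the header as leading garbage.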
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them.


#19857

From	Joerg Meier <joergmmeier@arcor.de>
Date	2012-11-22 23:36 +0100
Message-ID	<w7onpynnam8i$.9pkacjwybidu$.dlg@40tude.net>
In reply to	#19856

On Thu, 22 Nov 2012 13:36:16 -0800, Roedy Green wrote:

> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.

> Here is my idea to solve the problem.

> We invent a new encoding.  

> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.

> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF8 file gets
> written automatically.

> You write your app telling it to read and write this new encoding e.g.
> "labeled".

> You can write a utility to import files into your labelled universe by
> detecting or guessing or being told the encoding.  It gets a header.
> Other than that the file is unmodified.

I can't tell whether you are being serious or alluding to that old
"You have 25 standards" joke.

However, in case you are serious, this ugly and error prone hack idea
really belongs more with a language capable of realizing OS level/file
system black magic like that in a somewhat sensible way. Like C.

Liebe Gruesse,
		Joerg

-- 
Ich lese meine Emails nicht, replies to Email bleiben also leider
ungelesen.


#19858

From	markspace <-@.>
Date	2012-11-22 17:20 -0800
Message-ID	<k8mj0d$2m2$1@dont-email.me>
In reply to	#19856

On 11/22/2012 1:36 PM, Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.


http://xkcd.com/927/


#19859

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-22 20:25 -0500
Message-ID	<50aed080$0$292$14726298@news.sunsite.dk>
In reply to	#19856

On 11/22/2012 4:36 PM, Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.
>
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.
>
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF8 file gets
> written automatically.
>
> You write your app telling it to read and write this new encoding e.g.
> "labeled".

It is a bad idea to have meta data in the file body. This meta data
should be where the rest of the meta data is.

But even if it was moved to the file info area then I doubt
the idea is good.

It is enforcing a limitation that a text file will only have
one encoding; that limitation does not exist today.

There are practical problems:
* different systems support different encodings (sometimes
   same encoding has different name) - what should a system
   do with an unknown encoding
* there will be a huge number of legacy files without this meta
   data - what should a system do with those

And even if those problems were solved - would it really create
any benefits?

It would take many years to get such an approach approved and
widely implemented. Most likely >10 years. By that time I would
expect UTF-8 to be almost universally used for new text files,
making this proposal obsolete.

 > You can write a utility to import files into your labelled universe by
 > detecting or guessing or being told the encoding.

Which just repeats the existing problems.

 >                                                   It gets a header.
 > Other than that the file is unmodified.

Solved much easier by using meta data.

Arne


#19863

From	markspace <-@.>
Date	2012-11-22 19:47 -0800
Message-ID	<k8mrjv$5l6$1@dont-email.me>
In reply to	#19859

On 11/22/2012 5:25 PM, Arne Vajhøj wrote:
>
> Solved much easier by using meta data.

I think Roedy is talking about the physical encoding of the meta data. 
I personally agree with him in this regard:  meta data should be encoded 
into the physical file.

Consider for example a meta data format that we all use: the Jar file.

Each single Jar file is actually composed of many pieces of information. 
  Class files, resources, libraries, the manifest file, etc.  And yet 
it's all encoded into a single physical file.  You never lose pieces of 
the file just because you made a copy of the file.  You never have to 
worry about the meta data changing on a new system just because it's *new*.

Contrast that with other schemes.  Macintosh, I believe, uses a meta 
data format where the data is in one file, and the meta data occupies a 
second physical file with a name like .file-name.meta (I don't use Macs 
so I'm not 100% sure).  So if you use a raw copy command ("cp" from the 
Unix command line) you *don't* get the meta data, because you forgot to 
copy it.

I hope you can all quickly see how obviously broken that is.  Since we 
all use Jar files I think you can all reflect on the idea that it's a 
good solution.  Have you ever had a problem with a Jar file retaining 
its meta data?  Is it ever desirable to have a Jar file's meta data 
revert to nulls just because you FTP'ed the file someplace?  I've never 
desired that "feature".

It seems obvious to me.  Encoding the meta data into a single physical 
file is by far the better solution.

No, where I think Roedy goes wrong is to invent a *new* file format.  My 
solution: use what's there already, just use Jar files.

Proposal: Add a property "Data-Archive" like so:

Manifest-Version: 1.0
Data-Archive: /data

Where the value of the Data-Archive is the path to the primary data 
stream (within the Zip/Jar file).  You can just add an encoding or 
mime-type or any other property to the manifest you like to describe 
your data stream and you're set.
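
A minimal Java sketch of this proposal using `java.util.jar`. The `Data-Archive` attribute comes from the post; the `Data-Encoding` attribute name, the `DataArchive` class and the entry name `data` are assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

/** Hypothetical single-stream archive: a manifest plus one data entry. */
final class DataArchive {
    static final Attributes.Name DATA_ARCHIVE = new Attributes.Name("Data-Archive");
    // "Data-Encoding" is an invented attribute name, not part of any standard.
    static final Attributes.Name DATA_ENCODING = new Attributes.Name("Data-Encoding");

    /** Packs text into an in-memory Jar whose manifest records the data path and charset. */
    static byte[] pack(String text, Charset cs) throws IOException {
        Manifest mf = new Manifest();
        Attributes main = mf.getMainAttributes();
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0"); // required, or nothing is written
        main.put(DATA_ARCHIVE, "/data");
        main.put(DATA_ENCODING, cs.name());
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes, mf)) {
            jar.putNextEntry(new JarEntry("data"));
            jar.write(text.getBytes(cs));
            jar.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Reads the manifest, locates the data entry and decodes it with the recorded charset. */
    static String unpack(byte[] archive) throws IOException {
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(archive))) {
            Manifest mf = jar.getManifest();
            if (mf == null) {
                throw new IOException("archive has no manifest");
            }
            String path = mf.getMainAttributes().getValue(DATA_ARCHIVE);
            String enc = mf.getMainAttributes().getValue(DATA_ENCODING);
            if (path == null || enc == null) {
                throw new IOException("manifest lacks Data-Archive or Data-Encoding");
            }
            // Zip entry names carry no leading slash, so strip it from the manifest value.
            String entryName = path.startsWith("/") ? path.substring(1) : path;
            JarEntry entry;
            while ((entry = jar.getNextJarEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = jar.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return new String(buf.toByteArray(), Charset.forName(enc));
                }
            }
            throw new IOException("data entry not found: " + path);
        }
    }
}
```

Because the manifest travels inside the archive, copying or FTP'ing the file keeps the encoding information with the data.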

Note that this is already being done.  Open Office uses Jar files as its 
native file format.  They just rename the extension as they wish, and 
open the file appropriately for a Jar file.  They also store a lot more 
meta data than just a couple of properties, so they effectively have 
their own format, not this simple one.

It might be useful to try to solve some common cases for data and 
meta-data.  What I've got here is a single data stream and a single 
"type" property.  It wouldn't be hard to extend this to several streams 
and several properties each.  I think that would be the only other 
useful general case; after that you should just roll your own solution.

BTW if anyone is copying this up to their website (mindprod), please 
credit appropriately: Brenden Towey.


#19864

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2012-11-22 21:28 -0800
Message-ID	<jd1ua8h786rv5qrm2ejtt5kge0jeh0c7kr@4ax.com>
In reply to	#19863

On Thu, 22 Nov 2012 19:47:09 -0800, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>Each single Jar file is actually composed of many pieces of information. 
>  Class files, resources, libraries, the manifest file, etc.  And yet 
>it's all encoded into a single physical file.  You never lose pieces of 
>the file just because you made a copy of the file.  You never have to 
>worry about the meta data changing on a new system just because it's *new*.

Yes, yes! The OS people have proved incompetent at keeping metadata
separately from the file. We need formats where the metadata is part
of the file.  With text files the most important piece of metadata is
the encoding. We do it sometimes: jpg, jar, csv (sometimes), video
files.

More generally the mime type is something you should be able to get
with File.getMime()

Imagine if you could do:

File.getEncoding()
File.getVersion()
File.getCopyrightOwner()
File.getCopyrightDate()

A meta data-compliant file would look just like any other but with a
header of the form
0 <meta>...</meta> 0

The meta data could be stored as XML. That gives you ability to add
extra info without having to change the standard.

The header is in 7-bit ASCII.


We should be using somewhat more complicated formats for files with
embedded metadata.

As an application programmer you want to be able to have the system
parse it for you. You get to pretend it is not there, but with the
ability to query it.

This reminds me a bit of the innovation of  ANSI labelled mag tapes
back in the 60s.

The dBase people got this right long ago.  You don't go writing files
without a header describing the format of what was in the file.
 

-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them.


#19896

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2012-11-24 15:51 +0000
Message-ID	<k8qqev$ehi$1@localhost.localdomain>
In reply to	#19864

On Thu, 22 Nov 2012 21:28:55 -0800, Roedy Green wrote:

> 
> The dBase people got this right long ago.  You don't go writing files
> without a header describing the format of what was in the file.
>
IBM got it pretty much right in the OS/400 operating system. The metadata, 
which is held in the filing system catalogue, is transparently and 
permanently associated with the file. It's a general mechanism: the system 
provides standard metadata for source files, executables etc. and the 
developer creates the metadata for, e.g. fixed field data files with 
keyed access. The only demerit is that it uses a rather ugly two level 
filing system. 

The UNIX/Linux equivalent would be to keep the meta-data in the file's 
inode alongside the access permissions and to modify the file copy and 
move operations to silently handle the metadata by duplicating that part 
of the inode as and when needed.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |


#19924

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-11-25 10:18 +0100
Message-ID	<slrnkb3ojp.qr8.hjp-usenet2@hrunkner.hjp.at>
In reply to	#19896

On 2012-11-24 15:51, Martin Gregorie <martin@address-in-sig.invalid> wrote:
> IBM got it pretty much right in the OS/400 operating system. The metadata, 
> which is held in the filing system catalogue, is transparently and 
> permanently associated with the file. It's a general mechanism: the system 
> provides standard metadata for source files, executables etc. and the 
> developer creates the metadata for, e.g. fixed field data files with 
> keyed access. The only demerit is that it uses a rather ugly two level 
> filing system. 
>
> The UNIX/Linux equivalent would be to keep the meta-data in the file's 
> inode alongside the access permissions

File attributes have existed on ext* filesystems for a very long time. 

> and to modify the file copy and move operations

There is no file copy operation on the OS level. The kernel just sees
that a process is creating and writing a new file. It doesn't know
whether this process intends this new file to be an identical copy of
some other file.

rename(2) of course preserves file attributes, because it doesn't change
the file at all (except the ctime entry), only the directories linking
to it.

cp, rsync, tar, etc. have options to copy the attributes along with
the "normal" content. But the problem is that there are a lot of
utilities working on files and they would all have to be modified. 
And worse, there isn't any standard for using those attributes, so
nobody uses them, so there is little incentive to modify them.

	hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


#19946

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2012-11-25 18:05 +0000
Message-ID	<k8tmlj$6jr$3@localhost.localdomain>
In reply to	#19924

On Sun, 25 Nov 2012 10:18:49 +0100, Peter J. Holzer wrote:

> 
> File attributes have existed on ext* filesystems for a very long time.
>
Yes, but only pretty basic ones. Here we're talking about hypothetically 
storing stuff like character encoding or, as I suggested, the record and 
key definitions for an indexed sequential file or a DBMS table.

As I said, I know OSen that do this type of thing: it works well and even 
supports things like letting compilers access the metadata. This lets 
things like special C preprocessors generate #includes from it, COBOL 
COPY statements access it directly and ODBC/JDBC drivers use it at 
runtime. 
  
> There is no file copy operation on the OS level. The kernel just sees
> that a process is creating and writing a new file. It doesn't know
> whether this process intends this new file to be an identical copy of
> some other file.
>
Of course, but if the metadata is external to the file as it is in the 
'other fork' in an Apple filing system, you still have to make sure that 
cp, mv and friends have all been rewritten to handle that. You may well 
find that it's easier to pull metadata management into the kernel because 
then you've only got one piece of code to maintain rather than tweaks in 
umpteen utility programs and libraries.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |


#19998

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-11-27 19:51 +0100
Message-ID	<slrnkba2tp.k8a.hjp-usenet2@hrunkner.hjp.at>
In reply to	#19946

On 2012-11-25 18:05, Martin Gregorie <martin@address-in-sig.invalid> wrote:
> On Sun, 25 Nov 2012 10:18:49 +0100, Peter J. Holzer wrote:
>> File attributes have existed on ext* filesystems for a very long time.
>>
> Yes, but only pretty basic ones.

They are arbitrary key/value pairs. You can put any information there,
there is no restriction to "basic" information (whatever that might be).
They are limited to a single block (typically 4kB), though, so MIME
type, character set, keywords, etc. are ok, but a thumbnail image might
be problematic.

> Here we're talking about hypothetically storing stuff like character
> encoding

This one is even somewhat standardized: user.charset is documented on
http://www.freedesktop.org/wiki/CommonExtendedAttributes which probably
means that some GUI programs are actually using it (besides the Apache
module where it originated).

To return to the topic of this group: Is there a Java library for
setting and retrieving xattrs?
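
For what it's worth, the standard library since Java 7 offers `java.nio.file.attribute.UserDefinedFileAttributeView` for this. The sketch below stores and retrieves a charset name with it. The class name `CharsetXattr` is invented for illustration, and the sketch assumes that on Linux the view maps the attribute name `charset` to the xattr `user.charset`, which is how the JDK behaves as far as I know. Both methods report failure instead of throwing when the file system has no xattr support.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;

/** Hypothetical helper that keeps a charset name in a user-defined file attribute. */
final class CharsetXattr {
    static final String ATTR = "charset"; // assumed to appear as user.charset on Linux

    /** Returns true if the attribute was stored, false if the file system refused. */
    static boolean write(Path file, String charsetName) {
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
        if (view == null) {
            return false; // view not available for this file system
        }
        try {
            view.write(ATTR, StandardCharsets.US_ASCII.encode(charsetName));
            return true;
        } catch (IOException | UnsupportedOperationException e) {
            return false; // e.g. xattrs disabled on the mount
        }
    }

    /** Returns the stored charset name, or null if none is set or xattrs are unsupported. */
    static String read(Path file) {
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
        if (view == null) {
            return null;
        }
        try {
            if (!view.list().contains(ATTR)) {
                return null;
            }
            ByteBuffer buf = ByteBuffer.allocate(view.size(ATTR));
            view.read(ATTR, buf);
            buf.flip();
            return StandardCharsets.US_ASCII.decode(buf).toString();
        } catch (IOException | UnsupportedOperationException e) {
            return null;
        }
    }
}
```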

>> There is no file copy operation on the OS level. The kernel just sees
>> that a process is creating and writing a new file. It doesn't know
>> whether this process intends this new file to be an identical copy of
>> some other file.
>>
> Of course, but if the metadata is external to the file as it is in the 
> 'other fork' in an Apple filing system, you still have to make sure that 
> cp, mv and friends have all been rewritten to handle that.

Why "but"? That's exactly what I wrote. The kernel doesn't know what
a process is intending to do with a file, therefore programs like cp,
tar, etc. must be rewritten to handle xattrs explicitly. (And many of
them have been rewritten, of course. Xattrs aren't new)

> You may well find that it's easier to pull metadata management into the
> kernel because then you've only got one piece of code to maintain
> rather than tweaks in umpteen utility programs and libraries.

The problem is that this just doesn't fit into the Unix system call
scheme. There is no "copy" system call. The kernel just sees that a
process opens one file for reading and another file for writing. It
cannot assume that this process wants to copy the metadata from the
first to the second file. Of course Linux could introduce such a system
call, but then those umpteen utility programs and libraries would still
have to be modified to use that new system call.

	hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


#20015

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2012-11-29 02:22 +0000
Message-ID	<k96gtp$eia$1@localhost.localdomain>
In reply to	#19998

On Tue, 27 Nov 2012 19:51:37 +0100, Peter J. Holzer wrote:

> The problem is that this just doesn't fit into the Unix system call
> scheme. There is no "copy" system call. The kernel just sees that a
> process opens one file for reading and another file for writing. It
> cannot assume that this process wants to copy the metadata from the
> first to the second file.
>
Of course.

> Of course Linux could introduce such a system
> call, but then those umpteen utility programs and libraries would still
> have to be modified to use that new system call.
>
I can see two ways of handling it: 

(1) introduce a pair of system calls to retrieve and store the metadata 
associated with a file, and, yes, programs would need modification, but 
the amount would be trivial because you'd be looking at one extra line of 
code per file involved in the metadata transfer.

(2) alternatively it may be possible to do the job by adding a mode or two 
to the file opening operations. If they were defaulted appropriately, 
many programs could silently copy the metadata along with the data and/or 
automagically apply the appropriate transforms, such as charset 
transforms, during the transfer.

Thinking about it a little more, (2) is definitely the best solution 
because it would be rather useful to be able to default the metadata 
applied to a new file with a similar mechanism to that used for the 
permission bits. This sort of thing would be much easier to manage if it 
was built in to the filing system.

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |


#20039

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-12-02 13:02 +0100
Message-ID	<slrnkbmgqj.loc.hjp-usenet2@hrunkner.hjp.at>
In reply to	#20015

On 2012-11-29 02:22, Martin Gregorie <martin@address-in-sig.invalid> wrote:
> On Tue, 27 Nov 2012 19:51:37 +0100, Peter J. Holzer wrote:
>> The problem is that this just doesn't fit into the Unix system call
>> scheme. There is no "copy" system call. The kernel just sees that a
>> process opens one file for reading and another file for writing. It
>> cannot assume that this process wants to copy the metadata from the
>> first to the second file.
>>
> Of course.
>
>> Of course Linux could introduce such a system
>> call, but then those umpteen utility programs and libraries would still
>> have to be modified to use that new system call.
>>
> I can see two ways of handling it: 
>
> (1) introduce a pair of system calls to retrieve and store the metadata 
> associated with a file,

There are of course already system calls to do that (how else would you
get at the data?). There are four of them (list, get, set, remove),
however, not two, so ...

> and, yes, programs would need modification, but the amount would be
> trivial because you'd be looking at one extra line of code per file
> involved in the metadata transfer.

... it's 3 extra lines, not 1. Not including error handling, of course.
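
The list/get/set pattern can be sketched from Java through `UserDefinedFileAttributeView`, which wraps those calls. This is only an illustration of the three steps; the class and method names are invented, and the return value of -1 for "no xattr support" is an arbitrary convention of this sketch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;

/** Hypothetical helper that copies all user-defined attributes between two files. */
final class XattrCopy {
    /** Returns the number of attributes copied, or -1 if either file has no such view. */
    static int copyUserAttributes(Path from, Path to) throws IOException {
        UserDefinedFileAttributeView src =
                Files.getFileAttributeView(from, UserDefinedFileAttributeView.class);
        UserDefinedFileAttributeView dst =
                Files.getFileAttributeView(to, UserDefinedFileAttributeView.class);
        if (src == null || dst == null) {
            return -1;
        }
        int copied = 0;
        for (String name : src.list()) {                        // list
            ByteBuffer value = ByteBuffer.allocate(src.size(name));
            src.read(name, value);                              // get
            value.flip();
            dst.write(name, value);                             // set
            copied++;
        }
        return copied;
    }
}
```

Even in this small form the caller has to decide what an IOException halfway through should mean, which is the "you have to think about how to do it" part.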

But I don't think that's the problem. The problem is that a) you have to
do it and b) you have to think about how to do it. Plus there is no
consensus that it should be done at all (user_xattr isn't even enabled
by default on ext*). Microsoft and Apple have it easier: If they say
that some information has to be stored in an alternate stream/resource
fork, programmers will do it. Linux has no central authority which can
force programmers to do anything.

> (2) alternatively it may be possible to do the job by adding a mode or two 
> to the file opening operations.

You mean an optional 4th parameter to open(2)?

> If they were defaulted appropriately, many programs could silently
> copy the metadata along with the data

I still don't see how that could work. That implies that the kernel
somehow guesses that you want to use the metadata from some file you
opened for reading for the file you are just opening for writing. While
that would be the right behaviour for "cp" or similar programs, I doubt
it would be right for the majority of programs.

It also raises the question of what the kernel should do if the process
doesn't have the necessary privileges to set some xattrs (or if the file
system doesn't support them). Fail? Silently drop them? I don't think
the kernel should make that decision. It's up to the application to
decide what's sensible ("mechanism, not policy" was a guiding principle
in the design of the Unix system call interface).

> and/or automagically apply the appropriate transforms, such as charset
> transforms, during the transfer.

That again makes no sense at the unix system call interface which deals
only with byte streams. 

It does however make a lot of sense for higher level interfaces. So
it might be a good idea for java.io.FileReader to check the user.charset
xattr of the file and apply the appropriate encoding.
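
A rough sketch of what such a higher-level reader could do, written as a separate helper rather than a change to java.io.FileReader. `XattrAwareReader` and its methods are invented names, and the attribute name `charset` is assumed to correspond to user.charset on Linux. If the attribute is missing, unreadable or names an unknown charset, the caller's fallback is used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;

/** Hypothetical reader factory that honours a charset stored in a user-defined attribute. */
final class XattrAwareReader {
    /** Returns the charset named by the file's "charset" attribute, else the fallback. */
    static Charset charsetOf(Path file, Charset fallback) {
        try {
            UserDefinedFileAttributeView view =
                    Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
            if (view != null && view.list().contains("charset")) {
                ByteBuffer buf = ByteBuffer.allocate(view.size("charset"));
                view.read("charset", buf);
                buf.flip();
                String name = StandardCharsets.US_ASCII.decode(buf).toString().trim();
                return Charset.forName(name);
            }
        } catch (IOException | IllegalArgumentException | UnsupportedOperationException e) {
            // Unreadable attribute or unknown charset name: fall through to the fallback.
        }
        return fallback;
    }

    /** Opens the file for reading with the attribute's charset, or the fallback. */
    static BufferedReader open(Path file, Charset fallback) throws IOException {
        return Files.newBufferedReader(file, charsetOf(file, fallback));
    }
}
```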

> Thinking about it a little more, (2) is definitely the best solution 
> because it would be rather useful to be able to default the metadata 
> applied to a new file with a similar mechanism to that used for the 
> permission bits.

umask(2) is actually pretty broken IMHO.

	hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


#20044

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2012-12-02 19:36 +0000
Message-ID	<k9gajc$t6t$1@localhost.localdomain>
In reply to	#20039

On Sun, 02 Dec 2012 13:02:27 +0100, Peter J. Holzer wrote:

> On 2012-11-29 02:22, Martin Gregorie <martin@address-in-sig.invalid>
> wrote:
>> (2) alternatively it may be possible to do the job by adding a mode or
>> two to the file opening operations.
> 
> You mean an optional 4th parameter to open(2)?
>
No, what I said - an extra mode or two. If you didn't want the defaults 
you'd OR them with the other modes.
 
> I still don't see how that could work. That implies that the kernel
> somehow guesses that you want to use the metadata from some file you
> opened for reading for the file you are just opening for writing. While
> that would be the right behaviour for "cp" or similar programs, I doubt
> it would be right for the majority of programs.
>
It obviously wouldn't apply if the other file was stdin/stdout/stderr 
and, in fact many (most) programs that have a file open for reading and 
another for writing would probably want to copy the metadata unless it 
was a compiler or something else that applies major transformations to 
the data it's handling: in these cases you'd expect to specify the metadata 
explicitly or to use an OS predefined metadata set.

> It also raises the question of what the kernel should do if the process
> doesn't have the necessary privileges to set some xattrs (or if the file
> system doesn't support them). Fail?
>
Why would that be treated any differently to access privileges? If the 
requested combination of attributes is nonsensical (e.g. trying to 
write a binary stream to a file of keyed records) or violates an 
OS-defined rule, the file simply wouldn't open.

> That again makes no sense at the unix system call interface which deals
> only with byte streams.
>
But, by definition, if you were using metadata to control the character 
encoding (which is where this discussion started) or to define the file 
as containing keyed, fixed field records, you would not be trying to 
write a byte stream. If you tried something like that I'd expect either 
a compile-time error or the file management subsystem to throw an error 
at runtime. The compile-time error would be 
preferable and is more or less what Java does.

Equally, if you were just diddling with the character encoding, that 
should just work unless you were attempting to use an unsupported or non-
sensible conversion. For instance:

- ASCII to one of the Windows code pages would leave 0x00 to 0x7f
  unchanged (though the high order bits may need to be modified) and
  simply change the metadata to tell consumers of the file what
  encoding to use.

- ASCII->EBCDIC and EBCDIC->ASCII would have to recode every byte,
  except that there are some characters ('{' and '}') which, IIRC, are not
  part of the EBCDIC character set in at least some dialects.
  part of the EBCDIC character set in at least some dialects.

- some transforms would be one way: ASCII to utf-8 is ok, but IIRC the
  reverse would fail and ISO 6 bit or Baudot to anything else should work
  but the reverse is probably not possible.
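
Whether a particular conversion is lossless can be checked up front in Java. The sketch below uses `CharsetEncoder.canEncode` to ask whether a piece of text is representable in a target charset, and compares encoded bytes to show when a re-label without recoding would suffice. The class and method names are invented for illustration. Note that UTF-8 back to ASCII only fails for text that actually contains non-ASCII characters.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

/** Hypothetical checks for the conversion cases listed above. */
final class ConversionCheck {
    /** True if every character of the text can be encoded in the target charset. */
    static boolean representable(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    /** True if the text encodes to identical bytes in both charsets (re-labelling is enough). */
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }
}
```

For example, `representable("café", StandardCharsets.US_ASCII)` is false while the same call with `StandardCharsets.UTF_8` is true, and `sameBytes` holds for pure ASCII text between US-ASCII and UTF-8.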
 
>> Thinking about it a little more, (2) is definitely the best solution
>> because it would be rather useful to be able to default the metadata
>> applied to a new file with a similar mechanism to that used for the
>> permission bits.
> 
> umask(2) is actually pretty broken IMHO.
>
IME it has few surprises unless you're moving files between users with 
different umasks.
 

I don't know if you've used OSen that support the sort of extreme metadata 
I'm talking about. I have and it can be rather convenient. Here's a 
couple of nice examples: 

- use the metadata to set the backup frequency for a file, the number
  of generations of the backup to be kept, and the number of parallel
  backups to be done.

- (for a print file) use metadata to specify the printer capabilities
  needed to print the file and the type of paper required. This could be
  used by the program to match its output to the available paper size
  (think A4 vs US Letter) as well as making sure that the output is
  sent to a printer with the right paper and capabilities to output it.
 

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

[toc] | [prev] | [next] | [standalone]

#20047

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-12-02 23:52 +0100
Message-ID	<slrnkbnmua.24m.hjp-usenet2@hrunkner.hjp.at>
In reply to	#20044

On 2012-12-02 19:36, Martin Gregorie <martin@address-in-sig.invalid> wrote:
> On Sun, 02 Dec 2012 13:02:27 +0100, Peter J. Holzer wrote:
>> That again makes no sense at the unix system call interface which deals
>> only with byte streams.
>>
> But, by definition, if you were using metadata to control the character 
> encoding (which is where this discussion started) or to define the file 
> as containing keyed, fixed field records, you would not be trying to 
> write a byte stream.

We were obviously talking past each other. I was only talking about
mechanisms like xattr, alternate streams or resource forks, not about
revamping the whole unix file model.
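
For what the xattr approach might look like from Java, here is a minimal sketch using the standard UserDefinedFileAttributeView. The attribute name "charset" and the class name are assumptions made up for this example; nothing standardises them, and no other program will look for that attribute unless told to. Whether it works at all depends on the file system, so the sketch reports failure instead of throwing.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;
import java.util.Optional;

// Hypothetical helper: stores a charset name in a user-defined file attribute.
final class XattrEncodingSketch {
    // Assumed attribute name (shows up as "user.charset" on Linux).
    static final String ATTR = "charset";

    /** Returns true if the label was stored, false if the file system refused. */
    static boolean tag(Path file, String charsetName) {
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
        if (view == null) {
            return false; // view not available on this platform
        }
        try {
            view.write(ATTR, StandardCharsets.US_ASCII.encode(charsetName));
            return true;
        } catch (IOException | UnsupportedOperationException e) {
            return false; // e.g. file system without user xattr support
        }
    }

    /** Returns the stored label, or empty if there is none or it cannot be read. */
    static Optional<String> read(Path file) {
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
        if (view == null) {
            return Optional.empty();
        }
        try {
            if (!view.list().contains(ATTR)) {
                return Optional.empty();
            }
            ByteBuffer buf = ByteBuffer.allocate(view.size(ATTR));
            view.read(ATTR, buf);
            buf.flip();
            return Optional.of(StandardCharsets.US_ASCII.decode(buf).toString());
        } catch (IOException | UnsupportedOperationException e) {
            return Optional.empty();
        }
    }
}
```

Note that the label is only advisory: the read and write system calls still see a plain byte stream, and the attribute is easily lost when the file is copied by a tool that does not preserve xattrs.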

	hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel

[toc] | [prev] | [next] | [standalone]

#20049

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2012-12-02 23:08 +0000
Message-ID	<k9gn1q$1e9$1@localhost.localdomain>
In reply to	#20047

On Sun, 02 Dec 2012 23:52:58 +0100, Peter J. Holzer wrote:

> We were obviously talking past each other. I was only talking about
> mechanisms like xattr, alternate streams or resource forks, not about
> revamping the whole unix file model.
>
I think that's likely. 

I've obviously not seen the guts of OS X resource forks, but I doubt 
their implementation differs a lot from what I was talking about: if they 
are not part of the OS's file handling system[*] they'd require all that 
messy stuff you've described to be implemented in system programs and 
user-level libraries.

I'm not advocating that Linux become OS/400 lite, just pointing out that 
metadata can be used in many ways, and that once you introduce the 
mechanism to transparently handle one attribute, such as character 
encodings, there's quite a lot more that it could be used for.

[*] I'm deliberately not saying Kernel because a lot of file handling 
stuff has already moved out of the kernel. A certain resemblance to Mach 
is creeping into Linux, though so far it is not nearly so fine-grained.

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

[toc] | [prev] | [next] | [standalone]

#19930

From	Sven Köhler <remove-sven.koehler@gmail.com>
Date	2012-11-25 13:13 +0100
Message-ID	<ahega5F7sv0U1@mid.dfncis.de>
In reply to	#19859

On 23.11.2012 02:25, Arne Vajhøj wrote:
> It is a bad idea to have meta data in the file body. This meta data
> should be where the rest of meta data are.

Now which OS actually supports this idea?

Are you saying that XML is bad, because it contains metadata (i.e. the
encoding/charset) inside the file body?

[toc] | [prev] | [next] | [standalone]

#19947

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2012-11-25 18:07 +0000
Message-ID	<k8tmpd$6jr$4@localhost.localdomain>
In reply to	#19930

On Sun, 25 Nov 2012 13:13:25 +0100, Sven Köhler wrote:

> On 23.11.2012 02:25, Arne Vajhøj wrote:
>> It is a bad idea to have meta data in the file body. This meta data
>> should be where the rest of meta data are.
> 
> Now which OS actually supports this idea?
>
IBM's OS/400, the late lamented ICL VME/B and (probably) Apple's OS X and 
iOS.
 

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

[toc] | [prev] | [next] | [standalone]

#19865

From	Jan Burse <janburse@fastmail.fm>
Date	2012-11-23 16:33 +0100
Message-ID	<k8o50f$1q6$1@news.albasani.net>
In reply to	#19856

Hi,

If your files are HTML, then you can note the encoding in the
header, via a meta tag:

	<html>
	  <head>
	    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
	  </head>
	  <body>
	  </body>
	</html>
	http://de.wikipedia.org/wiki/Meta-Element#.C3.84quivalente_zu_HTTP-Kopfdaten

If your files are XML, then you can note the encoding in the
xml tag:

	<?xml version="1.0" encoding="ISO-8859-1"?>
	http://de.wikipedia.org/wiki/XML-Deklaration

If your file is plain text, you can insert a BOM, which allows a
couple of encodings to be detected automatically. Skip the BOM when
reading. The BOM is:

	\uFEFF
	http://de.wikipedia.org/wiki/Byte_Order_Mark
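
A minimal Java sketch of BOM sniffing, since the JDK's UTF-8 decoder does not strip the BOM for you. BomSniffer and its methods are names invented for this example. It only recognises the UTF-8 and UTF-16 marks; a UTF-32LE file would be mistaken for UTF-16LE, so treat it as an illustration and not a complete detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not part of the JDK.
final class BomSniffer {
    /** Guesses the charset from a leading byte order mark, if there is one. */
    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of(StandardCharsets.UTF_8);
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of(StandardCharsets.UTF_16BE);
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of(StandardCharsets.UTF_16LE);
        return Optional.empty();
    }

    /** Decodes data using the BOM if present, else the caller's fallback charset. */
    static String decode(byte[] data, Charset fallback) {
        String text = new String(data, detect(data).orElse(fallback));
        // These decoders keep the mark as the first character; drop it.
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Without a BOM the caller still has to supply a fallback, which is exactly the "raw text file" case the original post complains about.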

Would this not cover your requirements?

Bye


Roedy Green wrote:
> The problem with encodings is they are not attached in any way or
> embedded in any way in a file. You are just supposed to know how a
> file is encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.
>
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding then another 0 byte.
>
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header for a UTF8 file gets
> written automatically.
>
> You write your app telling it to read and write this new encoding e.g.
> "labeled".
>
> You can write a utility to import files into your labelled universe by
> detecting or guessing or being told the encoding.  It gets a header.
> Other than that the file is unmodified.
>
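
For reference, the header format proposed in the quoted text (0 byte, ASCII encoding name, 0 byte, then the unmodified body) could be sketched in Java roughly like this. LabeledText is a made-up name, and this works on byte arrays only; the proposal's idea of registering "labeled" as a real Charset is not attempted here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the proposed "labeled" framing.
final class LabeledText {
    /** Frames text as: 0x00, charset name in ASCII, 0x00, encoded body. */
    static byte[] write(String text, Charset cs) {
        byte[] name = cs.name().getBytes(StandardCharsets.US_ASCII);
        byte[] body = text.getBytes(cs);
        byte[] out = new byte[name.length + body.length + 2];
        // out[0] and out[name.length + 1] stay 0: the two delimiters.
        System.arraycopy(name, 0, out, 1, name.length);
        System.arraycopy(body, 0, out, name.length + 2, body.length);
        return out;
    }

    /** Strips the header and decodes the body with the charset it names. */
    static String read(byte[] data) {
        if (data.length < 2 || data[0] != 0) {
            throw new IllegalArgumentException("missing label header");
        }
        int end = 1;
        while (end < data.length && data[end] != 0) {
            end++;
        }
        if (end == data.length) {
            throw new IllegalArgumentException("unterminated label header");
        }
        String name = new String(data, 1, end - 1, StandardCharsets.US_ASCII);
        // Throws an IllegalArgumentException subclass for unknown or malformed names.
        Charset cs = Charset.forName(name);
        return new String(data, end + 1, data.length - end - 1, cs);
    }
}
```

The obvious catch is that any program unaware of the convention sees the header as garbage at the start of the file.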

[toc] | [prev] | [next] | [standalone]

#19867

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2012-11-23 09:02 -0800
Message-ID	<9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com>
In reply to	#19865

On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
wrote, quoted or indirectly quoted someone who said :

>
>Would this not cover your requirements?

The problem is primarily raw text files with no indication of the
encoding.

The HTML encoding is incompetent. You can't read it without knowing
the encoding. It is just a confirmation. Thankfully the encoding comes
in the HTTP header -- a case where meta information is available.

I feel angry about this. What asshole dreamed up the idea of
exchanging files in various encodings without any labelling of the
encoding? That there is no universal way of identifying the format of
a file is astounding.  Parents who thought this way would send their
kids out into the world not knowing their names, addresses, or
genders.

It sounds like something one of those people who live on beer and
pizza, with a roomful of old pizza boxes lying around would have come
up with.  I wish Martha Stewart had gone into programming.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them.

[toc] | [prev] | [next] | [standalone]

#19869

From	Jan Burse <janburse@fastmail.fm>
Date	2012-11-23 19:21 +0100
Message-ID	<k8oers$p98$1@news.albasani.net>
In reply to	#19867

Roedy Green wrote:
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.

For example, when you edit an HTML file locally, you don't
have this HTTP header information. Also, where does the HTTP
header get the charset information in the first place?

Scenario 1:
- HTTP returns only mimetype=text/html without
    the charset option.
- The browser then reads the HTML doc meta tag, and
    adjusts the charset.

Scenario 2:
- HTTP returns mimetype=text/html; charset=<encoding>
    fetched from the HTML file meta tag.
- The browser does not read the HTML doc meta tag, and
    follows the charset found in the mimetype.

In both scenarios 1 + 2, the meta tag is used. I don't
know whether there is a scenario 3, and if there is, where
would it take the encoding from?
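
The lookup order in the two scenarios can be sketched in Java like this. CharsetResolver is a made-up name, and the regex is a deliberately naive assumption: it is not how a real browser prescans a document, and it would also match a stray "charset=" anywhere in the text it is given.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating "HTTP header first, then meta tag".
final class CharsetResolver {
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Pulls the first charset=... value out of a header value or HTML snippet. */
    static Optional<String> findCharset(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** HTTP Content-Type wins; otherwise the meta tag; otherwise the fallback. */
    static String resolve(String httpContentType, String htmlHead, String fallback) {
        Optional<String> fromHeader = findCharset(httpContentType);
        if (fromHeader.isPresent()) {
            return fromHeader.get();
        }
        return findCharset(htmlHead).orElse(fallback);
    }
}
```

The fallback argument is where a "scenario 3" would end up: neither source says anything, so the reader has to guess or use a configured default.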

Bye

[toc] | [prev] | [next] | [standalone]
