
A proposal to handle file encodings

Started by	Roedy Green <see_website@mindprod.com.invalid>
First post	2012-11-22 13:36 -0800
Last post	2012-11-26 02:46 +0000
Articles	19 on this page of 39 — 10 participants


  A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 13:36 -0800
    Re: A proposal to handle file encodings Joerg Meier <joergmmeier@arcor.de> - 2012-11-22 23:36 +0100
    Re: A proposal to handle file encodings markspace <-@.> - 2012-11-22 17:20 -0800
    Re: A proposal to handle file encodings Arne Vajhøj <arne@vajhoej.dk> - 2012-11-22 20:25 -0500
      Re: A proposal to handle file encodings markspace <-@.> - 2012-11-22 19:47 -0800
        Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 21:28 -0800
          Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-24 15:51 +0000
            Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 10:18 +0100
              Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 18:05 +0000
                Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-27 19:51 +0100
                  Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-29 02:22 +0000
                    Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-12-02 13:02 +0100
                      Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-12-02 19:36 +0000
                        Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-12-02 23:52 +0100
                          Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-12-02 23:08 +0000
      Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 13:13 +0100
        Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 18:07 +0000
    Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-23 16:33 +0100
      Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 09:02 -0800
        Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-23 19:21 +0100
          Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:11 +0100
            Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-24 00:53 +0100
              Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 09:13 +0100
              Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 06:50 -0800
                Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 10:07 +0100
                  Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 11:06 -0600
                    Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-27 19:28 +0100
            Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 06:42 -0800
              Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 09:57 +0100
            Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 15:09 +0100
          Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 15:06 +0100
        Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-23 16:43 -0600
          Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-24 01:02 +0100
        Re: A proposal to handle file encodings BGB <cr88192@hotmail.com> - 2012-11-25 14:36 -0600
          Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 16:51 -0600
            Re: A proposal to handle file encodings BGB <cr88192@hotmail.com> - 2012-11-25 17:54 -0600
            Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-26 02:03 +0100
              Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-26 02:20 +0100
                Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-26 02:46 +0000


#19871

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-11-24 00:11 +0100
Message-ID	<slrnkb00l8.jbt.hjp-usenet2@hrunkner.hjp.at>
In reply to	#19869

On 2012-11-23 18:21, Jan Burse <janburse@fastmail.fm> wrote:
> Roedy Green wrote:
>> The HTML encoding is incompetent. You can't read it without knowing
>> the encoding.

Not true in practice. Almost all encodings used in the real world are
some superset of ASCII, and you only need to recognize ASCII characters
to find the relevant meta tag.

>> It is just a confirmation. Thankfully the encoding comes
>> in the HTTP header -- a case where meta information is available.
[...]
> Scenario 2:
> - HTTP returns mimetype=text/html; charset=<encoding>
>     fetched from the HTML file meta tag.

Which web server does this? I think CERN httpd did, back in the 1990's,
but I don't think any of the current crop of servers does, at least not
without some extra plugins. Normally the charset is taken from the
server config.

	hp

-- 
   _  | Peter J. Holzer    | Curse of electronic word processing:
|_|_) | Sysadmin WSR       | You keep polishing on your text until
| |   | hjp@hjp.at         | the parts of the sentence no longer
__/   | http://www.hjp.at/ | fits together. -- Ralph Babel


#19873

From	Jan Burse <janburse@fastmail.fm>
Date	2012-11-24 00:53 +0100
Message-ID	<k8p2a9$3os$1@news.albasani.net>
In reply to	#19871

Peter J. Holzer wrote:
>> Scenario 2:
>> - HTTP returns mimetype=text/html; charset=<encoding>
>>     fetched from the HTML file meta tag.
> Which web server does this? I think CERN httpd did, back in the 1990's,
> but I don't think any of the current crop of servers does, at least not
> without some extra plugins. Normally the charset is taken from the
> server config.

It's the only way to retrieve the charset:
http://tools.ietf.org/html/rfc2045#section-5.1

It's also the only way to set the charset in dynamic pages.
For example, in JSP one has to do the following:

<%@page contentType="text/html; charset=UTF-8" %>

There is also a Content-Encoding header field, but I guess
that is not what Roedy wants, since the term
"Encoding" there refers to compression:
http://en.wikipedia.org/wiki/HTTP_compression

I guess Roedy wants the charset.

Bye


#19885

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-11-24 09:13 +0100
Message-ID	<slrnkb10cl.ufp.hjp-usenet2@hrunkner.hjp.at>
In reply to	#19873

On 2012-11-23 23:53, Jan Burse <janburse@fastmail.fm> wrote:
> Peter J. Holzer wrote:
>>> Scenario 2:
>>> - HTTP returns mimetype=text/html; charset=<encoding>
>>>     fetched from the HTML file meta tag.
>> Which web server does this? I think CERN httpd did, back in the 1990's,
>> but I don't think any of the current crop of servers does, at least not
>> without some extra plugins. Normally the charset is taken from the
>> server config.
>
> It's the only way to retrieve the charset:
> http://tools.ietf.org/html/rfc2045#section-5.1

That section defines the meaning of the Content-Type header, it doesn't
say anything about how that header is derived. It certainly doesn't say
anything about a web server (RFC 2045 is about mail, not web) extracting
the content type from an html file (the word "html" isn't even
mentioned).

> It's also the only way to set the charset in dynamic pages.
> For example in JSP one has to do the following:
>
><%@page contentType="text/html; charset=UTF-8" %>

This is something completely different than
    <meta http-equiv="content-type" content="text/html; charset=...">

The former is a JSP directive which gets translated into some Java code
which sets the Content-Type header of the HTTP response (probably by
calling setContentType() of the ServletResponse object).

The latter is just an element of the HTML response. It is typically
interpreted by the browser (but only if no charset was specified in the
HTTP header), not by the server.

	hp



#19894

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2012-11-24 06:50 -0800
Message-ID	<78n1b8tkmcbdefk2ifeeroklp93p88otma@4ax.com>
In reply to	#19873

On Sat, 24 Nov 2012 00:53:51 +0100, Jan Burse <janburse@fastmail.fm>
wrote, quoted or indirectly quoted someone who said :

>I guess Roedy wants the charset.

In HTTP the meta information is in the HTTP header. This is all very
well except that the server is just guessing. It is serving a
standard header for all documents with a given extension.  The meta
info needs to be in the document itself. Ditto for MIME type.

If the document is transported compressed, e.g. via SPDY 
http://mindprod.com/jgloss/spdy.html 
and fluffed up on the other end, then that compression is not part of the
document meta data. If it is kept around compressed, e.g. as a zip, then it
is.

When it arrives, and is saved on disk, the meta info needs to be
retained, so that an editor knows how to deal with it. The only way
you can do that is if the meta info is embedded in the file.

The half-assed way we do things depends on the fact encodings are not
all that different.  You can get it wrong and still muddle through.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them.


#19923

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-11-25 10:07 +0100
Message-ID	<slrnkb3nus.qr8.hjp-usenet2@hrunkner.hjp.at>
In reply to	#19894

On 2012-11-24 14:50, Roedy Green <see_website@mindprod.com.invalid> wrote:
> On Sat, 24 Nov 2012 00:53:51 +0100, Jan Burse <janburse@fastmail.fm>
> wrote, quoted or indirectly quoted someone who said :
>>I guess Roedy wants the charset.
>
> In HTTP the meta information is in the HTTP header. This is all very
> well except that the server is just guessing.

No. Normally it isn't guessing at all. It just uses the configured
charset.

> It is serving a standard header for all documents with a given
> extension.

Right. It is the responsibility of the server operator to make sure that
the extension matches the intended content-type. The server doesn't look
into the file to derive the content-type.

(For the "static files in a file system" case. Of course there are lots
of other cases, most prominently CMSs, where the finished HTML document
is assembled out of pieces stored in a database)

> The meta info needs to be in the document itself. Ditto for MIME type.

Then you wouldn't need a mime-type. That was invented precisely because
not all file formats are self-identifying. 

	hp



#19943

From	Joshua Cranmer <Pidgeot18@verizon.invalid>
Date	2012-11-25 11:06 -0600
Message-ID	<k8tj6n$ts4$1@dont-email.me>
In reply to	#19923

On 11/25/2012 3:07 AM, Peter J. Holzer wrote:
> On 2012-11-24 14:50, Roedy Green <see_website@mindprod.com.invalid> wrote:
>> On Sat, 24 Nov 2012 00:53:51 +0100, Jan Burse <janburse@fastmail.fm>
>> wrote, quoted or indirectly quoted someone who said :
>>> I guess Roedy wants the charset.
>>
>> In HTTP the meta information is in the HTTP header. This is all very
>> well except that the server is just guessing.
>
> No. Normally it isn't guessing at all. It just uses the configured
> charset.

And how is the configured charset not guessing? If a server is serving 
static files from a directory, I'm willing to bet that most 
administrators won't bother changing the default setting and instead 
will just hope that the default works.

I've had enough charset pains to know that much of it (particularly in 
English-language regions) is going to be people blindly using default 
settings. And I also know that not all tools agree on their default settings.

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth


#19997

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-11-27 19:28 +0100
Message-ID	<slrnkba1hp.k8a.hjp-usenet2@hrunkner.hjp.at>
In reply to	#19943

On 2012-11-25 17:06, Joshua Cranmer <Pidgeot18@verizon.invalid> wrote:
> On 11/25/2012 3:07 AM, Peter J. Holzer wrote:
>> On 2012-11-24 14:50, Roedy Green <see_website@mindprod.com.invalid> wrote:
>>> On Sat, 24 Nov 2012 00:53:51 +0100, Jan Burse <janburse@fastmail.fm>
>>> wrote, quoted or indirectly quoted someone who said :
>>>> I guess Roedy wants the charset.
>>>
>>> In HTTP the meta information is in the HTTP header. This is all very
>>> well except that the server is just guessing.
>>
>> No. Normally it isn't guessing at all. It just uses the configured
>> charset.
>
> And how is the configured charset not guessing?

The server doesn't guess. It just does what it is told.

The admin may be guessing, though.

	hp



#19893

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2012-11-24 06:42 -0800
Message-ID	<54n1b8hnhtb4693l7qsbvjelucf99kjnmf@4ax.com>
In reply to	#19871

On Sat, 24 Nov 2012 00:11:36 +0100, "Peter J. Holzer"
<hjp-usenet2@hjp.at> wrote, quoted or indirectly quoted someone who
said :

>>> The HTML encoding is incompetent. You can't read it without knowing
>>> the encoding.
>
>Not true in practice. Almost all encodings used in the real world are
>some superset of ASCII, and you only need to recognize ASCII characters
>to find the relevant meta tag.

You still have the 8-bit vs. 16-bit question, which you can settle with
the BOM in most cases.  It is still Mickey Mouse. The encoding should be
at the very front and encoded in ASCII or something fixed.


#19922

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-11-25 09:57 +0100
Message-ID	<slrnkb3nc7.qr8.hjp-usenet2@hrunkner.hjp.at>
In reply to	#19893

On 2012-11-24 14:42, Roedy Green <see_website@mindprod.com.invalid> wrote:
> On Sat, 24 Nov 2012 00:11:36 +0100, "Peter J. Holzer"
><hjp-usenet2@hjp.at> wrote, quoted or indirectly quoted someone who
> said :
>>>> The HTML encoding is incompetent. You can't read it without knowing
>>>> the encoding.
>>
>>Not true in practice. Almost all encodings used in the real world are
>>some superset of ASCII, and you only need to recognize ASCII characters
>>to find the relevant meta tag.
>
> You still have the 8-bit vs. 16-bit question, which you can settle with
> the BOM in most cases.

In this case the encoding is already known and the meta element must not
be used:

| The META declaration must only be used when the character encoding is
| organized such that ASCII-valued bytes stand for ASCII characters (at
| least until the META element is parsed).
    -- http://www.w3.org/TR/1999/REC-html401-19991224/charset.html

> It is still Mickey Mouse.

That wasn't your claim. Your claim was that it's impossible, while all
browsers of the last 15 years or so have demonstrated that it is possible
in practice - on billions of web sites.

> The encoding should be at the very front and encoded in ASCII or
> something fixed.

It is encoded in ASCII, and it 

| should appear as early as possible in the HEAD element.
    -- http://www.w3.org/TR/1999/REC-html401-19991224/charset.html

And of course there is always the HTTP header. In fact your whole
proposal sounds like an extremely simplified version of the MIME header.
Which was invented 20 years ago and is widely used.

And frankly, you picked the least interesting aspect of MIME: You can
just require that UTF-8 is the only permissible encoding for plain text
files. That's much simpler and more likely to be implemented than
requiring that all text files start with a header declaring the
encoding. At the same time you are missing out on other aspects of plain
text files (e.g., newline as line end vs. paragraph end, flowed) and of
course everything except plain text.

	hp



#19940

From	Sven Köhler <remove-sven.koehler@gmail.com>
Date	2012-11-25 15:09 +0100
Message-ID	<ahen43F9fcbU2@mid.dfncis.de>
In reply to	#19871

On 24.11.2012 00:11, Peter J. Holzer wrote:
> On 2012-11-23 18:21, Jan Burse <janburse@fastmail.fm> wrote:
>> Roedy Green wrote:
>>> The HTML encoding is incompetent. You can't read it without knowing
>>> the encoding.
> 
> Not true in practice. Almost all encodings used in the real world are
> some superset of ASCII, and you only need to recognize ASCII characters
> to find the relevant meta tag.

Except for UTF-16LE/BE, for example.
Or is a BOM mandatory for UTF-16? The downside of BOMs is that they
break features like includes. Many include mechanisms just copy the
byte stream, so BOMs end up in the middle of the page.
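
One way an include mechanism could avoid stray BOMs, as a minimal Java sketch: drop a leading UTF-8 BOM from each fragment before copying its bytes. The names IncludeJoiner and join are made up for illustration, and the sketch assumes every fragment is UTF-8; it does not describe how any particular server handles includes.

```java
import java.io.ByteArrayOutputStream;

final class IncludeJoiner {
    /** True if the fragment starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** Concatenates fragments, skipping a leading UTF-8 BOM on each one. */
    static byte[] join(byte[]... fragments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] fragment : fragments) {
            int start = hasUtf8Bom(fragment) ? 3 : 0;
            out.write(fragment, start, fragment.length - start);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a'};
        byte[] body = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'b'};
        System.out.println(join(header, body).length);
    }
}
```

A plain byte copy would have kept both BOMs, the second one landing in the middle of the joined page.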

Regards,
  Sven


#19939

From	Sven Köhler <remove-sven.koehler@gmail.com>
Date	2012-11-25 15:06 +0100
Message-ID	<ahemucF9fcbU1@mid.dfncis.de>
In reply to	#19869

On 23.11.2012 19:21, Jan Burse wrote:
> Roedy Green wrote:
>> The HTML encoding is incompetent. You can't read it without knowing
>> the encoding. It is just a confirmation. Thankfully the encoding comes
>> in the HTTP header -- a case where meta information is available.
> 
> For example when you edit a HTML file locally, you don't
> have this HTTP header information. Also where does the HTTP
> header get the charset information in the first place?
> 
> Scenario 1:
> - HTTP returns only mimetype=text/html without
>    the charset option.
> - The browser then reads the HTML doc meta tag, and
>    adjusts the charset.
> 
> Scenario 2:
> - HTTP returns mimetype=text/html; charset=<encoding>
>    fetched from the HTML file meta tag.
> - The browser does not read the HTML doc meta tag, and
>    follows the charset found in the mimetype.
> 
> In both scenarios 1 + 2, the meta tag is used. Don't
> know whether there is a scenario 3, and where should
> this scenario take the encoding from?

Scenario 3:

Apache configuration sets a default charset and sends Content-Type:
text/html; charset=iso-8859-1 even though the meta tag in the file
specifies utf8.

Luckily, this feature can be turned off. I'm not sure what the
default config is at the moment. Also, I don't know of any web server
that actually implements scenario 2. Mostly, specifying the charset in
the HTTP header is done by dynamic web pages (JSP, PHP, ASP), as they
allow setting the headers.


Also, why is this discussion in the Java newsgroup?
Just because Java asks programmers to specify the charset sometimes?


Regards,
  Sven


#19870

From	Joshua Cranmer <Pidgeot18@verizon.invalid>
Date	2012-11-23 16:43 -0600
Message-ID	<k8ou7f$o3b$1@dont-email.me>
In reply to	#19867

On 11/23/2012 11:02 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
> wrote, quoted or indirectly quoted someone who said :
>
>>
>> Would this not cover your requirements?
>
> The problem is primarily raw text files with no indication of the
> encoding.
>
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.

Except that sometimes the HTTP header is wrong. I have seen enough 
UTF-8/ISO 8859-1 mojibake that I don't tend to place great confidence in 
metadata except at the most direct level in the protocol (e.g., though 
RFC 3977 dictates that NNTP transport is all done in UTF-8, I have 
enough experience to know that this is a fiction not borne out by reality; 
but if a message says in its header that it is encoded in UTF-8, 
I'll trust that the message body is actually UTF-8).

In general, the optimal way to handle encoding in this modern day and 
age is the following extremely simple algorithm:
1. Always write out UTF-8.
2. When reading, if it doesn't fail to parse as UTF-8, assume it's 
UTF-8. Otherwise, assume it's the "platform default" (which generally 
means ISO 8859-1).
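
A minimal Java sketch of the two rules above. The class name Utf8OrFallback is made up for illustration, and the fallback charset is passed in by the caller rather than read from the platform, which is an assumption of this sketch. The strict check relies on a CharsetDecoder that reports malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8OrFallback {
    /** Rule 1: always write UTF-8. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Rule 2: accept the bytes as UTF-8 if they parse, else use the fallback. */
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume the fallback (e.g. ISO 8859-1).
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = encode("caf\u00e9");
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // Both byte sequences decode to the same four-character string.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length());
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).length());
    }
}
```

Pure ASCII input is valid UTF-8, so it takes the first branch either way; only bytes with the high bit set can trigger the fallback.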



#19874

From	Jan Burse <janburse@fastmail.fm>
Date	2012-11-24 01:02 +0100
Message-ID	<k8p2q5$4qe$1@news.albasani.net>
In reply to	#19870

Joshua Cranmer wrote:
>
> In general, the optimal way to handle encoding in this modern day and
> age is the following extremely simple algorithm:
> 1. Always write out UTF-8.
> 2. When reading, if it doesn't fail to parse as UTF-8, assume it's
> UTF-8. Otherwise, assume it's the "platform default" (which generally
> means ISO 8859-1).

This advice is only valid if you cannot influence the charset
on the server side, for example by setting an appropriate mimetype.
Apart from that it works perfectly fine.

What is a little bit annoying is that I didn't find a MimeType
decoder for the client side that easily gives me the
charset parameter. So I had to write my own.

In the class comment of this custom decoder I wrote:

  * <p>Needed for pre JRE 1.5 code, since later in JRE 1.6 the
  * activation framework has been bundled and one can use
  * javax.activation.MimeType</p>

Just wrap your con.getContentType() into this class, and then
call getParameter().
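
For readers without such a class at hand, a rough Java sketch of pulling the charset parameter out of a Content-Type string such as the one returned by con.getContentType(). This is not the custom decoder mentioned above; the name ContentTypeCharset is made up, and the regular expression only covers the simple cases (optional quotes, any letter case), not the full MIME grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {
    // Matches "; charset=value" with optional whitespace and optional quotes.
    private static final Pattern CHARSET = Pattern.compile(
            ";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the charset parameter value, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```

A null result is the case where the client has to fall back to other hints, such as the meta tag or a default.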

Bye


#19955

From	BGB <cr88192@hotmail.com>
Date	2012-11-25 14:36 -0600
Message-ID	<k8tvkc$h9g$1@news.albasani.net>
In reply to	#19867

On 11/23/2012 11:02 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
> wrote, quoted or indirectly quoted someone who said :
>
>>
>> Would this not cover your requirements?
>
> The problem is primarily raw text files with no indication of the
> encoding.
>
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.
>

it works as far as most usable encodings have ASCII as a subset, so 
whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header 
can still be parsed.

for UTF-16, there is typically the BOM, so if a BOM is seen, assume UTF-16.

with some cleverness, it could probably also be extended to support 
EBCDIC, basically just try reading as EBCDIC and see if it "makes sense".

> I feel angry about this. What asshole dreamed up the idea of
> exchanging files in various encodings without any labelling of the
> encoding? That there is no universal way of identifying the format of
> a file is astounding.  Parents who thought this way would send their
> kids out into the world not knowing their names, addresses, or
> genders.
>
> It sounds like something one of those people who live on beer and
> pizza, with a roomful of old pizza boxes lying around would have come
> up with.  I wish Martha Stewart had gone into programming.
>

this is overdramatizing the issue.

at first I thought it was about binary formats, which can often be 
identified if-needed by checking for magic values (sometimes augmented 
with things like header-checksums, ... which can reduce likelihood of 
false-positives).

OTOH, one can get into the whole thing of container formats, where a 
glob of opaque binary data is often wrapped up in such a format with 
some identification of what it is. a typical example of such a container 
format are things like video-formats (AVI, MKV, MP4, OGG/OGM, ...), 
which may contain frames using any number of codecs, and may sometimes 
add additional capabilities, such as the ability to multiplex or 
interleave data chunks, ...

for some of my own stuff, I am using informal container formats loosely 
based on the JPEG file format (itself mostly based on a system of 
"markers"). it works...

or such...


#19962

From	Joshua Cranmer <Pidgeot18@verizon.invalid>
Date	2012-11-25 16:51 -0600
Message-ID	<k8u7dk$muc$1@dont-email.me>
In reply to	#19955

On 11/25/2012 2:36 PM, BGB wrote:
> On 11/23/2012 11:02 AM, Roedy Green wrote:
>> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
>> wrote, quoted or indirectly quoted someone who said :
>>
>>>
>>> Would this not cover your requirements?
>>
>> The problem is primarily raw text files with no indication of the
>> encoding.
>>
>> The HTML encoding is incompetent. You can't read it without knowing
>> the encoding. It is just a confirmation. Thankfully the encoding comes
>> in the HTTP header -- a case where meta information is available.
>>
>
> it works as far as most usable encodings have ASCII as a subset, so
> whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
> can still be parsed.

Well, there's also the minor issue that some encodings use the same name 
for slightly (or sometimes greatly) different variants--I think Big5 is 
an offender here in having a few different variants in mapping 
multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are 
both laughably useless, since they pretend that the 8th bit is never set.

> for UTF-16, there is typically the BOM, so if a BOM is seen, assume UTF-16.

In the HTML 5 specification (which is far closer to reality as far as 
HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other 
charset information, including what HTML claims the header is.

> with some cleverness, it could probably also be extended to support
> EBCDIC, basically just try reading as EBCDIC and see if it "makes sense".

I think EBCDIC is dead as far as web-compatibility is concerned, but the 
HTML 5 spec also specifies that the scanning for the <meta happens by 
looking for the ASCII octets in particular, so any non-ASCII-compatible 
charset (in particular, EBCDIC and UTF-7) is probably in practice 
unusable on the web.

And, seriously, if you're designing a new format that contains textual 
data, require UTF-8.

[1] HTML 4.01 is a 13-year old specification which was never fully 
implemented by browsers and is laughably irrelevant for how modern 
browsers actually look at input. The HTML 5 specification, though still 
a draft, is much more grounded in reality, at least as far as how 
browsers are actually going to parse the mangled crap people claim is 
HTML; it was developed, in part, by reverse engineering what browsers 
actually DID rather than relying on what an ancient spec said they should do.



#19964

From	BGB <cr88192@hotmail.com>
Date	2012-11-25 17:54 -0600
Message-ID	<k8ub8s$97s$1@news.albasani.net>
In reply to	#19962

On 11/25/2012 4:51 PM, Joshua Cranmer wrote:
> On 11/25/2012 2:36 PM, BGB wrote:
>> On 11/23/2012 11:02 AM, Roedy Green wrote:
>>> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
>>> wrote, quoted or indirectly quoted someone who said :
>>>
>>>>
>>>> Would this not cover your requirements?
>>>
>>> The problem is primarily raw text files with no indication of the
>>> encoding.
>>>
>>> The HTML encoding is incompetent. You can't read it without knowing
>>> the encoding. It is just a confirmation. Thankfully the encoding comes
>>> in the HTTP header -- a case where meta information is available.
>>>
>>
>> it works as far as most usable encodings have ASCII as a subset, so
>> whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
>> can still be parsed.
>
> Well, there's also the minor issue that some encodings use the same name
> for slightly (or sometimes greatly) different variants--I think Big5 is
> an offender here in having a few different variants in mapping
> multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
> both laughably useless, since they pretend that the 8th bit is never set.
>

well, you only need to read far enough to read the header, then you can 
re-read in the needed encoding, if needed.

example:
assume ASCII, try to read header;
see that encoding says UTF-8 or 8859-1 or KOI-8R or whatever else;
reset, read again, "for real this time".
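
the three steps above, as a rough Java sketch. the class name MetaCharsetSniffer, the 1024-byte prefix and the regular expression are assumptions made for illustration, not what any browser actually runs. the first pass decodes as ISO-8859-1 only because that maps every byte to some character, so the ASCII parts of the header survive whatever the real encoding is.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {
    // Finds charset=NAME inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** First pass: look for a declared charset in the first bytes only. */
    static Charset sniff(byte[] bytes, Charset fallback) {
        int limit = Math.min(bytes.length, 1024);
        String head = new String(bytes, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: ignore the declaration.
            }
        }
        return fallback;
    }

    /** Second pass: decode everything "for real" with the sniffed charset. */
    static String decode(byte[] bytes, Charset fallback) {
        return new String(bytes, sniff(bytes, fallback));
    }

    public static void main(String[] args) {
        String html = "<meta charset=\"utf-8\"><p>caf\u00e9</p>";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(bytes, StandardCharsets.ISO_8859_1));
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).equals(html));
    }
}
```

this only works for encodings where ASCII-valued bytes stand for ASCII characters, so UTF-16 and EBCDIC input falls through to the fallback.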


>> for UTF-16, there is typically the BOM, so if a BOM is seen, assume
>> UTF-16.
>
> In the HTML 5 specification (which is far closer to reality as far as
> HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other
> charset information, including what HTML claims the header is.
>

well, yes, partly. if you ignore the BOM and assume ASCII or 8859-1 or 
similar, then the document can't be parsed.


>> with some cleverness, it could probably also be extended to support
>> EBCDIC, basically just try reading as EBCDIC and see if it "makes
>> sense".
>
> I think EBCDIC is dead as far as web-compatibility is concerned, but the
> HTML 5 spec also specifies that the scanning for the <meta happens by
> looking for the ASCII octets in particular, so any non-ASCII-compatible
> charset (in particular, EBCDIC and UTF-7) is probably in practice
> unusable on the web.
>

pretty much, but not theoretically impossible at least.


> And, seriously, if you're designing a new format that contains textual
> data, require UTF-8.
>

this is pretty much what I do.
it is not always clear-cut whether something is 
plain ASCII or UTF-8, but this can be glossed over:
if it is textual, it is meant to be UTF-8, and falling short of this is 
an implementation issue.

I sometimes support UTF-16, but usually in these areas it is a shim to 
detect the BOM and convert the data to UTF-8, and other times the UTF-8 
is converted back to UTF-16 as-needed.
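
a shim of this kind might look roughly like the following Java sketch. the name BomShim is made up, and the sketch assumes the whole input is already in memory; it also treats FF FE as UTF-16LE without checking for the UTF-32LE BOM, which starts with the same two bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BomShim {
    /** Returns the input as BOM-less UTF-8, transcoding UTF-16 if a BOM says so. */
    static byte[] toUtf8(byte[] in) {
        if (startsWith(in, 0xFE, 0xFF)) {
            return transcode(in, 2, StandardCharsets.UTF_16BE);
        }
        if (startsWith(in, 0xFF, 0xFE)) {
            return transcode(in, 2, StandardCharsets.UTF_16LE);
        }
        if (startsWith(in, 0xEF, 0xBB, 0xBF)) {
            // Already UTF-8: just drop the BOM.
            return Arrays.copyOfRange(in, 3, in.length);
        }
        // No BOM: assume the data is UTF-8 (or plain ASCII) already.
        return in;
    }

    private static boolean startsWith(byte[] in, int... prefix) {
        if (in.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((in[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    private static byte[] transcode(byte[] in, int offset, Charset from) {
        String text = new String(in, offset, in.length - offset, from);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0, (byte) 'h', 0, (byte) 'i'};
        System.out.println(new String(toUtf8(utf16be), StandardCharsets.UTF_8));
    }
}
```

going back to UTF-16 when needed is then just a matter of decoding the UTF-8 bytes and re-encoding them.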


> [1] HTML 4.01 is a 13-year old specification which was never fully
> implemented by browsers and is laughably irrelevant for how modern
> browsers actually look at input. The HTML 5 specification, though still
> a draft, is much more grounded in reality, at least as far as how
> browsers are actually going to parse the mangled crap people claim is
> HTML; it was developed, in part, by reverse engineering what browsers
> actually DID, not by relying on what an ancient spec said they should do.
>

makes sense.


#19969

From	Jan Burse <janburse@fastmail.fm>
Date	2012-11-26 02:03 +0100
Message-ID	<k8uf5u$gah$1@news.albasani.net>
In reply to	#19962

Joshua Cranmer wrote:
>
> Well, there's also the minor issue that some encodings use the same name
> for slightly (or sometimes greatly) different variants--I think Big5 is
> an offender here in having a few different variants in mapping
> multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
> both laughably useless, since they pretend that the 8th bit is never set.

According to Wiki:

"Generally, this encoding form is rarely used, even on EBCDIC based 
mainframes for which it was designed. IBM EBCDIC based mainframe 
operating systems, like z/OS, usually use UTF-16 for complete Unicode 
support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit 
support UTF-16 on IBM mainframes."

http://en.wikipedia.org/wiki/UTF-EBCDIC


#19971

From	Jan Burse <janburse@fastmail.fm>
Date	2012-11-26 02:20 +0100
Message-ID	<k8ug5b$3f2$1@news.albasani.net>
In reply to	#19969

BTW: This is a nice read:
http://www.transbay.net/~enf/ascii/ascii.pdf

Shows the history of ASCII, EBCDIC, ISO-646, etc.

Jan Burse wrote:
> Joshua Cranmer wrote:
>>
>> Well, there's also the minor issue that some encodings use the same name
>> for slightly (or sometimes greatly) different variants--I think Big5 is
>> an offender here in having a few different variants in mapping
>> multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
>> both laughably useless, since they pretend that the 8th bit is never set.
>
> According to Wiki:
>
> "Generally, this encoding form is rarely used, even on EBCDIC based
> mainframes for which it was designed. IBM EBCDIC based mainframe
> operating systems, like z/OS, usually use UTF-16 for complete Unicode
> support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit
> support UTF-16 on IBM mainframes."
>
> http://en.wikipedia.org/wiki/UTF-EBCDIC


#19976

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2012-11-26 02:46 +0000
Message-ID	<k8ul66$enu$2@localhost.localdomain>
In reply to	#19971

On Mon, 26 Nov 2012 02:20:42 +0100, Jan Burse wrote:

> BTW: This is a nice read: http://www.transbay.net/~enf/ascii/ascii.pdf
> 
> Shows the history of ASCII, EBCDIC, ISO-646, etc.
> 
> Jan Burse wrote:
>> Joshua Cranmer wrote:
>>>
>>> Well, there's also the minor issue that some encodings use the same
>>> name for slightly (or sometimes greatly) different variants--I think
>>> Big5 is an offender here in having a few different variants in mapping
>>> multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
>>> both laughably useless, since they pretend that the 8th bit is never
>>> set.
>>
>> According to Wiki:
>>
>> "Generally, this encoding form is rarely used, even on EBCDIC based
>> mainframes for which it was designed. IBM EBCDIC based mainframe
>> operating systems, like z/OS, usually use UTF-16 for complete Unicode
>> support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML
>> toolkit support UTF-16 on IBM mainframes."
>>
>> http://en.wikipedia.org/wiki/UTF-EBCDIC

And don't forget that the EBCDIC bit patterns are in turn derived 
directly from an 80 column punched card. This is obvious if you've ever 
used a manual 12 key punch:

A-I are keyed by pressing 12 and 1-9
J-R are keyed by pressing 11 and 1-9
S-Z are keyed by pressing 0  and 2-9

so now you know why there are all those weird punctuation symbols in the 
gaps between I and J and between R and S when you arrange EBCDIC 
characters by ascending hexadecimal value. The EBCDIC code map is the 
result of somebody choosing an easy way to encode the outputs from the 12 
sensors in an 80 column card reader while retaining the traditional hole 
pattern a card punch made and, presumably, keeping to the sort sequence 
used by a card sorter so that the data prep people wouldn't get confused.
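
The gaps are easy to see if you compute the codes. A small Java sketch (the class name is made up; the values are the uppercase letter codes shared by the common EBCDIC code pages such as 037):

```java
class EbcdicLetters {
    /** EBCDIC code of an uppercase Latin letter A-Z. */
    static int code(char c) {
        if (c >= 'A' && c <= 'I') return 0xC1 + (c - 'A');   // 0xC1..0xC9
        if (c >= 'J' && c <= 'R') return 0xD1 + (c - 'J');   // 0xD1..0xD9
        if (c >= 'S' && c <= 'Z') return 0xE2 + (c - 'S');   // 0xE2..0xE9
        throw new IllegalArgumentException("not an uppercase A-Z letter: " + c);
    }
}
```

The high nibble identifies the group of letters and the low nibble is the digit, so the letters sit in three separate runs, with the codes between 0xC9 and 0xD1 and between 0xD9 and 0xE2 left over for other characters.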

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
