From: Joshua Cranmer <Pidgeot18@verizon.invalid>
Newsgroups: comp.lang.java.programmer
Subject: Re: A proposal to handle file encodings
Date: Sun, 25 Nov 2012 16:51:15 -0600
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <k8u7dk$muc$1@dont-email.me>
References: <lb6ta81u9imfdtlpuesoc8slncju0ehsnm@4ax.com> <k8o50f$1q6$1@news.albasani.net> <9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com> <k8tvkc$h9g$1@news.albasani.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 25 Nov 2012 22:51:32 +0000 (UTC)
Injection-Info: mx04.eternal-september.org; posting-host="5a9707252ba5efb9bece56d1f4656a90"; logging-data="23500"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18fsp/VT3+PRaEmDMykHkVqV+YOwQxw9vM="
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Thunderbird/17.0
In-Reply-To: <k8tvkc$h9g$1@news.albasani.net>

On 11/25/2012 2:36 PM, BGB wrote:
> On 11/23/2012 11:02 AM, Roedy Green wrote:
>> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
>> wrote, quoted or indirectly quoted someone who said :
>>
>>>
>>> Would this not cover your requirements?
>>
>> The problem is primarily raw text files with no indication of the
>> encoding.
>>
>> The HTML encoding is incompetent. You can't read it without knowing
>> the encoding. It is just a confirmation. Thankfully the encoding comes
>> in the HTTP header -- a case where meta information is available.
>>
>
> it works as far as most usable encodings have ASCII as a subset, so
> whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
> can still be parsed.

Well, there's also the minor issue that some encodings use the same name
for slightly (or sometimes greatly) different variants--I think Big5 is
an offender here, with a few variants that map multioctet chars to
Unicode code points differently. "ASCII" and "EBCDIC" are both laughably
useless as labels: "ASCII" says nothing about octets with the 8th bit
set, and "EBCDIC" is really a family of mutually incompatible code pages.
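
To make the 8th-bit point concrete, here is a minimal Java sketch. The
class name HighBitDemo and the sample octets are made up for
illustration; the only thing it shows is that the same octets decode
differently depending on which label you trust.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes the same octets under two labels.
final class HighBitDemo {
    // Decode octets using whatever charset the label claims.
    static String decode(byte[] octets, Charset label) {
        return new String(octets, label);
    }

    public static void main(String[] args) {
        // 0xE9 has the 8th bit set, so it is not an ASCII octet.
        byte[] octets = { 'c', 'a', 'f', (byte) 0xE9 };

        // US-ASCII has no mapping for 0xE9; String substitutes U+FFFD.
        System.out.println(decode(octets, StandardCharsets.US_ASCII));

        // ISO-8859-1 maps 0xE9 to U+00E9 (e with acute accent).
        System.out.println(decode(octets, StandardCharsets.ISO_8859_1));
    }
}
```

A file labelled "ASCII" that contains such octets is really in some
other, unnamed encoding, which is why the label alone tells you so
little.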

> for UTF-16, there is typically the BOM, so if a BOM is seen, assume UTF-16.

In the HTML 5 specification (which is far closer to reality as far as
HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other
charset information, including whatever charset the HTTP header claims.
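
A rough Java sketch of "BOM first, everything else second" might look
like the following. BomSniffer, sniff and choose are names I made up,
and this only covers the UTF-8 and UTF-16 BOMs; it is not the spec's
full algorithm.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: lets a leading BOM override a declared charset.
final class BomSniffer {
    // Returns the charset implied by a leading BOM, or null if none.
    static Charset sniff(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // The BOM wins; the declared charset (e.g. from an HTTP header)
    // is only used when no BOM is present.
    static Charset choose(byte[] head, Charset declared) {
        Charset fromBom = sniff(head);
        return fromBom != null ? fromBom : declared;
    }
}
```

So choose() on a document starting with EF BB BF returns UTF-8 even if
the caller passes ISO-8859-1 as the declared charset.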

> with some cleverness, it could probably also be extended to support
> EBCEDIC, basically just try reading as EBCEDIC and see if it "makes sense".

I think EBCDIC is dead as far as web compatibility is concerned, but the
HTML 5 spec also specifies that the scan for the <meta declaration works
by looking for the ASCII octets in particular, so any
non-ASCII-compatible charset (in particular, EBCDIC and UTF-7) is
probably unusable on the web in practice.
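
As a heavily simplified Java sketch of why that is (this is NOT the
spec's actual prescan algorithm; MetaPrescan, the regex and the
1024-octet window are my own assumptions for illustration): the scan
works on raw octets, so it can only ever find a declaration whose
octets are the ASCII ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for a <meta charset> prescan.
final class MetaPrescan {
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found.
    static String prescan(byte[] document) {
        // Look only at the start of the document (assumed window size).
        int n = Math.min(document.length, 1024);
        // ISO-8859-1 maps each octet to one char, so ASCII octets
        // survive unchanged and nothing else is interpreted.
        String head = new String(document, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<meta charset=\"utf-8\">";
        // ASCII-compatible octets: the declaration is found.
        System.out.println(prescan(html.getBytes(StandardCharsets.US_ASCII)));
        // BOM-less UTF-16BE stands in here for any encoding that is
        // not ASCII-compatible: the same text is not found.
        System.out.println(prescan(html.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

The second call finds nothing because every ASCII octet is separated
from the next by a zero octet; an EBCDIC document fails for the same
basic reason, since "<meta" is spelled with entirely different octets.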

And, seriously, if you're designing a new format that contains textual 
data, require UTF-8.
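
In Java, "require" can be taken literally: a sketch of a reader that
rejects anything that is not well-formed UTF-8 instead of silently
substituting replacement characters. StrictUtf8 is a made-up name; the
CharsetDecoder calls are the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes UTF-8 and fails loudly on bad input.
final class StrictUtf8 {
    static String decode(byte[] octets) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw rather than replace.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(octets))
                .toString();
    }

    public static void main(String[] args) {
        // A lone 0xE9 followed by '!' is Latin-1 text, not UTF-8.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9, '!' };
        try {
            System.out.println(decode(latin1));
        } catch (CharacterCodingException e) {
            System.out.println("rejected: not valid UTF-8");
        }
    }
}
```

Note that new String(bytes, StandardCharsets.UTF_8) would accept the
same input and quietly insert U+FFFD, which is usually not what you
want when the format says UTF-8 is mandatory.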

[1] HTML 4.01 is a 13-year-old specification which was never fully
implemented by browsers and is laughably irrelevant to how modern
browsers actually look at input. The HTML 5 specification, though still
a draft, is much more grounded in reality, at least as far as how
browsers are actually going to parse the mangled crap people claim is
HTML; it was developed, in part, by reverse engineering what browsers
actually DID rather than relying on what an ancient spec said they
should do.

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth