Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!eternal-september.org!feeder.eternal-september.org!mx04.eternal-september.org!.POSTED!not-for-mail
From: Joshua Cranmer <Pidgeot18@verizon.invalid>
Newsgroups: comp.lang.java.programmer
Subject: Re: A proposal to handle file encodings
Date: Fri, 23 Nov 2012 16:43:44 -0600
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <k8ou7f$o3b$1@dont-email.me>
References: <lb6ta81u9imfdtlpuesoc8slncju0ehsnm@4ax.com> <k8o50f$1q6$1@news.albasani.net> <9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 23 Nov 2012 22:43:59 +0000 (UTC)
Injection-Info: mx04.eternal-september.org; posting-host="5a9707252ba5efb9bece56d1f4656a90"; logging-data="24683"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18ghKXwsiTqR4owoO5BOl6SHxkpmDl76WY="
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Thunderbird/17.0
In-Reply-To: <9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com>
Cancel-Lock: sha1:Qhg2OhNYL6aB/VfH0OZBL+4AYE8=
Xref: csiph.com comp.lang.java.programmer:19870

On 11/23/2012 11:02 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
> wrote, quoted or indirectly quoted someone who said :
>
>>
>> Would this not cover your requirements?
>
> The problem is primarily raw text files with no indication of the
> encoding.
>
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.

Except that sometimes the HTTP header is wrong. I have seen enough 
UTF-8/ISO 8859-1 mojibake that I don't tend to place great confidence in 
metadata except at the most direct level in the protocol (e.g., though 
RFC 3977 dictates that NNTP transport is all done in UTF-8, I have 
enough experience to know that this is a fiction not borne out by
reality; but if a message says that it has an encoding of UTF-8 in its
header,
I'll trust that the message body is actually UTF-8).

In general, the optimal way to handle encoding in this modern day and
age is the following extremely simple algorithm:
1. Always write out UTF-8.
2. When reading, if it doesn't fail to parse as UTF-8, assume it's 
UTF-8. Otherwise, assume it's the "platform default" (which generally 
means ISO 8859-1).
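
The two steps above could be sketched in Java roughly as follows. This
is only an illustration, not tested production code: the class name
EncodingGuess and the methods encode/decode are made up for this
example, and the fallback charset is passed in by the caller rather
than taken for granted. The one detail that matters is that the UTF-8
decoder must be told to REPORT errors, since new String(bytes, "UTF-8")
silently substitutes U+FFFD and so never "fails to parse".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are assumptions made for this sketch.
final class EncodingGuess {

    // Step 1: always write out UTF-8.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Step 2: try strict UTF-8 first, otherwise use the fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Throws if the bytes are not well-formed UTF-8.
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy encoding instead.
            return new String(bytes, fallback);
        }
    }
}
```

A caller that wants the "platform default" behaviour would pass
Charset.defaultCharset() as the fallback; passing
StandardCharsets.ISO_8859_1 makes the result independent of the
machine it runs on. Note that pure ASCII input decodes identically
either way, so the guess only matters for bytes above 0x7F.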

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth