From: Joshua Cranmer
Newsgroups: comp.lang.java.programmer
Subject: Re: A proposal to handle file encodings
Date: Fri, 23 Nov 2012 16:43:44 -0600

On 11/23/2012 11:02 AM, Roedy Green wrote:
> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse
> wrote, quoted or indirectly quoted someone who said :
>
>>
>> Would this not cover your requirements?
>
> The problem is primarily raw text files with no indication of the
> encoding.
>
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.

Except that sometimes the HTTP header is wrong.
I have seen enough UTF-8/ISO 8859-1 mojibake that I don't tend to place great confidence in metadata except at the most direct level in the protocol (e.g., though RFC 3977 dictates that NNTP transport is all done in UTF-8, I have enough experience to know that this is a fiction not borne out by reality; but if a message says that it has an encoding of UTF-8 in its header, I'll trust that the message body is actually UTF-8).

In general, the best way to handle encodings in this day and age is the following extremely simple algorithm:

1. Always write out UTF-8.
2. When reading, if it doesn't fail to parse as UTF-8, assume it's UTF-8. Otherwise, assume it's the "platform default" (which generally means ISO 8859-1).

-- 
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
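
For what it's worth, the two-step algorithm above can be sketched in Java roughly as follows. This is only an illustration under some assumptions: the class name EncodingGuess and the helpers encode and decode are made up for this example, and the fallback charset is passed in explicitly rather than taken from Charset.defaultCharset(), since the platform default varies between systems and JVM versions. The decoder has to be told to REPORT malformed input, because new String(bytes, UTF_8) quietly substitutes U+FFFD instead of failing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not from the post.
class EncodingGuess {

    // Step 1: always write out UTF-8.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Step 2: if the bytes parse as UTF-8, assume UTF-8;
    // otherwise decode them with the caller-supplied fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back (the post suggests the platform default).
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] utf8 = encode(text);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        // Both inputs should round-trip to the same string.
        System.out.println(text.equals(decode(utf8, StandardCharsets.ISO_8859_1)));
        System.out.println(text.equals(decode(latin1, StandardCharsets.ISO_8859_1)));
    }
}
```

The heuristic works as well as it does because non-ASCII ISO 8859-1 text rarely happens to form valid UTF-8 multi-byte sequences, so a clean strict decode is strong evidence that the input really is UTF-8. Pure ASCII input decodes identically either way.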