From: Joshua Cranmer
Newsgroups: comp.lang.java.programmer
Subject: Re: A proposal to handle file encodings
Date: Sun, 25 Nov 2012 16:51:15 -0600

On 11/25/2012 2:36 PM, BGB wrote:
> On 11/23/2012 11:02 AM, Roedy Green wrote:
>> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse
>> wrote, quoted or indirectly quoted someone who said :
>>
>>>
>>> Would this not cover your requirements?
>>
>> The problem is primarily raw text files with no indication of the
>> encoding.
>>
>> The HTML encoding is incompetent. You can't read it without knowing
>> the encoding. It is just a confirmation. Thankfully the encoding comes
>> in the HTTP header -- a case where meta information is available.
>>
>
> it works as far as most usable encodings have ASCII as a subset, so
> whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
> can still be parsed.
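
To make the quoted point concrete, here is a minimal Java sketch of scanning the start of a document for a charset declaration without knowing the encoding yet. It is not the HTML 5 prescan algorithm, just an illustration of why an ASCII-compatible encoding lets you read the declaration at all. The class name `MetaCharsetSniffer`, the 1024-byte `SCAN_LIMIT`, and the regex are all assumptions made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Assumption for this sketch: only look at the first 1024 bytes.
    private static final int SCAN_LIMIT = 1024;

    // Loose match for charset=NAME; a real parser is far stricter than this.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    public static Optional<Charset> sniff(byte[] data) {
        int len = Math.min(data.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so this decode
        // never fails and leaves all ASCII bytes readable as themselves.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }
}
```

Note that this finds nothing in a UTF-16 document, because every ASCII character there is interleaved with a zero byte, which is exactly where the BOM has to take over.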

Well, there's also the minor issue that some encodings use the same name
for slightly (or sometimes greatly) different variants--I think Big5 is
an offender here in having a few different variants in mapping
multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
both laughably useless, since they pretend that the 8th bit is never set.

> for UTF-16, there is typically the BOM, so if a BOM is seen, assume UTF-16.

In the HTML 5 specification (which is far closer to reality as far as
HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other
charset information, including what the HTTP header claims the encoding is.

> with some cleverness, it could probably also be extended to support
> EBCEDIC, basically just try reading as EBCEDIC and see if it "makes sense".

I think EBCDIC is dead as far as web-compatibility is concerned, but the
HTML 5 spec also specifies that the scanning for the