Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #19962
| From | Joshua Cranmer <Pidgeot18@verizon.invalid> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: A proposal to handle file encodings |
| Date | 2012-11-25 16:51 -0600 |
| Organization | A noiseless patient Spider |
| Message-ID | <k8u7dk$muc$1@dont-email.me> (permalink) |
| References | <lb6ta81u9imfdtlpuesoc8slncju0ehsnm@4ax.com> <k8o50f$1q6$1@news.albasani.net> <9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com> <k8tvkc$h9g$1@news.albasani.net> |
On 11/25/2012 2:36 PM, BGB wrote: > On 11/23/2012 11:02 AM, Roedy Green wrote: >> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm> >> wrote, quoted or indirectly quoted someone who said : >> >>> >>> Would this not cover your requirements? >> >> The problem is primarily raw text files with no indication of the >> encoding. >> >> The HTML encoding is incompetent. You can't read it without knowing >> the encoding. It is just a confirmation. Thankfully the encoding comes >> in the HTTP header -- a case where meta information is available. >> > > it works as far as most usable encodings have ASCII as a subset, so > whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header > can still be parsed. Well, there's also the minor issue that some encodings use the same name for slightly (or sometimes greatly) different variants--I think Big5 is an offender here in having a few different variants in mapping multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are both laughably useless, since they pretend that the 8th bit is never set. > for UTF-16, there is typically the BOM, so if a BOM is seen, assume UTF-16. In the HTML 5 specification (which is far closer to reality as far as HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other charset information, including what HTML claims the header is. > with some cleverness, it could probably also be extended to support > EBCEDIC, basically just try reading as EBCEDIC and see if it "makes sense". I think EBCDIC is dead as far as web-compatibility is concerned, but the HTML 5 spec also specifies that the scanning for the <meta happens by looking for the ASCII octets in particular, so any non-ASCII-compatible charset (in particular, EBCDIC and UTF-7) is probably in practice unusable on the web. And, seriously, if you're designing a new format that contains textual data, require UTF-8. [1] HTML 4.01 is a 13-year old specification which was never fully implemented by browsers and is laughably irrelevant for how modern browsers actually look at input. The HTML 5 specification, though still a draft, is much more grounded in reality, at least as far as how browsers are actually going to parse the mangled crap people claim is HTML; it was developed, in part, by reverse engineering what browsers actually DID and not rely on what an ancient spec said they should do. -- Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
Back to comp.lang.java.programmer | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 13:36 -0800
Re: A proposal to handle file encodings Joerg Meier <joergmmeier@arcor.de> - 2012-11-22 23:36 +0100
Re: A proposal to handle file encodings markspace <-@.> - 2012-11-22 17:20 -0800
Re: A proposal to handle file encodings Arne Vajhøj <arne@vajhoej.dk> - 2012-11-22 20:25 -0500
Re: A proposal to handle file encodings markspace <-@.> - 2012-11-22 19:47 -0800
Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 21:28 -0800
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-24 15:51 +0000
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 10:18 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 18:05 +0000
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-27 19:51 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-29 02:22 +0000
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-12-02 13:02 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-12-02 19:36 +0000
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-12-02 23:52 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-12-02 23:08 +0000
Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 13:13 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-25 18:07 +0000
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-23 16:33 +0100
Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-23 09:02 -0800
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-23 19:21 +0100
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:11 +0100
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-24 00:53 +0100
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 09:13 +0100
Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 06:50 -0800
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 10:07 +0100
Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 11:06 -0600
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-27 19:28 +0100
Re: A proposal to handle file encodings Roedy Green <see_website@mindprod.com.invalid> - 2012-11-24 06:42 -0800
Re: A proposal to handle file encodings "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-25 09:57 +0100
Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 15:09 +0100
Re: A proposal to handle file encodings Sven Köhler <remove-sven.koehler@gmail.com> - 2012-11-25 15:06 +0100
Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-23 16:43 -0600
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-24 01:02 +0100
Re: A proposal to handle file encodings BGB <cr88192@hotmail.com> - 2012-11-25 14:36 -0600
Re: A proposal to handle file encodings Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-11-25 16:51 -0600
Re: A proposal to handle file encodings BGB <cr88192@hotmail.com> - 2012-11-25 17:54 -0600
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-26 02:03 +0100
Re: A proposal to handle file encodings Jan Burse <janburse@fastmail.fm> - 2012-11-26 02:20 +0100
Re: A proposal to handle file encodings Martin Gregorie <martin@address-in-sig.invalid> - 2012-11-26 02:46 +0000
csiph-web