| From | BGB <cr88192@hotmail.com> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: A proposal to handle file encodings |
| Date | 2012-11-25 17:54 -0600 |
| Organization | albasani.net |
| Message-ID | <k8ub8s$97s$1@news.albasani.net> |
| References | <lb6ta81u9imfdtlpuesoc8slncju0ehsnm@4ax.com> <k8o50f$1q6$1@news.albasani.net> <9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com> <k8tvkc$h9g$1@news.albasani.net> <k8u7dk$muc$1@dont-email.me> |
On 11/25/2012 4:51 PM, Joshua Cranmer wrote:
> On 11/25/2012 2:36 PM, BGB wrote:
>> On 11/23/2012 11:02 AM, Roedy Green wrote:
>>> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
>>> wrote, quoted or indirectly quoted someone who said :
>>>
>>>>
>>>> Would this not cover your requirements?
>>>
>>> The problem is primarily raw text files with no indication of the
>>> encoding.
>>>
>>> The HTML encoding is incompetent. You can't read it without knowing
>>> the encoding. It is just a confirmation. Thankfully the encoding comes
>>> in the HTTP header -- a case where meta information is available.
>>>
>>
>> it works as far as most usable encodings have ASCII as a subset, so
>> whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
>> can still be parsed.
>
> Well, there's also the minor issue that some encodings use the same name
> for slightly (or sometimes greatly) different variants--I think Big5 is
> an offender here in having a few different variants in mapping
> multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
> both laughably useless, since they pretend that the 8th bit is never set.
>

well, you only need to read far enough to read the header, then you can
re-read in the needed encoding, if needed.

example: assume ASCII, try to read header; see that encoding says UTF-8
or 8859-1 or KOI8-R or whatever else; reset, read again, "for real this
time".

>> for UTF-16, there is typically the BOM, so if a BOM is seen, assume
>> UTF-16.
>
> In the HTML 5 specification (which is far closer to reality as far as
> HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other
> charset information, including what HTML claims the header is.
>

well, yes, partly.

if you ignore the BOM and assume ASCII or 8859-1 or similar, then the
document can't be parsed.

>> with some cleverness, it could probably also be extended to support
>> EBCDIC, basically just try reading as EBCDIC and see if it "makes
>> sense".
>
> I think EBCDIC is dead as far as web-compatibility is concerned, but the
> HTML 5 spec also specifies that the scanning for the <meta happens by
> looking for the ASCII octets in particular, so any non-ASCII-compatible
> charset (in particular, EBCDIC and UTF-7) is probably in practice
> unusable on the web.
>

pretty much, but not theoretically impossible at least.

> And, seriously, if you're designing a new format that contains textual
> data, require UTF-8.
>

this is pretty much what I do.

though things are not always clear cut as to whether it is plain ASCII
or UTF-8, this can be glossed over: if it is textual, it is meant to be
UTF-8, and falling short of this is an implementation issue.

I sometimes support UTF-16, but usually in these areas it is a shim to
detect the BOM and convert the data to UTF-8, and other times the UTF-8
is converted back to UTF-16 as-needed.

> [1] HTML 4.01 is a 13-year old specification which was never fully
> implemented by browsers and is laughably irrelevant for how modern
> browsers actually look at input. The HTML 5 specification, though still
> a draft, is much more grounded in reality, at least as far as how
> browsers are actually going to parse the mangled crap people claim is
> HTML; it was developed, in part, by reverse engineering what browsers
> actually DID and not rely on what an ancient spec said they should do.
>

makes sense.
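A minimal Java sketch of the two-pass idea discussed above, with the BOM check taking priority over any declared encoding. The class name `EncodingSniffer`, the 1024-byte prescan limit, and the `charset=` / `encoding=` pattern are assumptions made for illustration; they are not from the thread and this is not the HTML 5 prescan algorithm. The first pass decodes the prefix as ISO-8859-1, which never fails and leaves ASCII bytes unchanged, so an ASCII-compatible header can still be read before the real encoding is known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingSniffer {
    // Assumption: how many leading bytes to scan for a declaration.
    private static final int PRESCAN_LIMIT = 1024;

    // Assumption: a declaration looks like charset=NAME or encoding="NAME".
    private static final Pattern DECLARED = Pattern.compile(
            "(?i)(?:charset|encoding)\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)");

    /** Guess the encoding: BOM first, then an ASCII-readable declaration, then the fallback. */
    static Charset sniff(byte[] data, Charset fallback) {
        // A BOM, if present, wins over anything the text claims.
        // Only UTF-8 and UTF-16 BOMs are handled in this sketch.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // First pass: read the prefix byte-for-byte and look for a declaration.
        int n = Math.min(data.length, PRESCAN_LIMIT);
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Second pass: decode the whole input "for real" with the sniffed encoding. */
    static String decode(byte[] data, Charset fallback) {
        Charset cs = sniff(data, fallback);
        int skip = bomLength(data);
        return new String(data, skip, data.length - skip, cs);
    }

    private static int bomLength(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return 3;
        if (startsWith(data, 0xFE, 0xFF) || startsWith(data, 0xFF, 0xFE)) return 2;
        return 0;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] page = "<meta charset=\"ISO-8859-1\"><p>caf\u00e9</p>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page, StandardCharsets.UTF_8));
        System.out.println(decode(page, StandardCharsets.UTF_8));
    }
}
```

As noted in the quoted text, this kind of scan only works for ASCII-compatible encodings; something like EBCDIC or UTF-7 would need a separate trial decode, which the sketch does not attempt.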