Re: A proposal to handle file encodings

Path csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!eternal-september.org!feeder.eternal-september.org!mx04.eternal-september.org!.POSTED!not-for-mail
From Joshua Cranmer <Pidgeot18@verizon.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: A proposal to handle file encodings
Date Sun, 25 Nov 2012 16:51:15 -0600
Organization A noiseless patient Spider
Lines 55
Message-ID <k8u7dk$muc$1@dont-email.me> (permalink)
References <lb6ta81u9imfdtlpuesoc8slncju0ehsnm@4ax.com> <k8o50f$1q6$1@news.albasani.net> <9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com> <k8tvkc$h9g$1@news.albasani.net>
Mime-Version 1.0
Content-Type text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding 7bit
Injection-Date Sun, 25 Nov 2012 22:51:32 +0000 (UTC)
Injection-Info mx04.eternal-september.org; posting-host="5a9707252ba5efb9bece56d1f4656a90"; logging-data="23500"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18fsp/VT3+PRaEmDMykHkVqV+YOwQxw9vM="
User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Thunderbird/17.0
In-Reply-To <k8tvkc$h9g$1@news.albasani.net>
Cancel-Lock sha1:ugZrGTkbZWNOF+yJhdVEhknqfNc=
Xref csiph.com comp.lang.java.programmer:19962


On 11/25/2012 2:36 PM, BGB wrote:
> On 11/23/2012 11:02 AM, Roedy Green wrote:
>> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
>> wrote, quoted or indirectly quoted someone who said :
>>
>>>
>>> Would this not cover your requirements?
>>
>> The problem is primarily raw text files with no indication of the
>> encoding.
>>
>> The HTML encoding is incompetent. You can't read it without knowing
>> the encoding. It is just a confirmation. Thankfully the encoding comes
>> in the HTTP header -- a case where meta information is available.
>>
>
> it works as far as most usable encodings have ASCII as a subset, so
> whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
> can still be parsed.

Well, there's also the minor issue that some encodings use the same name 
for slightly (or sometimes greatly) different variants--I think Big5 is 
an offender here in having a few different variants in mapping 
multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are 
both laughably useless, since they pretend that the 8th bit is never set.
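To make the "8th bit" point concrete in Java terms, here is a small sketch (the class and method names are made up for illustration). It assumes the file was really written as ISO-8859-1 but is labelled "ASCII"; the String constructor quietly substitutes U+FFFD for every octet with the high bit set, so the label tells you nothing about what those octets meant.

```java
import java.nio.charset.StandardCharsets;

class AsciiHighBit {
    // Decode octets under the "ASCII" label; high-bit octets cannot be mapped.
    static String decodeAsAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the last octet is 0xE9, which has the 8th bit set.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String s = decodeAsAscii(latin1);
        // The 'é' is replaced by U+FFFD, the replacement character.
        System.out.println(Integer.toHexString(s.charAt(3))); // prints "fffd"
    }
}
```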

> for UTF-16, there is typically the BOM, so if a BOM is seen, assume UTF-16.

In the HTML 5 specification (which is far closer to reality as far as 
HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other 
charset information, including what the HTTP header claims the charset is.
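
A rough Java sketch of that precedence, not the spec's actual algorithm: look at the first few octets, and only if no BOM is found fall back to whatever charset was declared elsewhere. The names `BomSniffer` and `sniff` are hypothetical, and only the UTF-8 and UTF-16 BOMs are checked here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** Returns the charset named by a leading BOM, or the fallback if there is none. */
    static Charset sniff(byte[] head, Charset fallback) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE; // FE FF
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE; // FF FE
        }
        // No BOM: use whatever the header or <meta> declared (an input to this sketch).
        return fallback;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0};
        byte[] plain = {'h', 'i'};
        System.out.println(sniff(utf16le, StandardCharsets.ISO_8859_1)); // UTF-16LE
        System.out.println(sniff(plain, StandardCharsets.ISO_8859_1));   // ISO-8859-1
    }
}
```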

> with some cleverness, it could probably also be extended to support
> EBCDIC, basically just try reading as EBCDIC and see if it "makes sense".

I think EBCDIC is dead as far as web-compatibility is concerned, but the 
HTML 5 spec also specifies that the scanning for the <meta happens by 
looking for the ASCII octets in particular, so any non-ASCII-compatible 
charset (in particular, EBCDIC and UTF-7) is probably in practice 
unusable on the web.
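
To illustrate why, here is a much-simplified Java sketch of an octet-level scan; the real prescan in the spec is considerably more involved, and `MetaCharsetScan`, `declaredCharset`, the regex, and the 1024-octet limit used here are all assumptions of this sketch. The head of the document is mapped octet-for-octet to chars and searched for an ASCII "<meta ... charset=". UTF-16 stands in for the non-ASCII-compatible case because Java guarantees that charset is available.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetScan {
    // Simplified: finds charset=NAME anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_:.-]+)",
            Pattern.CASE_INSENSITIVE);

    static Optional<String> declaredCharset(byte[] head) {
        int n = Math.min(head.length, 1024);
        // ISO-8859-1 maps every octet to exactly one char, so ASCII octets
        // are inspected as-is without assuming the real encoding.
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-8\"></head>";
        // ASCII-compatible encoding: the declaration is found.
        System.out.println(declaredCharset(html.getBytes(StandardCharsets.UTF_8)));
        // UTF-16: a zero octet sits between the ASCII letters, so nothing matches.
        System.out.println(declaredCharset(html.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```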

And, seriously, if you're designing a new format that contains textual 
data, require UTF-8.
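
On the Java side, "require" can be taken literally: a decoder configured to report errors will reject input that is not well-formed UTF-8 instead of silently substituting replacement characters. A minimal sketch, with `StrictUtf8` and `decode` as made-up names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    /** Decodes UTF-8, throwing instead of substituting U+FFFD on bad input. */
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        System.out.println(decode("h\u00e9llo".getBytes(StandardCharsets.UTF_8)));
        try {
            // A lone 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8.
            decode(new byte[] {'h', (byte) 0xE9});
        } catch (CharacterCodingException e) {
            System.out.println("rejected: not well-formed UTF-8");
        }
    }
}
```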

[1] HTML 4.01 is a 13-year-old specification which was never fully 
implemented by browsers and is laughably irrelevant for how modern 
browsers actually look at input. The HTML 5 specification, though still 
a draft, is much more grounded in reality, at least as far as how 
browsers are actually going to parse the mangled crap people claim is 
HTML; it was developed, in part, by reverse engineering what browsers 
actually DID rather than relying on what an ancient spec said they 
should do.

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth
