


Re: A proposal to handle file encodings

From BGB <cr88192@hotmail.com>
Newsgroups comp.lang.java.programmer
Subject Re: A proposal to handle file encodings
Date 2012-11-25 17:54 -0600
Organization albasani.net
Message-ID <k8ub8s$97s$1@news.albasani.net>
References <lb6ta81u9imfdtlpuesoc8slncju0ehsnm@4ax.com> <k8o50f$1q6$1@news.albasani.net> <9kava8lk1ignppq7rso7gmcb541gnerf8q@4ax.com> <k8tvkc$h9g$1@news.albasani.net> <k8u7dk$muc$1@dont-email.me>



On 11/25/2012 4:51 PM, Joshua Cranmer wrote:
> On 11/25/2012 2:36 PM, BGB wrote:
>> On 11/23/2012 11:02 AM, Roedy Green wrote:
>>> On Fri, 23 Nov 2012 16:33:40 +0100, Jan Burse <janburse@fastmail.fm>
>>> wrote, quoted or indirectly quoted someone who said :
>>>
>>>>
>>>> Would this not cover your requirements?
>>>
>>> The problem is primarily raw text files with no indication of the
>>> encoding.
>>>
>>> The HTML encoding is incompetent. You can't read it without knowing
>>> the encoding. It is just a confirmation. Thankfully the encoding comes
>>> in the HTTP header -- a case where meta information is available.
>>>
>>
>> it works as far as most usable encodings have ASCII as a subset, so
>> whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
>> can still be parsed.
>
> Well, there's also the minor issue that some encodings use the same name
> for slightly (or sometimes greatly) different variants--I think Big5 is
> an offender here in having a few different variants in mapping
> multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
> both laughably useless, since they pretend that the 8th bit is never set.
>

well, you only need to read far enough to parse the header, then you can
re-read the file in the declared encoding, if it differs.

example:
assume ASCII, try to read header;
see that encoding says UTF-8 or 8859-1 or KOI8-R or whatever else;
reset, read again, "for real this time".
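
a minimal Java sketch of the two-pass read described above, under some assumptions: the class name HeaderSniff, the 1024-byte prefix limit, and the regex are all made up for illustration, and the declaration is assumed to look like "charset=NAME" somewhere near the start of the file. decoding the prefix as US-ASCII does not throw in Java; bytes above 0x7F just become U+FFFD, which is harmless when only the ASCII declaration is wanted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderSniff {
    // Hypothetical limit: only this many leading bytes are scanned.
    private static final int PREFIX_LIMIT = 1024;

    // Assumed declaration shape: charset=NAME, optionally quoted.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Pass 1: assume ASCII and look for a declared encoding. */
    static Charset sniff(byte[] data, Charset fallback) {
        int n = Math.min(data.length, PREFIX_LIMIT);
        // Non-ASCII bytes decode to U+FFFD here; nothing is thrown.
        String prefix = new String(data, 0, n, StandardCharsets.US_ASCII);
        Matcher m = CHARSET.matcher(prefix);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    /** Pass 2: reset and decode everything "for real this time". */
    static String decode(byte[] data, Charset fallback) {
        return new String(data, sniff(data, fallback));
    }
}
```

this only works for encodings that keep ASCII as a subset, which is the premise of the approach; UTF-16 input would need a BOM check first.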


>> for UTF-16, there is typically the BOM, so if a BOM is seen, assume
>> UTF-16.
>
> In the HTML 5 specification (which is far closer to reality as far as
> HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other
> charset information, including what HTML claims the header is.
>

well, yes, partly. if you ignore the BOM and assume ASCII or 8859-1 or 
similar, then the document can't be parsed.
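
a rough sketch of checking for a BOM before trying anything else, in Java. the class and method names (BomCheck, fromBom) are hypothetical, and it only knows the UTF-8 and UTF-16 marks; UTF-32 is deliberately ignored here, so a UTF-32LE file (FF FE 00 00) would be misreported as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomCheck {
    /** Returns the charset implied by a leading BOM, or null if none is found. */
    static Charset fromBom(byte[] d) {
        // EF BB BF: UTF-8 signature
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF
                && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // FE FF: UTF-16, big-endian
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // FF FE: UTF-16, little-endian
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }
}
```

a null result would mean "no BOM, go look at the declared encoding instead".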


>> with some cleverness, it could probably also be extended to support
>> EBCDIC, basically just try reading as EBCDIC and see if it "makes
>> sense".
>
> I think EBCDIC is dead as far as web-compatibility is concerned, but the
> HTML 5 spec also specifies that the scanning for the <meta happens by
> looking for the ASCII octets in particular, so any non-ASCII-compatible
> charset (in particular, EBCDIC and UTF-7) is probably in practice
> unusable on the web.
>

pretty much, but not theoretically impossible at least.


> And, seriously, if you're designing a new format that contains textual
> data, require UTF-8.
>

this is pretty much what I do.
though it is not always clear cut whether something is plain ASCII or
UTF-8, but this can be glossed over:
if it is textual, it is meant to be UTF-8, and falling short of this is
an implementation issue.

I sometimes support UTF-16, but usually in these areas it is just a shim
which detects the BOM and converts the data to UTF-8; other times the
UTF-8 is converted back to UTF-16 as needed.
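
as an illustration only (not the actual code in question), such a shim could look something like this in Java. the names Utf16Shim, toUtf8, and toUtf16 are made up. it relies on the JDK's "UTF-16" charset, which on decoding uses the BOM to pick the byte order and drops it, and on encoding writes big-endian with a BOM.

```java
import java.nio.charset.StandardCharsets;

class Utf16Shim {
    /** If the data starts with a UTF-16 BOM, convert it to UTF-8; otherwise pass it through. */
    static byte[] toUtf8(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                // The UTF-16 decoder consumes the BOM and honors its byte order.
                String text = new String(data, StandardCharsets.UTF_16);
                return text.getBytes(StandardCharsets.UTF_8);
            }
        }
        // No BOM: assumed to already be UTF-8 (or plain ASCII).
        return data;
    }

    /** Convert UTF-8 back to UTF-16 (big-endian, BOM included) when needed. */
    static byte[] toUtf16(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16);
    }
}
```

everything past the shim then only ever has to deal with UTF-8.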


> [1] HTML 4.01 is a 13-year old specification which was never fully
> implemented by browsers and is laughably irrelevant for how modern
> browsers actually look at input. The HTML 5 specification, though still
> a draft, is much more grounded in reality, at least as far as how
> browsers are actually going to parse the mangled crap people claim is
> HTML; it was developed, in part, by reverse engineering what browsers
> actually DID and not rely on what an ancient spec said they should do.
>

makes sense.



