Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

From Joshua Cranmer <Pidgeot18@verizon.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: number of bytes for each (uni)code point while using utf-8 as encoding ...
Date 2012-07-12 00:03 -0400
Organization A noiseless patient Spider
Message-ID <jtliab$r4f$1@dont-email.me>
References <1341949507.184816@nntp.aceinnovative.com>

On 7/10/2012 3:45 PM, lbrt chx _ gemale wrote:
>> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
>
>>>   How can you get the number of bytes you "get()"?
>
>> Well, UTF-8 always encodes the same char to the same (number of) bytes,
>> doesn't it?
> ~
>   What about files, which (author's) claim to be UTF-8 encoded but they aren't, and/or get somehow corrupted in transit? There are quite a bit of "monkeys" (us) messing with the metadata headers of html pages
> ~
>   Sometimes you must double check every file you keep in a text bank/corpus, because, through associations, one mistake may propagate and create other kinds of problems
> ~

I don't see how knowing the char -> length mapping is going to help you
in this case. If your input is a blob of bytes which someone claims is
UTF-8 but isn't, you can set up decoders to throw an error, or at least
to insert the replacement char (U+FFFD), which makes it detectable that
someone screwed up.
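
A minimal sketch of both options in Java, assuming the input is already
in memory as a byte array. The class and method names (StrictUtf8,
decodeStrict, decodeLenient, isValidUtf8) are made up for illustration;
the java.nio.charset calls are the standard ones.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class StrictUtf8 {

    // Strict mode: any malformed byte sequence raises CharacterCodingException.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient mode: new String(byte[], Charset) substitutes U+FFFD for
    // malformed input, so scanning for U+FFFD afterwards reveals the damage.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Convenience wrapper around the strict decoder.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xE9 is e-acute in ISO-8859-1 but an incomplete sequence in UTF-8.
        byte[] claimedUtf8 = { 'a', (byte) 0xE9, 'b' };
        System.out.println("valid UTF-8? " + isValidUtf8(claimedUtf8));
        System.out.println("contains U+FFFD? "
                + (decodeLenient(claimedUtf8).indexOf('\uFFFD') >= 0));
    }
}
```

Note that a U+FFFD found after lenient decoding could also have been in
the original text, so the strict decoder is the less ambiguous check.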

The problem also is, if it's not UTF-8, what is it then?  The heuristics
for this kind of stuff are incredibly squirrelly, and it more or less
turns out that the most reliable way to fix it is to know the default
charset of the computer spitting data out at you. Even then, there's
still a possibility that its input was screwed up in a similar fashion:
I've seen one message undergo the standard
I-thought-your-UTF8-was-ISO-8859-1 mix-up twice, so that every affected
character ended up as 4 gibberish characters.
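
To make the double mix-up concrete, here is a small sketch that
reproduces it for a character that takes two bytes in UTF-8. The class
and method names (Mojibake, misreadAsLatin1, undoMisread) are invented
for this example, and it assumes the wrong charset really was
ISO-8859-1 and not, say, windows-1252.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class Mojibake {

    // Encode as UTF-8, then wrongly decode those bytes as ISO-8859-1.
    static String misreadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    // Reverse one round of the mix-up. This only works while every char
    // is <= U+00FF, which holds for text produced by misreadAsLatin1.
    static String undoMisread(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";                 // e-acute, 2 bytes in UTF-8
        String once = misreadAsLatin1(original);    // 2 chars
        String twice = misreadAsLatin1(once);       // 4 chars
        System.out.println(once.length() + " chars after one round");
        System.out.println(twice.length() + " chars after two rounds");
        System.out.println("repaired: "
                + undoMisread(undoMisread(twice)).equals(original));
    }
}
```

Undoing the damage like this requires knowing both the wrong charset
and how many rounds happened, which is exactly the information you
usually don't have.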

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth


