| From | Joshua Cranmer <Pidgeot18@verizon.invalid> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: number of bytes for each (uni)code point while using utf-8 as encoding ... |
| Date | 2012-07-12 00:03 -0400 |
| Organization | A noiseless patient Spider |
| Message-ID | <jtliab$r4f$1@dont-email.me> |
| References | <1341949507.184816@nntp.aceinnovative.com> |
On 7/10/2012 3:45 PM, lbrt chx _ gemale wrote:
>> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
>
>>> How can you get the number of bytes you "get()"?
>
>> Well, UTF-8 always encodes the same char to the same (number of) bytes,
>> doesn't it?
> ~
> What about files, which (author's) claim to be UTF-8 encoded but they aren't, and/or get somehow corrupted in transit? There are quite a bit of "monkeys" (us) messing with the metadata headers of html pages
> ~
> Sometimes you must double check every file you keep in a text bank/corpus, because, through associations, one mistake may propagate and create other kinds of problems
> ~

I don't see how knowing the char -> length mapping is going to help you in this case. If your input is a blob of bytes which someone claims is UTF-8 but isn't, you can set up decoders to throw an error, or at least to insert the replacement char (U+FFFD), which makes it detectable that someone screwed up.

The problem also is, if it's not UTF-8, what is it then? The heuristics for this kind of stuff are incredibly squirrely, and it more or less turns out that the most reliable way to fix it is to know the default charset of the computer spitting data out at you. Even then, there's still a possibility that its input was screwed up in a similar fashion: I've seen one message undergo the standard I-thought-your-UTF8-was-ISO-8859-1 twice, so that every standard character ended up with 4 gibberish characters.

-- 
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
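For what it's worth, here is a minimal sketch of the two decoder setups mentioned above, using the standard `java.nio.charset` API. The class name `Utf8Check` and its method names are made up for illustration; they are not from the thread. The last helper only imitates the "UTF-8 read as ISO-8859-1" mistake so you can watch one two-byte character turn into four.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Check {

    // Strict decode: any malformed byte sequence raises CharacterCodingException.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient decode: malformed sequences become U+FFFD, which you can search for afterwards.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Imitates the mistake: encode as UTF-8, then wrongly decode those bytes as ISO-8859-1.
    static String misreadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e-acute in ISO-8859-1 but not valid UTF-8 here.
        byte[] claimedUtf8 = {0x63, 0x61, 0x66, (byte) 0xE9};

        try {
            decodeStrict(claimedUtf8);
            System.out.println("input decoded cleanly");
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }

        boolean damaged = decodeLenient(claimedUtf8).indexOf('\uFFFD') >= 0;
        System.out.println("replacement char present: " + damaged);

        // One character, misread twice, ends up as four.
        String twice = misreadAsLatin1(misreadAsLatin1("\u00E9"));
        System.out.println("length after two misreads: " + twice.length());
    }
}
```

Note that this only tells you the bytes are not UTF-8; it does nothing to answer what they actually are.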