From: Lew
Newsgroups: comp.lang.java.programmer
Subject: Re: number of bytes for each (uni)code point while using utf-8 as encoding ...
Date: Tue, 10 Jul 2012 14:17:59 -0700 (PDT)

Daniele Futtorovic wrote:
> lbrt chx _ gemale allegedly wrote:
> >
> >>> How can you get the number of bytes you "get()"?
> >
> >> Well, UTF-8 always encodes the same char to the same (number of) bytes,
> >> doesn't it?
> > ~
> > What about files which (their authors) claim to be UTF-8 encoded but
> > aren't, and/or get somehow corrupted in transit? There are quite a few
> > "monkeys" (us) messing with the metadata headers of html pages
> > ~
> > Sometimes you must double check every file you keep in a text
> > bank/corpus, because, through associations, one mistake may propagate
> > and create other kinds of problems
> > ~
> >> So you could just build a map char -> size /a priori/.
> > ~
> > ...
> > ~
> >> But really, what's the use? ...
> > ~
> > to you there is none but I am trying to pinpoint the closest I possibly
> > can:
> > ~
> > .onMalformedInput(CodingErrorAction.REPORT);
> > .onUnmappableCharacter(CodingErrorAction.REPORT);
> > ~
> > errors
> > ~
> > There should be a way to get sizes as you get UTF-8 encoded sequences
> > from a file. Also, I have found that quite a few files get corrupted
> > while in transmission and sometimes I wonder how safe that naive mapping
> > you mention is, since those file formats don't have any kind of built-in
> > error correction measures
>
> And what's that knowledge about the mapping size going to tell you?
>
> Assume the file is corrupted. Then you can't know the original character
> (since it's corrupted). Hence even if you know to how many bytes each
> character maps, you can't tell whether the size you're seeing is wrong
> or right.
>
> At least that's how it seems to me.
>
> Even the malformedness is no reliable indicator. Your data might get
> corrupted and the outcome be well-formed, as far as the character
> encoding is concerned.
>
> I have to agree with Lew. Only the transmission layer can reliably
> tackle this problem. Just pass a checksum and be done with it.

Even the file being corrupt has no bearing on the correctness of the Java
code. The file itself may actually be corrupt while the Java code works
perfectly.

-- 
Lew
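
To make the points in the thread concrete, here is a minimal sketch, not code from any poster. The class and method names (Utf8Check, utf8Length, decodeStrict, crc32) are made up for illustration, and CRC-32 merely stands in for whatever checksum the transmission layer would really carry. It shows three things: the UTF-8 byte count is fixed by the code point's range, a decoder set to CodingErrorAction.REPORT rejects malformed input, and a corruption that happens to stay well-formed slips past the decoder but not past the checksum.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical helper class; none of these names come from the thread.
class Utf8Check {

    // Bytes UTF-8 needs for one code point: fixed by the code point's range.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not encodable in UTF-8: " + codePoint);
        }
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Strict decode: malformed input raises an exception instead of
    // being silently replaced with U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Assumption: CRC-32 as a stand-in for the sender's checksum.
    static long crc32(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    public static void main(String[] args) throws Exception {
        byte[] sent = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // 63 61 66 C3 A9
        long checksum = crc32(sent);

        // Damage that stays well-formed: one bit flips, 'c' (0x63) becomes 'b' (0x62).
        byte[] flipped = sent.clone();
        flipped[0] = 0x62;
        System.out.println(decodeStrict(flipped).length()); // decodes without complaint
        System.out.println(crc32(flipped) == checksum);     // false: the checksum notices

        // Damage that breaks the encoding: the continuation byte is lost.
        byte[] truncated = Arrays.copyOf(sent, sent.length - 1);
        try {
            decodeStrict(truncated);
        } catch (CharacterCodingException e) {
            System.out.println("malformed: " + e);
        }
    }
}
```

The strict decoder only answers "is this byte sequence valid UTF-8?". Whether the bytes are the ones that were sent is a separate question, and only the comparison against a checksum computed by the sender answers it.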