Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!216.196.98.144.MISMATCH!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!nntp.giganews.com!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
From: Lew <lewbloch@gmail.com>
Newsgroups: comp.lang.java.programmer
Subject: Re: number of bytes for each (uni)code point while using utf-8 as encoding ...
Date: Tue, 10 Jul 2012 14:17:59 -0700 (PDT)
Organization: http://groups.google.com
Lines: 64
Message-ID: <d18b8ea9-1ec7-4098-9b77-eff3500bc14f@googlegroups.com>
References: <1341949507.184816@nntp.aceinnovative.com> <jti43n$hpr$1@dont-email.me>
NNTP-Posting-Host: 69.28.149.29
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Trace: posting.google.com 1341955191 20235 127.0.0.1 (10 Jul 2012 21:19:51 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Tue, 10 Jul 2012 21:19:51 +0000 (UTC)
In-Reply-To: <jti43n$hpr$1@dont-email.me>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=69.28.149.29; posting-account=CP-lKQoAAAAGtB5diOuGlDQk0jIwmH0T
User-Agent: G2/1.0
Xref: csiph.com comp.lang.java.programmer:15929

Daniele Futtorovic wrote:
> lbrt chx _ gemale allegedly wrote:
> >
> >>>  How can you get the number of bytes you "get()"?
> >
> >> Well, UTF-8 always encodes the same char to the same (number of) bytes,
> >> doesn't it?
> > ~
> >  What about files, which (author's) claim to be UTF-8 encoded but
> >  they aren't, and/or get somehow corrupted in transit? There are
> >  quite a bit of "monkeys" (us) messing with the metadata headers of
> >  html pages
> > ~
> >  Sometimes you must double check every file you keep in a text
> >  bank/corpus, because, through associations, one mistake may
> >  propagate and create other kinds of problems
> > ~
> >> So you could just build a map char -> size /a priori/.
> > ~
> >  ...
> > ~
> >> But really, what's the use? ...
> > ~
> >  to you there is none but I am trying to pinpoint the closest I
> >  possibly can:
> > ~
> >   .onMalformedInput(CodingErrorAction.REPORT);
> >   .onUnmappableCharacter(CodingErrorAction.REPORT);
> > ~
> >  errors
> > ~
> >  There should be a way to get sizes as you get UTF-8 encoded
> >  sequences from a file. Also, I have found that quite a few files
> >  get corrupted in transmission, and sometimes I wonder how safe that
> >  naive mapping you mention is, since those file formats don't have
> >  any kind of built-in error correction measures
>
> And what's that knowledge about the mapping size going to tell you?
>
> Assume the file is corrupted. Then you can't know the original character
> (since it's corrupted). Hence even if you know to how many bytes each
> character maps, you can't tell whether the size you're seeing is wrong
> or right.
>
> At least that's how it seems to me.
>
> Even the malformedness is no reliable indicator. Your data might get
> corrupted and the outcome be well-formed, as far as the character
> encoding is concerned.
>
> I have to agree with Lew. Only the transmission layer can reliably
> tackle this problem. Just pass a checksum and be done with it.
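
To make the quoted point concrete, here is a minimal sketch of corrupted data that is still well-formed UTF-8 and is caught only by a checksum. The class and method names (ChecksumDemo, crc32, isWellFormedUtf8) are made up for illustration, and CRC-32 is only an assumption; the thread does not say which checksum to use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical demo class; not taken from the thread.
class ChecksumDemo {

    /** CRC-32 of the raw bytes, as a sender might compute it before transmission. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Returns true if the bytes decode as strictly well-formed UTF-8. */
    static boolean isWellFormedUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] sent = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] received = sent.clone();
        received[0] = 'k'; // simulated corruption: one ASCII byte replaced by another

        // The decoder has no complaint, because the damaged bytes are still valid UTF-8.
        System.out.println("well-formed: " + isWellFormedUtf8(received));
        // Only the comparison against the sender's checksum reveals the damage.
        System.out.println("checksum matches: " + (crc32(sent) == crc32(received)));
    }
}
```

The decoder can only say whether the bytes form legal UTF-8, not whether they are the bytes that were sent.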

Whether the file is corrupt has no bearing on the correctness of the Java
code. The file itself may actually be corrupt while the Java code works
perfectly.
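
For what it's worth, here is a rough sketch of how one might get the per-code-point byte counts the OP asked about, using the strict decoder settings quoted upthread. Utf8Sizes, utf8Length and bytesPerCodePoint are names I made up; nothing here comes from the OP's code. Correct code of this kind rejects malformed input with an exception, which is the code working, not failing.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; not taken from the thread.
class Utf8Sizes {

    /** Number of bytes UTF-8 uses for a valid Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    /**
     * Strictly decodes the bytes as UTF-8 and returns the encoded size of
     * each code point, in order. Malformed input is reported by an
     * exception instead of being silently replaced.
     */
    static int[] bytesPerCodePoint(byte[] data) throws CharacterCodingException {
        CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
        String text = chars.toString();
        int[] sizes = new int[text.codePointCount(0, text.length())];
        int n = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            sizes[n++] = utf8Length(cp);
            i += Character.charCount(cp); // skip both halves of a surrogate pair
        }
        return sizes;
    }

    public static void main(String[] args) throws CharacterCodingException {
        // 'a', e-acute, euro sign, and one supplementary character (U+1F600)
        byte[] good = "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bytesPerCodePoint(good)));

        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // lead byte without a continuation byte
        try {
            bytesPerCodePoint(bad);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

Note that the sizes are derived from the decoded code points, so they only tell you what well-formed UTF-8 would look like. They say nothing about whether the bytes are the ones the author wrote.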

-- 
Lew