Groups > comp.lang.java.programmer > #15924 > unrolled thread

number of bytes for each (uni)code point while using utf-8 as encoding ...

Started by	lbrt chx _ gemale
First post	2012-07-10 19:45 +0000
Last post	2012-07-12 00:03 -0400
Articles	5 — 4 participants

  number of bytes for each (uni)code point while using utf-8 as encoding ... lbrt chx _ gemale - 2012-07-10 19:45 +0000
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Lew <lewbloch@gmail.com> - 2012-07-10 12:57 -0700
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-07-10 22:42 +0200
      Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Lew <lewbloch@gmail.com> - 2012-07-10 14:17 -0700
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-07-12 00:03 -0400

#15924 — number of bytes for each (uni)code point while using utf-8 as encoding ...

From	lbrt chx _ gemale
Date	2012-07-10 19:45 +0000
Subject	number of bytes for each (uni)code point while using utf-8 as encoding ...
Message-ID	<1341949507.184816@nntp.aceinnovative.com>

> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:

> >  How can you get the number of bytes you "get()"?

> Well, UTF-8 always encodes the same char to the same (number of) bytes,
> doesn't it?
~ 
 What about files which their authors claim to be UTF-8 encoded but which aren't, and/or which somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
~ 
 Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
~ 
> So you could just build a map char -> size /a priori/.
~ 
 ...
~ 
> But really, what's the use? ...
~ 
 To you there is none, but I am trying to pinpoint as closely as I possibly can:
~ 
  .onMalformedInput(CodingErrorAction.REPORT);
  .onUnmappableCharacter(CodingErrorAction.REPORT);
~ 
 errors
~ 
 There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction.
~ 
 lbrtchx
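
A minimal sketch of the "map char -> size" idea raised in this post, assuming Java 8 or later (for String.codePoints()). The class name Utf8Lengths and the method utf8Length are hypothetical names chosen for illustration. The ranges follow the UTF-8 encoding rules, so the size depends only on the code point and not on the file being read.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    /** Number of bytes UTF-8 uses for one Unicode code point (surrogates are rejected). */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + codePoint);
        }
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;     // e.g. Latin letters with diacritics
        if (codePoint < 0x10000) return 3;   // rest of the Basic Multilingual Plane
        return 4;                            // supplementary planes
    }

    public static void main(String[] args) {
        String text = "a\u00E9\u20AC\uD83D\uDE00"; // U+0061, U+00E9, U+20AC, U+1F600
        text.codePoints().forEach(cp ->
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp)));

        // Cross-check the per-code-point sizes against the JDK's own encoder.
        int total = text.codePoints().map(Utf8Lengths::utf8Length).sum();
        System.out.println(total == text.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Such a table only describes what a correct encoder produces; by itself it says nothing about whether the bytes in a particular file are well-formed.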

#15926

From	Lew <lewbloch@gmail.com>
Date	2012-07-10 12:57 -0700
Message-ID	<69b079ab-0272-46f5-aeb1-42f9fad69d8c@googlegroups.com>
In reply to	#15924

On Tuesday, July 10, 2012 12:45:07 PM UTC-7, (unknown) wrote:
> > On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> 
> > >  How can you get the number of bytes you "get()"?
> 
> > Well, UTF-8 always encodes the same char to the same (number of) bytes,
> > doesn't it?
> ~
>  What about files which their authors claim to be UTF-8 encoded but which aren't, and/or which somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
> ~
>  Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
> ~
> > So you could just build a map char -> size /a priori/.
> ~
>  ...
> ~
> > But really, what's the use? ...
> ~
>  To you there is none, but I am trying to pinpoint as closely as I possibly can:
> ~
>   .onMalformedInput(CodingErrorAction.REPORT);
>   .onUnmappableCharacter(CodingErrorAction.REPORT);
> ~
>  errors
> ~
>  There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction.

It isn't the job of the file format to correct errors but of the transmission protocol.

Are you saying "quite a few files get corrupted" when reading directly from disk 
or over some other wire protocol? If it's from disk, I'd blame the disk drive not 
Java.

You aren't going to fix a bad disk with good programming.

-- 
Lew

#15927

From	Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid>
Date	2012-07-10 22:42 +0200
Message-ID	<jti43n$hpr$1@dont-email.me>
In reply to	#15924

On 10/07/2012 21:45, lbrt chx _ gemale allegedly wrote:
>> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> 
>>>  How can you get the number of bytes you "get()"?
> 
>> Well, UTF-8 always encodes the same char to the same (number of) bytes,
>> doesn't it?
> ~ 
>  What about files which their authors claim to be UTF-8 encoded but which aren't, and/or which somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
> ~ 
>  Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
> ~ 
>> So you could just build a map char -> size /a priori/.
> ~ 
>  ...
> ~ 
>> But really, what's the use? ...
> ~ 
>  To you there is none, but I am trying to pinpoint as closely as I possibly can:
> ~ 
>   .onMalformedInput(CodingErrorAction.REPORT);
>   .onUnmappableCharacter(CodingErrorAction.REPORT);
> ~ 
>  errors
> ~ 
>  There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction.

And what's that knowledge about the mapping size going to tell you?

Assume the file is corrupted. Then you can't know the original character
(since it's corrupted). Hence even if you know to how many bytes each
character maps, you can't tell whether the size you're seeing is wrong
or right.

At least that's how it seems to me.

Even the malformedness is no reliable indicator. Your data might get
corrupted and the outcome be well-formed, as far as the character
encoding is concerned.

I have to agree with Lew. Only the transmission layer can reliably
tackle this problem. Just pass a checksum and be done with it.
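
A small sketch of why a checksum catches what the decoder cannot, assuming the sender can transmit a CRC-32 value alongside the data. The class and method names (ChecksumDemo, crc32, isWellFormedUtf8) are hypothetical; java.util.zip.CRC32 and the strict decoder configuration are standard JDK APIs. A single flipped bit turns one ASCII letter into another, so the bytes remain valid UTF-8 while the checksum changes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    /** CRC-32 of the raw bytes, computed before any decoding takes place. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** True if the bytes decode as strict UTF-8 (errors are reported, not replaced). */
    static boolean isWellFormedUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] sent = "caf\u00E9 au lait".getBytes(StandardCharsets.UTF_8);
        long expected = crc32(sent);   // the sender would transmit this with the data

        byte[] received = sent.clone();
        received[0] ^= 0x02;           // simulated bit flip: 'c' (0x63) becomes 'a' (0x61)

        // The corrupted bytes are still well-formed UTF-8 ...
        System.out.println("well-formed: " + isWellFormedUtf8(received));
        // ... but the checksum comparison exposes the corruption.
        System.out.println("checksum ok: " + (crc32(received) == expected));
    }
}
```

CRC-32 is used here only because it ships with the JDK; a cryptographic digest via java.security.MessageDigest would follow the same pattern.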

-- 
DF.

#15929

From	Lew <lewbloch@gmail.com>
Date	2012-07-10 14:17 -0700
Message-ID	<d18b8ea9-1ec7-4098-9b77-eff3500bc14f@googlegroups.com>
In reply to	#15927

Daniele Futtorovic wrote:
> lbrt chx _ gemale allegedly wrote:
> >> lbrt chx _ gemale allegedly wrote:
> > 
> >>>  How can you get the number of bytes you "get()"?
> > 
> >> Well, UTF-8 always encodes the same char to the same (number of) bytes,
> >> doesn't it?
> > ~
> >  What about files which their authors claim to be UTF-8 encoded but which aren't, and/or which somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
> > ~
> >  Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
> > ~
> >> So you could just build a map char -> size /a priori/.
> > ~
> >  ...
> > ~
> >> But really, what's the use? ...
> > ~
> >  To you there is none, but I am trying to pinpoint as closely as I possibly can:
> > ~
> >   .onMalformedInput(CodingErrorAction.REPORT);
> >   .onUnmappableCharacter(CodingErrorAction.REPORT);
> > ~
> >  errors
> > ~
> >  There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction.
> 
> And what's that knowledge about the mapping size going to tell you?
> 
> Assume the file is corrupted. Then you can't know the original character
> (since it's corrupted). Hence even if you know to how many bytes each
> character maps, you can't tell whether the size you're seeing is wrong
> or right.
> 
> At least that's how it seems to me.
> 
> Even the malformedness is no reliable indicator. Your data might get
> corrupted and the outcome be well-formed, as far as the character
> encoding is concerned.
> 
> I have to agree with Lew. Only the transmission layer can reliably
> tackle this problem. Just pass a checksum and be done with it.

Even a corrupt file has no bearing on the correctness of the Java 
code. The file itself may actually be corrupt while the Java code is 
working perfectly.

-- 
Lew

#15972

From	Joshua Cranmer <Pidgeot18@verizon.invalid>
Date	2012-07-12 00:03 -0400
Message-ID	<jtliab$r4f$1@dont-email.me>
In reply to	#15924

On 7/10/2012 3:45 PM, lbrt chx _ gemale wrote:
>> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
>
>>>   How can you get the number of bytes you "get()"?
>
>> Well, UTF-8 always encodes the same char to the same (number of) bytes,
>> doesn't it?
> ~
>   What about files which their authors claim to be UTF-8 encoded but which aren't, and/or which somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
> ~
>   Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
> ~

I don't see how knowing the char -> length mapping is going to help you 
in this case. If your input is a blob of bytes which someone claims is 
UTF-8 but isn't, you can set up decoders to throw an error, or at least 
to insert the replacement char (U+FFFD), which makes it detectable that 
someone screwed up.
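
A hedged sketch of the "throw an error" option, using the lower-level CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) call so that the position of the first bad sequence can be reported. The class name Utf8Validator and the method firstMalformedOffset are hypothetical. The sketch relies on the documented behaviour that, for a malformed-input result, the malformed bytes begin at the input buffer's position.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    /**
     * Returns the byte offset of the first malformed UTF-8 sequence,
     * or -1 if the whole array decodes cleanly.
     */
    static int firstMalformedOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The offending bytes start at the input buffer's current position;
                // result.length() would give how many bytes are affected.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1;   // all input consumed without an error
            }
            out.clear();     // overflow: discard decoded chars and keep going
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'a', 'b', (byte) 0xFF, 'c'};   // 0xFF never occurs in UTF-8
        System.out.println(firstMalformedOffset(good));
        System.out.println(firstMalformedOffset(bad));
    }
}
```

For the replacement-character option, new String(bytes, StandardCharsets.UTF_8) substitutes the charset's default replacement for malformed input instead of throwing, so searching the result for U+FFFD flags the problem without locating the exact byte.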

The problem also is, if it's not UTF-8, what is it then?  The heuristics 
for this kind of stuff are incredibly squirrelly, and it more or less turns 
out that the most reliable way to fix it is to know the default charset 
of the computer spitting data out at you. Even then, there's still a 
possibility that its input was screwed up in a similar fashion: I've 
seen one message undergo the standard I-thought-your-UTF8-was-ISO-8859-1 
twice, so that every standard character ended up with 4 gibberish 
characters.
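
The double misreading can be reproduced in a few lines. This is an illustrative sketch with hypothetical names (DoubleMojibake, misreadAsLatin1, repairOnce), taking a two-byte character such as U+00E9 as the example; it assumes the faulty hop really used ISO-8859-1 and not a close relative such as windows-1252.

```java
import java.nio.charset.StandardCharsets;

public class DoubleMojibake {
    /** Encodes as UTF-8, then wrongly decodes the bytes as ISO-8859-1. */
    static String misreadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Undoes one round of the misreading above. */
    static String repairOnce(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E9";                 // one character, two bytes in UTF-8
        String once = misreadAsLatin1(original);    // two characters
        String twice = misreadAsLatin1(once);       // four characters

        System.out.println(once.length() + " then " + twice.length());

        // Because ISO-8859-1 maps every byte to a character, the damage is reversible
        // as long as the number of rounds is known.
        System.out.println(repairOnce(repairOnce(twice)).equals(original));
    }
}
```

The repair works only because the charset of the faulty hop and the number of rounds are known here, which is exactly the information that is usually missing in practice.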

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth
