Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

From	Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid>
Newsgroups	comp.lang.java.programmer
Subject	Re: number of bytes for each (uni)code point while using utf-8 as encoding ...
Date	2012-07-10 22:42 +0200
Organization	A noiseless patient Spider
Message-ID	<jti43n$hpr$1@dont-email.me>
References	<1341949507.184816@nntp.aceinnovative.com>

On 10/07/2012 21:45, lbrt chx _ gemale allegedly wrote:
>> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> 
>>>  How can you get the number of bytes you "get()"?
> 
>> Well, UTF-8 always encodes the same char to the same (number of) bytes,
>> doesn't it?
> ~ 
>  What about files that (their authors) claim to be UTF-8 encoded but aren't, and/or that somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
> ~ 
>  Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
> ~ 
>> So you could just build a map char -> size /a priori/.
> ~ 
>  ...
> ~ 
>> But really, what's the use? ...
> ~ 
>  To you there is none, but I am trying to pinpoint as closely as I possibly can the:
> ~ 
>   .onMalformedInput(CodingErrorAction.REPORT);
>   .onUnmappableCharacter(CodingErrorAction.REPORT);
> ~ 
>  errors
> ~ 
>  There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction measures.

And what's that knowledge about the mapping size going to tell you?

Assume the file is corrupted. Then you can't know the original character
(since it's corrupted). Hence, even if you know how many bytes each
character maps to, you can't tell whether the size you're seeing is
right or wrong.

At least that's how it seems to me.
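
For reference, the a priori char -> size mapping doesn't even need a
table, since the UTF-8 length follows from the code point's range. A
minimal sketch (the class and method names are mine, purely for
illustration):

```java
class Utf8Lengths {
    // Number of bytes UTF-8 uses to encode a single Unicode code point.
    // Hypothetical helper, not part of any library.
    static int utf8Length(int codePoint) {
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0 || codePoint > 0x10FFFF || surrogate) {
            throw new IllegalArgumentException(
                    "not an encodable code point: " + codePoint);
        }
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    public static void main(String[] args) {
        // 'a', e-acute, euro sign, and one supplementary character
        String s = "a\u00e9\u20ac\uD83D\uDE00";
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp)));
    }
}
```

Note that this only tells you what a correct encoding would look like.
It says nothing about whether the bytes you actually received are the
ones that were sent.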

Even malformedness is no reliable indicator. Your data might get
corrupted and the result still be well-formed, as far as the character
encoding is concerned.
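
To illustrate, here is a rough sketch using the same REPORT settings you
quoted. A single flipped bit turns "cat" into "bat", which is perfectly
valid UTF-8, so the strict decoder has nothing to complain about. The
class and method names are made up for the example, and wrapping the
checked exception is just my choice to keep it short:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SilentCorruption {
    // Strict decode: fails on malformed or unmappable input instead of
    // silently substituting a replacement character.
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not well-formed UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "cat".getBytes(StandardCharsets.UTF_8);
        data[0] ^= 0x01; // one flipped bit: 0x63 'c' becomes 0x62 'b'
        System.out.println(decodeStrict(data)); // prints "bat", no error

        // A corruption that does break the encoding: a lead byte (0xC3)
        // followed by something that is not a continuation byte.
        byte[] broken = { (byte) 0xC3, 0x29 };
        try {
            decodeStrict(broken);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

So the decoder catches the second kind of damage but is blind to the
first.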

I have to agree with Lew. Only the transmission layer can reliably
tackle this problem. Just pass a checksum and be done with it.
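
Something along these lines, as a minimal sketch. It assumes the sender
can ship the checksum alongside the file; the class and method names are
hypothetical, and CRC32 is merely the simplest thing in the JDK:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class TransferCheck {
    // CRC-32 of the raw bytes, computed before any character decoding.
    static long crc32(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    // True if the received bytes match the checksum the sender computed.
    static boolean arrivedIntact(byte[] received, long expectedCrc) {
        return crc32(received) == expectedCrc;
    }

    public static void main(String[] args) {
        byte[] sent = "cat".getBytes(StandardCharsets.UTF_8);
        long checksum = crc32(sent); // transmitted together with the data

        byte[] received = sent.clone();
        received[0] ^= 0x01; // same single-bit corruption: "cat" -> "bat"

        System.out.println(arrivedIntact(sent, checksum));     // true
        System.out.println(arrivedIntact(received, checksum)); // false
    }
}
```

A CRC only guards against accidental damage. If deliberate tampering is
a concern, a cryptographic digest (e.g. java.security.MessageDigest
with "SHA-256") would be the thing to use instead.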

-- 
DF.
