
number of bytes for each (uni)code point while using utf-8 as encoding ...

From	lbrt chx _ gemale
Newsgroups	comp.lang.java.programmer
Subject	number of bytes for each (uni)code point while using utf-8 as encoding ...
Organization	Acecape, Inc.
Organization	Newshosting.com - Highest quality at a great price! www.newshosting.com
Message-ID	<1341949507.184816@nntp.aceinnovative.com> (permalink)
Date	2012-07-10 19:45 +0000


> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:

> >  How can you get the number of bytes you "get()"?

> Well, UTF-8 always encodes the same char to the same (number of) bytes,
> doesn't it?
~ 
 What about files that (their authors) claim are UTF-8 encoded but aren't, and/or that somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
~ 
 Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
~ 
> So you could just build a map char -> size /a priori/.
~ 
 ...
~ 
> But really, what's the use? ...
~ 
 To you there is none, but I am trying to pinpoint, as closely as I possibly can, the:
~ 
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT);
~ 
 errors
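~ 
 To make that concrete, here is a minimal sketch of what I mean by pinpointing. The names Utf8ErrorLocator and firstErrorOffset are made up for illustration, and it assumes the whole input is already in a byte array. With REPORT set, the decoder stops at the offending sequence, so the input buffer's position gives the byte offset of the first error:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8ErrorLocator {

    /**
     * Returns the byte offset of the first malformed sequence,
     * or -1 if the whole array is well-formed UTF-8.
     * (Hypothetical helper, not part of any library.)
     */
    static int firstErrorOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput = true: a truncated trailing sequence counts as malformed
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // decoder stopped at the bad sequence
            }
            if (result.isUnderflow()) {
                return -1;            // all input consumed without error
            }
            out.clear();              // overflow: discard decoded chars, keep going
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x61, 0x62, 0x63, (byte) 0xFF};
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstErrorOffset(bad));  // prints 3
        System.out.println(firstErrorOffset(good)); // prints -1
    }
}
```

 For a real file you would feed the decoder from a FileChannel or a stream in chunks and keep a running count of the bytes already consumed; the sketch skips that to stay short.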
~ 
 There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe the naive mapping you mention is, since those file formats don't have any kind of built-in error-correction measures.
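~ 
 One way to get the sizes, sketched under some assumptions: decode strictly first (so malformed or overlong sequences are rejected rather than silently replaced), then derive each code point's byte count from its value, since well-formed UTF-8 uses exactly one length per code point range. Utf8Sizes, utf8Length and sizesOf are names I made up for illustration, and String.codePoints() needs Java 8 or later:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sizes {

    /** Number of bytes well-formed UTF-8 uses for one code point (1 to 4). */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    /**
     * Strictly decodes the bytes and returns the encoded size of each
     * code point, in order. Throws CharacterCodingException if the
     * input is not well-formed UTF-8.
     */
    static int[] sizesOf(byte[] data) throws CharacterCodingException {
        String text = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
        return text.codePoints().map(Utf8Sizes::utf8Length).toArray();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // 'a', e-acute, euro sign, and one code point outside the BMP
        byte[] data = "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(sizesOf(data))); // prints [1, 2, 3, 4]
    }
}
```

 Because the decode step is strict, the sizes should add up to the length of the input, which gives a cheap sanity check. It does not protect against corruption that happens to produce other valid UTF-8, of course; only a checksum kept alongside the file would catch that.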
~ 
 lbrtchx


