From: lbrt chx _ gemale
Newsgroups: comp.lang.java.programmer
Subject: number of bytes for each (uni)code point while using utf-8 as encoding ...
Message-ID: <1341949507.184816@nntp.aceinnovative.com>
Date: 10 Jul 2012 19:45:07 GMT

> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> > How can you get the number of bytes you "get()"?
> Well, UTF-8 always encodes the same char to the same (number of) bytes,
> doesn't it?
~
 What about files whose authors claim they are UTF-8 encoded when they aren't, and/or files that somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
~
 Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems.
~
> So you could just build a map char -> size /a priori/.
~
 ...
~
> But really, what's the use? ...
~
 To you there is none, but I am trying to pinpoint, as closely as I possibly can, the errors reported by:
~
 .onMalformedInput(CodingErrorAction.REPORT);
 .onUnmappableCharacter(CodingErrorAction.REPORT);
~
 There should be a way to get the sizes as you read UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction.
~
 lbrtchx
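
For what it's worth, here is a minimal sketch of one way to get both things at once: the byte size of each code point and the byte offset where a strict decoder first complains. It assumes the whole file fits in a byte array. The names `Utf8Sizes`, `Report`, `scan` and `utf8Length` are made up for illustration; only the `java.nio.charset` calls are real API. The idea is that once a decoder configured with `CodingErrorAction.REPORT` has accepted a sequence, the size can be derived from the code point value, and when it rejects one, the input buffer's position marks where the bad bytes start.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class Utf8Sizes {

    /** Outcome of one scan: sizes of the code points decoded so far, plus error info. */
    static final class Report {
        final List<Integer> sizes = new ArrayList<>(); // bytes per decoded code point
        int errorOffset = -1;                          // byte offset of first bad sequence, -1 if none
        int errorLength = 0;                           // length the decoder reported for it
    }

    /** Bytes UTF-8 needs for a code point that a strict decoder has already accepted. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /** Decodes strictly and stops at the first malformed or unmappable sequence. */
    static Report scan(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never produces more chars than input bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(data.length);

        CoderResult result = decoder.decode(in, out, true);

        Report report = new Report();
        if (result.isError()) {
            // On error the input position is left at the start of the bad sequence.
            report.errorOffset = in.position();
            report.errorLength = result.length();
        }

        // Walk whatever was decoded before the error (or everything, if none).
        out.flip();
        for (int i = 0; i < out.length(); ) {
            int cp = Character.codePointAt(out, i);
            report.sizes.add(utf8Length(cp));
            i += Character.charCount(cp);
        }
        return report;
    }
}
```

Calling `Utf8Sizes.scan(bytes)` on a clean file should leave `errorOffset` at -1; on a damaged one, the sum of `sizes` equals `errorOffset`, which is the spot to look at in a hex dump. This only detects byte sequences that are not legal UTF-8. Corruption that happens to produce other valid sequences passes unnoticed, so a checksum kept alongside the corpus is still needed for that.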