| From | Lew <lewbloch@gmail.com> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: number of bytes for each (uni)code point while using utf-8 as encoding ... |
| Date | 2012-07-10 12:57 -0700 |
| Organization | http://groups.google.com |
| Message-ID | <69b079ab-0272-46f5-aeb1-42f9fad69d8c@googlegroups.com> |
| References | <1341949507.184816@nntp.aceinnovative.com> |
On Tuesday, July 10, 2012 12:45:07 PM UTC-7, (unknown) wrote:
> > On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> > > How can you get the number of bytes you "get()"?
> >
> > Well, UTF-8 always encodes the same char to the same (number of) bytes,
> > doesn't it?
> ~
> What about files which (the authors) claim to be UTF-8 encoded but aren't, and/or get somehow corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages.
> ~
> Sometimes you must double-check every file you keep in a text bank/corpus, because, through associations, one mistake may propagate and create other kinds of problems.
> ~
> > So you could just build a map char -> size /a priori/.
> ~
> ...
> ~
> > But really, what's the use? ...
> ~
> To you there is none, but I am trying to pinpoint as closely as I possibly can:
> ~
> .onMalformedInput(CodingErrorAction.REPORT);
> .onUnmappableCharacter(CodingErrorAction.REPORT);
> ~
> errors
> ~
> There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction measures.

It isn't the job of the file format to correct errors but of the transmission protocol. Are you saying "quite a few files get corrupted" when reading directly from disk or over some other wire protocol?

If it's from disk, I'd blame the disk drive, not Java. You aren't going to fix a bad disk with good programming.

--
Lew
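For readers following the quoted question, here is a minimal sketch of how the two ideas in the thread might fit together: a strict decoder that reports malformed input instead of silently replacing it, plus the "char -> size" mapping computed per code point. The class name `Utf8Sizes` and its methods are hypothetical names chosen for illustration, not anything proposed by the posters, and `String.codePoints()` assumes Java 8 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class Utf8Sizes {

    /** Number of bytes UTF-8 uses for one code point; fixed by the encoding itself. */
    static int utf8Length(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF
        return 4;                            // U+10000..U+10FFFF
    }

    /** Strict decode: malformed or unmappable input throws rather than being replaced. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** UTF-8 byte count of each code point in an array that decodes cleanly. */
    static int[] sizesOf(byte[] bytes) throws CharacterCodingException {
        return decodeStrict(bytes).codePoints().map(Utf8Sizes::utf8Length).toArray();
    }

    public static void main(String[] args) {
        // 'a', e-acute, euro sign, and one emoji (a surrogate pair in Java)
        byte[] good = "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, 0x28};    // lead byte followed by a non-continuation byte
        try {
            System.out.println(Arrays.toString(sizesOf(good)));  // prints [1, 2, 3, 4]
            sizesOf(bad);                                        // throws
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

A strict decode of this kind only detects byte sequences that are not legal UTF-8. Corruption that happens to yield another legal sequence goes unnoticed, which is consistent with the point above that integrity is the job of the transmission layer (or a checksum), not of the encoding.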
number of bytes for each (uni)code point while using utf-8 as encoding ... lbrt chx _ gemale - 2012-07-10 19:45 +0000
Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Lew <lewbloch@gmail.com> - 2012-07-10 12:57 -0700
Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-07-10 22:42 +0200
Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Lew <lewbloch@gmail.com> - 2012-07-10 14:17 -0700
Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Joshua Cranmer <Pidgeot18@verizon.invalid> - 2012-07-12 00:03 -0400