| From | lbrt chx _ gemale |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | number of bytes for each (uni)code point while using utf-8 as encoding ... |
| Organization | Acecape, Inc. |
| Organization | Newshosting.com - Highest quality at a great price! www.newshosting.com |
| Message-ID | <1341915690.235464@nntp.aceinnovative.com> (permalink) |
| Date | 2012-07-10 10:21 +0000 |
~
You can iterate through all the Unicode code points in a file encoded as UTF-8 (or any other encoding) like this:
~
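import java.io.File;
import java.io.FileInputStream;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
...
// strict decoder: report malformed or unmappable input instead of replacing it
String aOEnc = "UTF-8";
Charset InChrSt = Charset.forName(aOEnc);
CharsetDecoder InDec = InChrSt.newDecoder();
InDec.onMalformedInput(CodingErrorAction.REPORT);
InDec.onUnmappableCharacter(CodingErrorAction.REPORT);
// map the whole file into memory and decode it in one go
FileInputStream FIS = new FileInputStream(new File(<file path as string>));
FileChannel IFlChnl = FIS.getChannel();
MappedByteBuffer MptBytBfr = IFlChnl.map(FileChannel.MapMode.READ_ONLY, 0, IFlChnl.size());
CharBuffer MptChrBfr = InDec.decode(MptBytBfr);
// length() is the number of chars still remaining, so it shrinks with every get();
// test hasRemaining() instead of counting up to length()
while (MptChrBfr.hasRemaining()){
  MptChrBfr.get();
}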
...
~
Each time you get() a Unicode code point from the buffer, it should correspond to 1 to 4 bytes in the file, and the sum of all these "lengths" should equal the file length in bytes, right?
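~
To make the arithmetic concrete, here is a minimal sketch of what I mean by "lengths". It does not ask the decoder anything; it only derives the byte count from the code point value, assuming the text came from well-formed UTF-8 (which the REPORT actions above should guarantee). The class and method names (Utf8Lengths, utf8Length, totalUtf8Length) are made up for illustration and are not part of any API.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {

    // Number of bytes UTF-8 uses for one code point (hypothetical helper).
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    // Walks the text code point by code point and sums the per-code-point byte counts.
    static long totalUtf8Length(CharSequence text) {
        long total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            total += utf8Length(cp);
            i += Character.charCount(cp);   // a supplementary code point occupies 2 chars
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample text: U+0041, U+00E9, U+20AC, U+1F600 (the last one is a surrogate pair)
        String sample = "A\u00E9\u20AC\uD83D\uDE00";
        for (int i = 0; i < sample.length(); ) {
            int cp = Character.codePointAt(sample, i);
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
            i += Character.charCount(cp);
        }
        // Cross-check the sum against what the encoder actually produces.
        System.out.println(totalUtf8Length(sample)
                == sample.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Note that a single get() on a CharBuffer hands back one char, so a code point outside the BMP takes two get() calls; that is why the sketch steps by Character.charCount() rather than by 1.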
~
I am using the (new) NIO in Java 7, and I wonder whether Sun made changes that make it hard to get the number of bytes a Unicode code point needs.
~
How can you find out how many bytes each get() actually consumed?
~
thank you
lbrtchx