| From | Jason Bailey <Jason.Bailey@sas.com> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: number of bytes for each (uni)code point while using utf-8 as encoding ... |
| Date | 2012-07-12 10:43 -0400 |
| Organization | SAS Inc. |
| Message-ID | <jtmnpl$m0i$1@foggy.unx.sas.com> |
| References | <1341915690.235464@nntp.aceinnovative.com> |
There's an incorrect assumption here. CharBuffer.get returns a char, and a
char is not a codepoint. It is a single 16-bit UTF-16 code unit, which
takes 1 to 3 bytes once encoded as UTF-8. Codepoints above U+FFFF need 2
chars (a surrogate pair) and take 4 bytes in UTF-8.
If you want to determine whether your char needs more than one byte in
UTF-8, just do a comparison
boolean isMultiByte = (MptChrBfr.get() >> 7) > 0;
Here you're taking the char and bit shifting it right seven times. If there
are any bits left, the char is above U+007F and needs at least two bytes
in UTF-8. A non-surrogate char above U+07FF needs three.
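To make the range check concrete, here is a minimal sketch of a per-char UTF-8 width lookup. The class and method names are made up for illustration, and counting each surrogate half as 2 bytes (so that a pair sums to the 4 bytes its codepoint takes) is just a convention I picked, not something the JDK does for you.

```java
// Hypothetical helper, not part of the JDK.
final class Utf8CharWidth {
    /**
     * Number of UTF-8 bytes attributable to one UTF-16 char.
     * Assumption: each half of a surrogate pair is counted as 2,
     * so a well-formed pair adds up to 4.
     */
    static int utf8Bytes(char c) {
        if (c < 0x80) {
            return 1;               // U+0000..U+007F (ASCII)
        }
        if (c < 0x800) {
            return 2;               // U+0080..U+07FF
        }
        if (Character.isSurrogate(c)) {
            return 2;               // half of a 4-byte sequence
        }
        return 3;                   // rest of the Basic Multilingual Plane
    }
}
```

Summing utf8Bytes over every char you get() from the buffer should then match the file length, provided the input was well-formed UTF-8 without a byte order mark.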
If you want to know whether the char you received is part of a bigger
codepoint, the Character class now has a number of supporting methods.
Character.isHighSurrogate(MptChrBfr.get());
would tell you if it is the leading half of a surrogate pair.
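Putting the surrogate check to work, this is a rough sketch of walking a CharBuffer one codepoint at a time and recording how many UTF-8 bytes each codepoint needs. CodePointWalker and its methods are hypothetical names. An unpaired surrogate is counted as 3 bytes here purely by range, which is an assumption on my part: a real encoder would reject or replace it.

```java
import java.nio.CharBuffer;
import java.util.Arrays;

// Hypothetical helper, not part of the JDK.
final class CodePointWalker {
    /** UTF-8 length of a codepoint, by numeric range. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /** Consumes the buffer and returns one UTF-8 length per codepoint. */
    static int[] utf8LengthsPerCodePoint(CharBuffer buf) {
        int[] lengths = new int[buf.remaining()];
        int n = 0;
        while (buf.hasRemaining()) {
            char c = buf.get();
            int cp = c;
            if (Character.isHighSurrogate(c) && buf.hasRemaining()) {
                // Absolute get: peek at the next char without consuming it.
                char next = buf.get(buf.position());
                if (Character.isLowSurrogate(next)) {
                    buf.get(); // consume the trailing half
                    cp = Character.toCodePoint(c, next);
                }
            }
            lengths[n++] = utf8Length(cp);
        }
        return Arrays.copyOf(lengths, n);
    }
}
```

For a buffer wrapping "a", "é", "€" and one emoji, the lengths come out as 1, 2, 3 and 4.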
I'd look at the new methods on the Character and String classes. Dealing
with chars is a bit cumbersome. Just load everything into a String:
getBytes(StandardCharsets.UTF_8).length gives you the number of bytes it
takes up, and codePointCount(0, length()) gives you the number of
codepoints.
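A minimal sketch of the String route, assuming the whole text fits comfortably in memory. StringCounts is a made-up wrapper; getBytes, StandardCharsets.UTF_8 (Java 7) and codePointCount are the standard calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around standard String methods.
final class StringCounts {
    /** Total number of bytes the string occupies when encoded as UTF-8. */
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode codepoints, counting a surrogate pair once. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

Note that s.length() is the number of chars, so it is larger than the codepoint count whenever the string contains surrogate pairs.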
-jason
On 7/10/2012 6:21 AM, lbrt chx _ gemale wrote:
<snip>
> for (int j = 0; (j< MptChrBfr.length()); ++j){
> MptChrBfr.get();
> }
> ...
> ~
> each time you get() a unicode point from the buffer, you will get from 1 to 4 bytes and the sum of all "lengths" should equal the file length in bytes, right?
> ~
> I am using the (new) nio in java 7 and I wonder if sun made changes which make it hard to get the number of bytes a unicode point needs
> ~
> How can you get the number of bytes you "get()"?
> ~
> thank you
> lbrtchx
> comp.lang.java.programmer: number of bytes for each (uni)code point while using utf-8 as encoding ...