
Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

From	Jason Bailey <Jason.Bailey@sas.com>
Newsgroups	comp.lang.java.programmer
Subject	Re: number of bytes for each (uni)code point while using utf-8 as encoding ...
Date	2012-07-12 10:43 -0400
Organization	SAS Inc.
Message-ID	<jtmnpl$m0i$1@foggy.unx.sas.com>
References	<1341915690.235464@nntp.aceinnovative.com>


There's an incorrect assumption here. CharBuffer.get returns a char, not 
a code point. A Java char is a 16-bit UTF-16 code unit; depending on its 
value it takes 1 to 3 bytes in UTF-8. Code points outside the Basic 
Multilingual Plane need 2 chars (a surrogate pair) and take 4 bytes in 
UTF-8.

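If you want to determine how many bytes (either 1 or 2) the raw 16-bit 
value of your char needs, just do a comparison

boolean is2bytes = (MptChrBfr.get() >> 8) > 0;

Here you're taking the char and bit shifting it right by eight. If there 
are any bits left, the value doesn't fit in a single byte. That is not 
the same as its UTF-8 length, though: in UTF-8 a char below 0x80 takes 1 
byte, one below 0x800 takes 2, and any other non-surrogate char takes 3.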
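
For the UTF-8 question itself, the byte count follows directly from the 
value of the code point. A minimal sketch, assuming you already have a 
full code point in hand (the class and method names are made up for 
illustration, they are not part of the JDK):

```java
// Hypothetical helper, not a JDK class.
final class Utf8Length {

    /** Number of bytes UTF-8 uses to encode a single Unicode code point. */
    static int ofCodePoint(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        // Note: lone surrogates (0xD800..0xDFFF) are not valid in UTF-8;
        // this sketch only looks at the numeric range and reports 3 for them.
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;     // e.g. Latin-1 supplement, Greek
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // supplementary planes
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0xE9, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> " + ofCodePoint(cp));
        }
    }
}
```

Summing this over every code point of a well-formed text should give the 
size of the UTF-8 encoded file, not counting a byte order mark if one is 
present.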

If you want to know whether the char you received is part of a bigger 
code point, the Character class now has a number of supporting methods.

Character.isHighSurrogate(MptChrBfr.get());

would tell you if it is the leading half of a surrogate pair, i.e. the 
first of the two chars that make up a larger code point.
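
Put together, walking a CharBuffer one code point at a time might look 
roughly like the sketch below. The class name and method are invented 
for the example; only the Character and CharBuffer calls are real JDK 
API. Unpaired surrogates are simply passed through as they are.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a JDK class.
final class CodePointWalker {

    /** Reads the remaining chars of buf and returns them as code points. */
    static List<Integer> codePoints(CharBuffer buf) {
        List<Integer> result = new ArrayList<Integer>();
        while (buf.hasRemaining()) {
            char c = buf.get();
            // A high surrogate followed by a low surrogate forms one code point.
            if (Character.isHighSurrogate(c) && buf.hasRemaining()) {
                char next = buf.get(buf.position()); // absolute get: peek only
                if (Character.isLowSurrogate(next)) {
                    buf.get(); // consume the trailing half
                    result.add(Character.toCodePoint(c, next));
                    continue;
                }
            }
            result.add((int) c); // BMP char, or an unpaired surrogate
        }
        return result;
    }

    public static void main(String[] args) {
        CharBuffer buf = CharBuffer.wrap("a\u00E9\uD83D\uDE00");
        for (int cp : codePoints(buf)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```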

I'd look at the new methods on the Character and String classes. Dealing 
with chars is a bit cumbersome. Just load everything into a String; then 
getBytes with the UTF-8 charset gives you the number of bytes it takes 
up, and String.codePointCount gives you the number of code points.
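
A small sketch of the String route, assuming Java 7 so that 
StandardCharsets is available (the wrapper class and its method names 
are just for illustration):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a JDK class.
final class StringSizes {

    /** Size of s in bytes once encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode code points in s (a surrogate pair counts once). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', e-acute, euro sign, and one supplementary code point
        String s = "a\u00E9\u20AC\uD83D\uDE00";
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + codePoints(s));
        System.out.println("UTF-8 bytes: " + utf8Bytes(s));
    }
}
```

Keep in mind that s.length() counts chars, so it is larger than the 
code point count whenever the text contains surrogate pairs.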

-jason

On 7/10/2012 6:21 AM, lbrt chx _ gemale wrote:

<snip>
>      for (int j = 0; (j<  MptChrBfr.length()); ++j){
>       MptChrBfr.get();
>      }
>   ...
> ~
>   each time you get() a unicode point from the buffer, you will get from 1 to 4 bytes and the sum of all "lengths" should equal the file length in bytes, right?
> ~
>   I am using the (new) nio in java 7 and I wonder if sun made changes which make hard getting lenghts of bytes a unicode point needs
> ~
>   How can you get the number of bytes you "get()"?
> ~
>   thank you
>   lbrtchx
>   comp.lang.java.programmer: number of bytes for each (uni)code point while using utf-8 as encoding ...


