
number of bytes for each (uni)code point while using utf-8 as encoding ...

Started by	lbrt chx _ gemale
First post	2012-07-10 10:21 +0000
Last post	2012-07-12 10:43 -0400
Articles	4 — 4 participants


  number of bytes for each (uni)code point while using utf-8 as encoding ... lbrt chx _ gemale - 2012-07-10 10:21 +0000
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-07-10 20:13 +0200
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Roedy Green <see_website@mindprod.com.invalid> - 2012-07-11 19:04 -0700
    Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Jason Bailey <Jason.Bailey@sas.com> - 2012-07-12 10:43 -0400

#15914 — number of bytes for each (uni)code point while using utf-8 as encoding ...

From	lbrt chx _ gemale
Date	2012-07-10 10:21 +0000
Subject	number of bytes for each (uni)code point while using utf-8 as encoding ...
Message-ID	<1341915690.235464@nntp.aceinnovative.com>

number of bytes for each (uni)code point while using utf-8 as encoding ...
~ 
 you may iterate through all (uni)code points in a file encoded as utf-8 (or any other encoding) by going like this:
~ 
 ...
// __ 
    String aOEnc = "UTF-8";
    Charset InChrSt = Charset.forName(aOEnc);
    CharsetDecoder InDec = InChrSt.newDecoder();
    InDec.onMalformedInput(CodingErrorAction.REPORT);
    InDec.onUnmappableCharacter(CodingErrorAction.REPORT);
// __ 
    FileInputStream FIS = new FileInputStream(new File(<file path as string>));
    FileChannel IFlChnl = FIS.getChannel();
    MappedByteBuffer MptBytBfr = IFlChnl.map(FileChannel.MapMode.READ_ONLY, 0, IFlChnl.size());
    CharBuffer MptChrBfr = InDec.decode(MptBytBfr);
// __ 
    // length() is the number of chars still remaining, so it shrinks after
    // every get(); loop on hasRemaining() to visit every char
    while (MptChrBfr.hasRemaining()){
     MptChrBfr.get();
    }
 ...
~ 
 each time you get() a unicode point from the buffer, you will get from 1 to 4 bytes and the sum of all "lengths" should equal the file length in bytes, right?
~ 
 I am using the (new) nio in java 7 and I wonder if sun made changes which make it hard to get the number of bytes a unicode point needs
~ 
 How can you get the number of bytes you "get()"?
~ 
 thank you
 lbrtchx
 comp.lang.java.programmer: number of bytes for each (uni)code point while using utf-8 as encoding ...


#15919

From	Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid>
Date	2012-07-10 20:13 +0200
Message-ID	<jthrd2$p5g$1@dont-email.me>
In reply to	#15914

On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> number of bytes for each (uni)code point while using utf-8 as encoding ...
> <snip />
>  each time you get() a unicode point from the buffer, you will get from 1 to 4 bytes and the sum of all "lengths" should equal the file length in bytes, right?
> ~ 
>  I am using the (new) nio in java 7 and I wonder if sun made changes which make it hard to get the number of bytes a unicode point needs
> ~ 
>  How can you get the number of bytes you "get()"?

Well, UTF-8 always encodes the same char to the same (number of) bytes,
doesn't it? So you could just build a map char -> size /a priori/.
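
A lookup table is not strictly needed, since the size follows from the code point's range. A minimal sketch, assuming you already have a code point (not a lone char) in hand; the class name Utf8Length is made up for illustration:

```java
// Sketch: UTF-8 byte count per code point, taken from the UTF-8 range boundaries.
final class Utf8Length {
    private Utf8Length() {}

    /** Returns the number of bytes UTF-8 uses for one Unicode code point. */
    static int of(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        // Caveat: lone surrogates (U+D800..U+DFFF) fall in the 3-byte range here,
        // but a strict UTF-8 encoder refuses to encode them at all.
        if (codePoint < 0x80) return 1;     // U+0000..U+007F (ASCII)
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF (rest of the BMP)
        return 4;                           // U+10000..U+10FFFF (supplementary)
    }

    public static void main(String[] args) {
        System.out.println(of('A'));      // 1
        System.out.println(of(0x00E9));   // 2
        System.out.println(of(0x20AC));   // 3
        System.out.println(of(0x1F600));  // 4
    }
}
```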

But really, what's the use? Knowing how big in bytes your text will be?
Probably just as cheap to just write the text to a Writer backed by a
counting /dev/null OutputStream.
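
One way that counting sink might look, as a rough sketch. CountingSink and utf8Size are hypothetical names, not anything from the JDK; the stream throws the bytes away and only keeps a tally:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Sketch: an OutputStream that discards its input and counts the bytes.
final class CountingSink extends OutputStream {
    private long count;

    @Override public void write(int b) { count++; }
    @Override public void write(byte[] b, int off, int len) { count += len; }

    long count() { return count; }

    /** Encodes text as UTF-8 into the sink and returns the byte count. */
    static long utf8Size(String text) throws IOException {
        CountingSink sink = new CountingSink();
        // Closing the Writer flushes any chars still buffered in the encoder.
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return sink.count();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(utf8Size("na\u00EFve \u20AC"));  // 10
    }
}
```

This only gives the total, not the size of each code point.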

-- 
DF.


#15969

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2012-07-11 19:04 -0700
Message-ID	<taurv75nr6jvpenqqi91oprcj3liqp4k6g@4ax.com>
In reply to	#15914

On 10 Jul 2012 10:21:30 GMT, lbrt chx _ gemale wrote, quoted or
indirectly quoted someone who said :

>number of bytes for each (uni)code point while using utf-8 as encoding ...

Let's assume there is something not quite right in the UTF-8 encoding
of the file (or possibly the file is not even UTF-8).

Read the file with a Reader and UTF-8 encoding.
See http://mindprod.com/applet/fileio.html for the code.

Then write the internal encoding back out to another file. See code at
same place.

Compare the files byte by byte till you figure out what is going on.
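
The same round trip can be sketched in memory rather than through a second file. This is an assumption-laden shortcut, not the code from the page linked above: RoundTrip and firstDifference are made-up names, and new String(bytes, UTF_8) silently replaces malformed input with U+FFFD, which is exactly what makes the difference show up:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: decode bytes as UTF-8, re-encode, and find where the two diverge.
final class RoundTrip {
    private RoundTrip() {}

    /** Index of the first differing byte after decode + re-encode, or -1 if none. */
    static int firstDifference(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(original.length, reencoded.length);
        for (int i = 0; i < n; i++) {
            if (original[i] != reencoded[i]) return i;
        }
        // Same prefix: identical if the lengths match, else they differ at n.
        return original.length == reencoded.length ? -1 : n;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
            System.out.println(firstDifference(bytes));
        }
    }
}
```

A result of -1 means the file survived the round trip unchanged; any other value is the offset to start looking at.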

Unicode has alternate ways of doing accents: as a single precomposed
character, or as a base letter followed by a separate combining accent.
That may be nailing you. You might be adding/losing BOMs.  See
http://mindprod.com/jgloss/bom.html though I have never seen Java
insert or remove one.
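
To see how much these two effects can move the byte count, here is a small sketch using java.text.Normalizer. The class and helper names are invented for the example:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Sketch: the same visible text can have different UTF-8 lengths.
final class AccentForms {
    private AccentForms() {}

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Splits precomposed characters into base letter + combining marks (NFD). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";                // e-acute as one code point
        String decomposed = decompose(composed);   // 'e' followed by U+0301
        String withBom = "\uFEFF" + composed;      // leading byte order mark

        System.out.println(utf8Length(composed));    // 2
        System.out.println(utf8Length(decomposed));  // 3
        System.out.println(utf8Length(withBom));     // 5
    }
}
```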



-- 
Roedy Green Canadian Mind Products
http://mindprod.com
Mathematicians and computer scientists are far more interested 
in impressing you than informing you. If this were not
so, the tutorials on building a robots.txt file, for example,
would consist primarily of an annotated example. What you get 
instead are nothing but inscrutable abstract fragments in some 
obscure dialect of BNF.


#15990

From	Jason Bailey <Jason.Bailey@sas.com>
Date	2012-07-12 10:43 -0400
Message-ID	<jtmnpl$m0i$1@foggy.unx.sas.com>
In reply to	#15914

There's an incorrect assumption here. CharBuffer.get returns a char. A 
char is a UTF-16 code unit, not a code point. In UTF-8 a char outside the 
surrogate range encodes to 1, 2 or 3 bytes. Code points above U+FFFF need 
2 chars (a surrogate pair), which together encode to 4 bytes.

If you want to determine how many bytes (1, 2 or 3) your char 
represents in UTF-8, just do a comparison

char c = MptChrBfr.get();
int nBytes = (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;

Here you're comparing the char against the UTF-8 range boundaries. 
Values below 0x80 fit in one byte, values below 0x800 in two, and every 
other char needs three. This does not hold for surrogates, which only 
count as one half of a 4-byte pair.

If you want to know whether the char you received is part of a bigger 
code point, the Character class now has a number of supporting methods.

Character.isHighSurrogate(MptChrBfr.get());

would tell you if it is the leading half of a surrogate pair.
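
Putting the surrogate check to work on a CharBuffer could look roughly like this. It is only a sketch: CodePointWalker is a hypothetical helper, and it passes unpaired surrogates through unchanged rather than reporting them:

```java
import java.nio.CharBuffer;

// Sketch: read whole code points from a CharBuffer by pairing up surrogates.
final class CodePointWalker {
    private CodePointWalker() {}

    /** Reads the next code point from buf, consuming one or two chars. */
    static int nextCodePoint(CharBuffer buf) {
        char first = buf.get();
        if (Character.isHighSurrogate(first) && buf.hasRemaining()) {
            char second = buf.get(buf.position());  // absolute get: peek only
            if (Character.isLowSurrogate(second)) {
                buf.get();                          // now consume the low half
                return Character.toCodePoint(first, second);
            }
        }
        return first;  // BMP char, or an unpaired surrogate passed through as-is
    }

    /** Counts the code points remaining in buf, consuming them all. */
    static int countCodePoints(CharBuffer buf) {
        int n = 0;
        while (buf.hasRemaining()) {
            nextCodePoint(buf);
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        CharBuffer buf = CharBuffer.wrap("a\uD83D\uDE00b");
        System.out.println(countCodePoints(buf));  // 3
    }
}
```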

I'd look at the new methods on the Character and String classes. Dealing 
with chars is a bit cumbersome. Just load everything into a String: 
getBytes("UTF-8").length gives the number of bytes it takes up, and 
codePointCount(0, length()) gives the number of code points.
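
As a rough illustration of the String route, including a per-code-point breakdown. StringCounts is a made-up class name, and note that getBytes substitutes a single '?' byte for an unpaired surrogate, so malformed strings will not add up:

```java
import java.nio.charset.StandardCharsets;

// Sketch: code point and UTF-8 byte counts using only String/Character methods.
final class StringCounts {
    private StringCounts() {}

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** UTF-8 byte count for each code point of s, in order. */
    static int[] bytesPerCodePoint(String s) {
        int[] sizes = new int[codePoints(s)];
        int k = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String one = new String(Character.toChars(cp));
            sizes[k++] = one.getBytes(StandardCharsets.UTF_8).length;
            i += Character.charCount(cp);  // step 1 or 2 chars
        }
        return sizes;
    }

    public static void main(String[] args) {
        String s = "a\u00E9\u20AC\uD83D\uDE00";
        System.out.println(codePoints(s));  // 4
        System.out.println(utf8Bytes(s));   // 10
    }
}
```

For well-formed text the per-code-point sizes sum to the total, which is the check the original question was after.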

-jason

On 7/10/2012 6:21 AM, lbrt chx _ gemale wrote:

<snip>
>      for (int j = 0; (j<  MptChrBfr.length()); ++j){
>       MptChrBfr.get();
>      }
>   ...
> ~
>   each time you get() a unicode point from the buffer, you will get from 1 to 4 bytes and the sum of all "lengths" should equal the file length in bytes, right?
> ~
>   I am using the (new) nio in java 7 and I wonder if sun made changes which make it hard to get the number of bytes a unicode point needs
> ~
>   How can you get the number of bytes you "get()"?
> ~
>   thank you
>   lbrtchx
>   comp.lang.java.programmer: number of bytes for each (uni)code point while using utf-8 as encoding ...

