| Started by | lbrt chx _ gemale |
|---|---|
| First post | 2012-07-10 10:21 +0000 |
| Last post | 2012-07-12 10:43 -0400 |
| Articles | 4 — 4 participants |
number of bytes for each (uni)code point while using utf-8 as encoding ... lbrt chx _ gemale - 2012-07-10 10:21 +0000
Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> - 2012-07-10 20:13 +0200
Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Roedy Green <see_website@mindprod.com.invalid> - 2012-07-11 19:04 -0700
Re: number of bytes for each (uni)code point while using utf-8 as encoding ... Jason Bailey <Jason.Bailey@sas.com> - 2012-07-12 10:43 -0400
| From | lbrt chx _ gemale |
|---|---|
| Date | 2012-07-10 10:21 +0000 |
| Subject | number of bytes for each (uni)code point while using utf-8 as encoding ... |
| Message-ID | <1341915690.235464@nntp.aceinnovative.com> |
number of bytes for each (uni)code point while using utf-8 as encoding ...
~
You can iterate through all Unicode code points in a file encoded as UTF-8 (or any other encoding) like this:
~
...
// __
String aOEnc = "UTF-8";
Charset InChrSt = Charset.forName(aOEnc);
CharsetDecoder InDec = InChrSt.newDecoder();
InDec.onMalformedInput(CodingErrorAction.REPORT);
InDec.onUnmappableCharacter(CodingErrorAction.REPORT);
// __
FIS = new FileInputStream(new File(<file path as string>));
FileChannel IFlChnl = FIS.getChannel();
MappedByteBuffer MptBytBfr = IFlChnl.map(FileChannel.MapMode.READ_ONLY, 0, (int)IFlChnl.size());
CharBuffer MptChrBfr = InDec.decode(MptBytBfr);
// __
// get() advances the position and length() shrinks with it, so loop on hasRemaining()
while (MptChrBfr.hasRemaining()){
MptChrBfr.get();
}
...
~
Each time you get() a Unicode code point from the buffer, you consume from 1 to 4 bytes of the file, and the sum of all those "lengths" should equal the file length in bytes, right?
~
I am using the (new) NIO in Java 7, and I wonder whether Sun made changes that make it hard to get the number of bytes a Unicode code point needs.
~
How can you get the number of bytes you "get()"?
~
thank you
lbrtchx
comp.lang.java.programmer: number of bytes for each (uni)code point while using utf-8 as encoding ...
| From | Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid> |
|---|---|
| Date | 2012-07-10 20:13 +0200 |
| Message-ID | <jthrd2$p5g$1@dont-email.me> |
| In reply to | #15914 |
On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> number of bytes for each (uni)code point while using utf-8 as encoding ...
> <snip />
> each time you get() a unicode point from the buffer, you will get from 1 to 4 bytes and the sum of all "lengths" should equal the file length in bytes, right?
> ~
> I am using the (new) nio in java 7 and I wonder if sun made changes which make hard getting lenghts of bytes a unicode point needs
> ~
> How can you get the number of bytes you "get()"?

Well, UTF-8 always encodes the same char to the same (number of) bytes, doesn't it? So you could just build a map char -> size /a priori/.

But really, what's the use? Knowing how big in bytes your text will be? Probably just as cheap to just write the text to a Writer backed by a counting /dev/null OutputStream.

--
DF.
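The counting-sink idea above could be sketched roughly as follows. This is only an illustration: the class names `CountingNullOutputStream` and `Utf8SizeCounter` are hypothetical and do not come from the thread, and the sketch assumes the text is already available as a String.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the thread): an OutputStream that discards
// everything written to it and only counts the bytes.
class CountingNullOutputStream extends OutputStream {
    private long count = 0;

    @Override
    public void write(int b) {
        count++;
    }

    @Override
    public void write(byte[] buf, int off, int len) {
        count += len;
    }

    long getCount() {
        return count;
    }
}

class Utf8SizeCounter {
    // Returns the number of bytes the text occupies when encoded as UTF-8.
    static long utf8Size(String text) {
        CountingNullOutputStream sink = new CountingNullOutputStream();
        // Closing the writer flushes any bytes still buffered in the encoder.
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // The sink itself never throws, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
        return sink.getCount();
    }

    public static void main(String[] args) {
        // 'a' is 1 byte, U+00E9 is 2 bytes, U+20AC (euro sign) is 3 bytes.
        System.out.println(utf8Size("a\u00e9\u20ac")); // prints 6
    }
}
```

This gives the total size only; it does not report the size of each individual code point.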
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-07-11 19:04 -0700 |
| Message-ID | <taurv75nr6jvpenqqi91oprcj3liqp4k6g@4ax.com> |
| In reply to | #15914 |
On 10 Jul 2012 10:21:30 GMT, lbrt chx _ gemale wrote, quoted or indirectly quoted someone who said :
>number of bytes for each (uni)code point while using utf-8 as encoding ...

Let's assume there is something not quite right in the UTF-8 encoding of the file (or possibly the file is not even UTF-8).

Read the file with a Reader and UTF-8 encoding. See http://mindprod.com/applet/fileio.html for the code. Then write the internal encoding back out to another file. See code at same place. Compare the files byte by byte till you figure out what is going on.

Unicode has alternate ways of doing accents: as a single precomposed character, or as a base letter followed by a separate combining accent. That may be nailing you.

You might be adding/losing BOMs (byte order marks). See http://mindprod.com/jgloss/bom.html though I have never seen Java insert or remove one.

--
Roedy Green Canadian Mind Products http://mindprod.com
Mathematicians and computer scientists are far more interested in impressing you than informing you. If this were not so, the tutorials on building a robots.txt file, for example, would consist primarily of an annotated example. What you get instead are nothing but inscrutable abstract fragments in some obscure dialect of BNF.
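To make the two suspects above concrete, here is a small sketch showing how the composed and decomposed forms of the same accented letter differ in UTF-8 size, and how a leading UTF-8 BOM could be detected. The class and method names (`AccentAndBomCheck`, `utf8Length`, `startsWithUtf8Bom`) are made up for illustration; `java.text.Normalizer` is the standard JDK class.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class AccentAndBomCheck {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the bytes start with the UTF-8 byte order mark EF BB BF.
    static boolean startsWithUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        // Precomposed e-acute (one code point, U+00E9).
        String composed = "\u00e9";
        // NFD splits it into 'e' followed by the combining acute U+0301.
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(utf8Length(composed));   // prints 2
        System.out.println(utf8Length(decomposed)); // prints 3

        // U+FEFF at the start of a text encodes to the BOM bytes in UTF-8.
        byte[] withBom = "\uFEFFabc".getBytes(StandardCharsets.UTF_8);
        System.out.println(startsWithUtf8Bom(withBom)); // prints true
    }
}
```

If a file was rewritten in a different normalization form, or a BOM was added or dropped along the way, byte totals like these will not match even though the text looks identical.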
| From | Jason Bailey <Jason.Bailey@sas.com> |
|---|---|
| Date | 2012-07-12 10:43 -0400 |
| Message-ID | <jtmnpl$m0i$1@foggy.unx.sas.com> |
| In reply to | #15914 |
There's an incorrect assumption here. CharBuffer.get() returns a char, not a
code point. A char is a 16-bit UTF-16 code unit; encoded as UTF-8 it takes 1 to
3 bytes. Code points outside the Basic Multilingual Plane need 2 chars (a
surrogate pair), which take 4 bytes in UTF-8.
If you want to determine whether the value of your char fits in a single byte
or needs two, just do a comparison
boolean is2bytes = (MptChrBfr.get() >> 8) > 0;
Here you're taking the char and bit shifting it right by 8 bits. If any bits
are left, the value does not fit in one byte. Note that this describes the char
value itself, not its length in UTF-8, where only chars below 0x80 take a
single byte.
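For the UTF-8 length specifically, the size follows directly from the numeric range of the code point, so no shifting is needed. A minimal sketch, with the hypothetical class name `Utf8Length` (not part of the JDK or the thread):

```java
class Utf8Length {
    // Number of bytes UTF-8 uses to encode one Unicode code point.
    // Surrogate values (U+D800..U+DFFF) are not encodable on their own and
    // are rejected here, as are values outside the Unicode range.
    static int ofCodePoint(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException(
                    "not an encodable code point: " + codePoint);
        }
        if (codePoint < 0x80) {
            return 1; // ASCII
        }
        if (codePoint < 0x800) {
            return 2; // e.g. Latin letters with accents, Greek, Cyrillic
        }
        if (codePoint < 0x10000) {
            return 3; // rest of the Basic Multilingual Plane
        }
        return 4;     // supplementary planes (surrogate pairs in UTF-16)
    }

    public static void main(String[] args) {
        System.out.println(ofCodePoint('A'));     // prints 1
        System.out.println(ofCodePoint(0x00E9));  // prints 2
        System.out.println(ofCodePoint(0x20AC));  // prints 3
        System.out.println(ofCodePoint(0x1F600)); // prints 4
    }
}
```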
If you want to know whether the char you received is part of a bigger
code point, the Character class now has a number of supporting methods.
Character.isHighSurrogate(MptChrBfr.get());
would tell you if it is the leading half of a surrogate pair.
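As a rough illustration of how that check fits into a loop over the buffer, the sketch below walks a CharBuffer and treats a high surrogate followed by a low surrogate as one code point. The class name `CodePointWalker` is hypothetical, and an unpaired surrogate is simply counted as one unit here, which is an assumption rather than anything the thread prescribes.

```java
import java.nio.CharBuffer;

class CodePointWalker {
    // Counts the code points between the buffer's position and its limit,
    // consuming the buffer in the process.
    static int countCodePoints(CharBuffer buf) {
        int count = 0;
        while (buf.hasRemaining()) {
            char c = buf.get();
            // Peek at the next char (absolute get does not move the position)
            // and swallow it if it completes a surrogate pair.
            if (Character.isHighSurrogate(c)
                    && buf.hasRemaining()
                    && Character.isLowSurrogate(buf.get(buf.position()))) {
                buf.get();
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (stored as two chars), 'b'
        CharBuffer buf = CharBuffer.wrap("a\uD83D\uDE00b");
        System.out.println(countCodePoints(buf)); // prints 3
    }
}
```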
I'd look at the new methods on the Character and String classes. Dealing
with chars is a bit cumbersome. Just load everything into a String and
you can see the number of bytes that it takes up, and if you want to know
the number of code points, use String.codePointCount.
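A short sketch of the String-based approach, assuming the whole text fits in memory. The class name `StringTotals` and its two helpers are illustrative only; `String.codePointCount` and `String.getBytes(Charset)` are the standard methods.

```java
import java.nio.charset.StandardCharsets;

class StringTotals {
    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2), U+20AC (3), U+1F600 (4, two chars)
        String s = "a\u00e9\u20ac\uD83D\uDE00";
        System.out.println(s.length());    // prints 5 (chars)
        System.out.println(codePoints(s)); // prints 4 (code points)
        System.out.println(utf8Bytes(s));  // prints 10 (bytes)
    }
}
```

The three numbers differ for the same text, which is why counting get() calls on a CharBuffer does not add up to the file length.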
-jason
On 7/10/2012 6:21 AM, lbrt chx _ gemale wrote:
<snip>
> for (int j = 0; (j< MptChrBfr.length()); ++j){
> MptChrBfr.get();
> }
> ...
> ~
> each time you get() a unicode point from the buffer, you will get from 1 to 4 bytes and the sum of all "lengths" should equal the file length in bytes, right?
> ~
> I am using the (new) nio in java 7 and I wonder if sun made changes which make hard getting lenghts of bytes a unicode point needs
> ~
> How can you get the number of bytes you "get()"?
> ~
> thank you
> lbrtchx
> comp.lang.java.programmer: number of bytes for each (uni)code point while using utf-8 as encoding ...