number of bytes for each (uni)code point while using utf-8 as encoding ...

From	lbrt chx _ gemale
Newsgroups	comp.lang.java.programmer
Subject	number of bytes for each (uni)code point while using utf-8 as encoding ...
In-Reply-To	<1341965282.664308@nntp.aceinnovative.com>
Message-ID	<1342030685.407730@nntp.aceinnovative.com> (permalink)
Date	11 Jul 2012 18:18:05 GMT

>> how to get the length of the sequence of bytes defining a code point

>Use a look up table.
~ 
 Yes, rossum, this is what I was trying to get around ;-)
~ 
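// __ unicode.org/versions/Unicode6.1.0/
// Exclusive upper limits of the code point ranges encoded in 1, 2, 3 and 4 bytes.
// Unicode ends at U+10FFFF, so the old 5- and 6-byte UTF-8 forms are never needed.
 private final long[] lKpPntLims = new long[]{ 
           128   // 0x80     -> 1 byte
        , 2048   // 0x800    -> 2 bytes
       , 65536   // 0x10000  -> 3 bytes
     , 1114112   // 0x110000 -> 4 bytes
 };

// __ needs: import java.io.IOException;
// Note: surrogates (0xD800..0xDFFF) are not filtered out here and report 3.
 private final int getKdPntLBytes(long lKdPnt) throws IOException{
  int iByts = 0;
  boolean Is = false;
  if(lKdPnt >= 0){ // negative values are not code points
   for(; ((iByts < lKpPntLims.length) && !Is); ++iByts){
    Is = (lKdPnt < lKpPntLims[iByts]);
   }// on a hit, iByts is the byte count: [1, lKpPntLims.length]
  }
  if(!Is){ throw new IOException("// __ Code point not mapped by Unicode Standard 6.1.0! lKdPnt: |" + lKdPnt + "|"); }
  return(iByts);
 }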
~ 
 The thing is that the constant casting gets expensive, and even if you declare a function (and all its functional context) to be final, you have no guarantee that the compiler will inline it.
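~ 
 For comparison, here is a minimal sketch of a table-free variant. It assumes the code point is held as an int (as the java.lang.Character methods do), so no long casts are involved. The class name Utf8Len and the method utf8Length are made up for illustration and are not part of any API. Whether the JIT inlines it is still not guaranteed; the sketch only removes the array lookup and the casts.

```java
final class Utf8Len {
    // Hypothetical helper: number of bytes UTF-8 uses for one Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        // Surrogates cannot be encoded in well-formed UTF-8.
        if (codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE) {
            throw new IllegalArgumentException("surrogate: " + codePoint);
        }
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " -> " + utf8Length(cp) + " byte(s)");
        }
    }
}
```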
~ 
 IMO, this functionality should be part of the API, or I just haven't found it there. I have even found silly off-by-one errors in presumably committed code:
~ 
 http://code.google.com/p/xbird/source/browse/trunk/xbird-open/main/src/java/xbird/util/codec/UTF8Codec.java
~ 
 And yes, I let them know.
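~ 
 On whether the API already covers this: as far as I know there is no single JDK call that returns the UTF-8 length of a code point, but the count can be obtained by actually encoding it with standard calls. The sketch below does that and can serve as a cross-check for a hand-written routine. The class and method names (Utf8LenViaApi, encodedLength, sumPerCodePoint) are made up for illustration. It allocates a String and a byte array per call, so it is a reference check rather than a fast path, and lone surrogates are replaced during encoding, so results for them are not meaningful.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LenViaApi {
    // Encodes one code point with the JDK encoder and counts the resulting bytes.
    // Character.toChars throws IllegalArgumentException for values outside Unicode.
    static int encodedLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Walks a String code point by code point and sums the per-code-point lengths.
    static int sumPerCodePoint(String text) {
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            total += encodedLength(cp);
            i += Character.charCount(cp); // 1 or 2 chars, depending on the code point
        }
        return total;
    }

    public static void main(String[] args) {
        String text = "A\u00E9\u20AC" + new String(Character.toChars(0x1F600));
        System.out.println(sumPerCodePoint(text));
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);
    }
}
```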
~ 
 lbrtchx
