From: lbrt chx _ gemale
Newsgroups: comp.lang.java.programmer
Subject: number of bytes for each (uni)code point while using utf-8 as encoding ...
In-Reply-To: <1341965282.664308@nntp.aceinnovative.com>
Message-ID: <1342030685.407730@nntp.aceinnovative.com>
Date: 11 Jul 2012 18:18:05 GMT

>> how to get the length of the sequence of bytes defining a code point
> Use a look up table.
~
 Yes, rossum, this is what I was trying to get around ;-)
~
 // needs: import java.io.IOException;

 // __ unicode.org/versions/Unicode6.1.0/
 // exclusive upper limits for 1- to 6-byte sequences: 2^7, 2^11, 2^16, 2^21, 2^26, 2^31
 // (Unicode code points end at U+10FFFF, so only the first four entries apply to valid input)
 private final long[] lKpPntLims = new long[]{ 128, 2048, 65536, 2097152, 67108864, 2147483648L };

 // __ returns the number of bytes needed to encode lKdPnt
 private final int getKdPntLBytes(long lKdPnt) throws IOException{
  int iByts = 0;
  boolean Is = false;
  for(; ((iByts < lKpPntLims.length) && !Is); ++iByts){ Is = (lKdPnt < lKpPntLims[iByts]); }
  // if Is is true here, iByts is in [1, lKpPntLims.length], i.e. the byte count
  if(!Is || (lKdPnt < 0)){ // negative values are not code points either
   throw new IOException("// __ Code point not mapped by Unicode Standard 6.1.0! lKdPnt: |" + lKdPnt + "|");
  }
  return(iByts);
 }
~
 The thing is that the constant casting gets expensive, and even if you declare a method (and everything it depends on) to be final, you have no guarantee that the compiler will inline it
~
 I still think that this functionality should be part of the API; either that, or I just haven't found a way around it.
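For comparison, here is a minimal sketch of the same byte count done with plain range checks on an int, so no table and no long casts are involved. It assumes only Unicode scalar values matter (U+0000 to U+10FFFF, surrogates excluded), which is what RFC 3629 allows and which caps the result at four bytes. The names Utf8Length, utf8Length and utf8LengthViaJdk are made up for illustration and are not part of any existing API. Nothing here says anything about whether the JIT will inline it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK or of the code in this thread.
final class Utf8Length {
    private Utf8Length() {}

    /** Number of bytes UTF-8 (as restricted by RFC 3629) uses for one Unicode scalar value. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        if (codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE) {
            // Surrogates are code points, but they have no UTF-8 encoding of their own.
            throw new IllegalArgumentException("surrogate code point: " + codePoint);
        }
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    /** Slow cross-check: let the JDK encode the code point and count the bytes. */
    static int utf8LengthViaJdk(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s), JDK encoder says %d%n",
                    cp, utf8Length(cp), utf8LengthViaJdk(cp));
        }
    }
}
```

The second method is only there for testing: running both over the values on either side of each range limit (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000) is a cheap way to catch boundary mistakes.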
 I have even found silly off-by-one errors in code that was presumably committed:
~
 http://code.google.com/p/xbird/source/browse/trunk/xbird-open/main/src/java/xbird/util/codec/UTF8Codec.java
~
 and yes, I let them know
~
 lbrtchx