
Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

From	Daniele Futtorovic <da.futt.news@laposte-dot-net.invalid>
Newsgroups	comp.lang.java.programmer
Subject	Re: number of bytes for each (uni)code point while using utf-8 as encoding ...
Date	Tue, 10 Jul 2012 20:13:53 +0200
Organization	A noiseless patient Spider
Lines	18
Message-ID	<jthrd2$p5g$1@dont-email.me>
References	<1341915690.235464@nntp.aceinnovative.com>

On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> number of bytes for each (uni)code point while using utf-8 as encoding ...
> <snip />
>  each time you get() a unicode point from the buffer, you will get from 1 to 4 bytes and the sum of all "lengths" should equal the file length in bytes, right?
> ~ 
>  I am using the (new) nio in java 7 and I wonder if sun made changes which make hard getting lengths of bytes a unicode point needs
> ~ 
>  How can you get the number of bytes you "get()"?

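Well, UTF-8 always encodes the same code point to the same (number of)
bytes, doesn't it? So you could just build a map code point -> size
/a priori/. (Code point, not Java char: a surrogate pair is two chars
but one code point, and it encodes to 4 bytes.)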
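
For what it's worth, the "map" need not be a literal table, because the UTF-8 size depends only on which range the code point falls in. A minimal sketch of that idea follows; the class and method names (Utf8Length, utf8Length) are made up for illustration and are not part of any JDK API. It iterates by code point rather than by char. One caveat: an unpaired surrogate is counted as 3 bytes here, whereas Java's own encoder would normally substitute a replacement byte, so the two can disagree on malformed input.

```java
// Hypothetical helper: computes UTF-8 sizes from code point ranges.
class Utf8Length {

    // Number of bytes UTF-8 uses for a single code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    // Sum of the per-code-point sizes for a whole text.
    static long utf8Length(CharSequence text) {
        long total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            total += utf8Length(cp);
            i += Character.charCount(cp);   // 2 for a surrogate pair, else 1
        }
        return total;
    }

    public static void main(String[] args) {
        // Mixed sample: ASCII, a 2-byte, a 3-byte and a 4-byte code point.
        String sample = "h\u00e9llo \u20ac \uD83D\uDE00";
        System.out.println(utf8Length(sample));
    }
}
```

For well-formed text the total should match the length of the array returned by String.getBytes with the UTF-8 charset, which is an easy cross-check.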

But really, what's the use? Knowing how big in bytes your text will be?
It's probably just as cheap to write the text to a Writer backed by a
counting /dev/null OutputStream.
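
A rough sketch of the counting /dev/null idea, assuming Java 7 (StandardCharsets, try-with-resources). CountingNullOutputStream and encodedLength are names invented here, not existing JDK classes. The stream throws every byte away and only remembers how many it saw; an OutputStreamWriter does the actual UTF-8 encoding.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical sink: discards all bytes, keeps only a running count.
class CountingNullOutputStream extends OutputStream {
    private long count;

    @Override
    public void write(int b) {
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
        count += len;
    }

    long getCount() {
        return count;
    }

    // Encodes the text as UTF-8 into the sink and returns the byte count.
    static long encodedLength(String text) {
        CountingNullOutputStream sink = new CountingNullOutputStream();
        // Closing the writer flushes the encoder, so the count is complete.
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            // Not expected: the sink itself never throws.
            throw new IllegalStateException(e);
        }
        return sink.getCount();
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("h\u00e9llo \u20ac \uD83D\uDE00"));
    }
}
```

This gives the total size only, not the per-code-point lengths the original question asked about, but it never holds the encoded bytes in memory.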

-- 
DF.
