From: Daniele Futtorovic
Newsgroups: comp.lang.java.programmer
Subject: Re: number of bytes for each (uni)code point while using utf-8 as encoding ...
Date: Tue, 10 Jul 2012 20:13:53 +0200

On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
> number of bytes for each (uni)code point while using utf-8 as encoding ...
>
> each time you get() a Unicode code point from the buffer, you will get from 1 to 4 bytes, and the sum of all "lengths" should equal the file length in bytes, right?
> ~
> I am using the (new) NIO in Java 7 and I wonder if Sun made changes which make it hard to get the number of bytes a Unicode code point needs
> ~
> How can you get the number of bytes you "get()"?

Well, UTF-8 always encodes the same code point to the same (number of) bytes, doesn't it? So you could just build a map code point -> size /a priori/; the size depends only on the range the code point falls in.

But really, what's the use? Knowing how big in bytes your text will be? Probably just as cheap to write the text to a Writer backed by a counting /dev/null OutputStream.

-- 
DF.
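
Both suggestions in the reply can be sketched in Java. This is a minimal illustration, not anything taken from the thread: the class name Utf8Sizes, its method names, and the CountingOutputStream helper are all hypothetical. It assumes well-formed text; an unpaired surrogate is replaced by the encoder when written through a Writer, so the two counts can differ for such input.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class Utf8Sizes {

    // UTF-8 length of a single Unicode code point (not a Java char,
    // which is a UTF-16 code unit). The length depends only on the range.
    static int bytesForCodePoint(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF
        return 4;                            // U+10000..U+10FFFF
    }

    // Sums the per-code-point lengths, stepping over surrogate pairs.
    // Assumes the text contains no unpaired surrogates.
    static long bytesForText(CharSequence text) {
        long total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            total += bytesForCodePoint(cp);
            i += Character.charCount(cp);
        }
        return total;
    }

    // A "/dev/null" stream that discards the bytes and only counts them.
    static final class CountingOutputStream extends OutputStream {
        private long count;

        @Override public void write(int b) { count++; }

        @Override public void write(byte[] b, int off, int len) { count += len; }

        long getCount() { return count; }
    }

    // Encodes the text through a Writer and reports how many bytes came out.
    static long bytesByWriting(String text) {
        CountingOutputStream sink = new CountingOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            // The counting sink never throws, so this is not expected.
            throw new IllegalStateException(e);
        }
        return sink.getCount();
    }

    public static void main(String[] args) {
        // 'a', U+00E9, U+20AC, and U+1F600 (written as a surrogate pair).
        String sample = "a\u00e9\u20ac\uD83D\uDE00";
        System.out.println(bytesForText(sample));
        System.out.println(bytesByWriting(sample));
    }
}
```

The range check avoids building an actual map, since the four ranges fully determine the size. The counting stream is the more general route: it works for any charset the Writer is given, at the cost of actually running the encoder.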