Re: Efficient unicode string implementation was: Re: Why No Supplemental Characters In Character Literals?

From Ken Wesson <kwesson@gmail.com>
Subject Re: Efficient unicode string implementation was: Re: Why No Supplemental Characters In Character Literals?
Newsgroups comp.lang.java.programmer
References <iig4k2$sus$1@lust.ihug.co.nz> <iig6j2$dul$2@news.albasani.net> <iig84e$uqu$1@lust.ihug.co.nz> <4d4c2019$0$23753$14726298@news.sunsite.dk> <iihbuo$cqo$1@news.eternal-september.org> <iihhdo$emc$1@news.eternal-september.org> <alpine.DEB.1.10.1102042036190.11442@urchin.earth.li>
MIME-Version 1.0
Content-Type text/plain; charset=UTF-8
Content-Transfer-Encoding 8bit
NNTP-Posting-Host $$-cwgml$lsc2q.news.x-privat.org
Message-ID <4d4cc33b$1@news.x-privat.org>
Date 5 Feb 2011 04:25:47 +0100
Organization X-Privat.Org NNTP Server - http://www.x-privat.org


On Fri, 04 Feb 2011 21:30:57 +0000, Tom Anderson wrote:

> On Fri, 4 Feb 2011, Joshua Cranmer wrote:
> 
>>> "Arne Vajhøj" <arne@vajhoej.dk> wrote in message
>>>
>>>> But since codepoints above U+FFFF were added after the String class
>>>> was defined, then the options on how to handle it were pretty
>>>> limited.
>>
>> Extending to 24 bits is problematic because 24 bits opens you up to
>> unaligned memory access on most, if not all, platforms, so you'd have
>> to go fully up to 32 bits (this is what the codePoint methods in String
>> et al. do). But considering the sheer amount of Strings in memory,
>> going to 32-bit memory storage for Strings now doubles the size of that
>> data... and can increase memory consumption in some cases by 30-40%.
> 
> This is something i ponder quite a lot.
> 
> It's essential that computers be able to represent characters from any
> living human script. The astral planes include some such characters,
> notably in the CJK extensions, without which it is impossible to write
> some people's names correctly. The necessity of supporting more than
> 2**16 codepoints is simply beyond question.
> 
> The problem is how to do it efficiently.
> 
> Going to strings of 24- or 32-bit characters would indeed be prohibitive
> in its effect on memory. But isn't 16-bit already an eye-watering waste?
> Most characters currently sitting in RAM around the world are, i would
> wager, in the ASCII range: the great majority of characters in almost
> any text in a latin script will be ASCII, in that they won't have
> diacritics [1] (and most text is still in latin script), and almost all
> characters in non-natural-language text (HTML and XML markup,
> configuration files, filesystem paths) will be ASCII. A sizeable
> fraction of non-latin text is still encodable in one byte per character,
> using a national character set. Forcing all users of programs written in
> Java (or any other platform which uses UCS-2 encoding) to spend two
> bytes on each of those characters to ease the lives of the minority of
> users who store a lot of CJK text seems wildly regressive.
> 
> I am, however, at a loss to suggest a practical alternative!
> 
> A question to the house, then: has anyone ever invented a data structure
> for strings which allows space-efficient storage for strings in
> different scripts, but also allows time-efficient implementation of the
> common string operations?
> 
> Upthread, Joshua mentions the idea of using UTF-8 strings, and caching
> codepoint-to-bytepoint mappings. That's certainly an approach that would
> work, although i worry about the performance effect of generating so
> many writes, the difficulty of making it correct in multithreaded
> systems, and the dependency on a good cache hit rate to make it pay off.
> 
> Anyone else?

I vote for a hybrid RLE approach: each run of characters sharing the same
high three bytes (of a 32-bit code point) is stored as a length, those
three high bytes, and then the low byte of every character in the run. For
plain ASCII text that means <length of string> 0 0 0 <ASCII string>.
Many texts in other languages will have long runs with a fixed pattern of
high bytes, or long runs of 0 0 0 and the odd accented character. Limit
the run length to 255 so the length always fits in one byte. Every run
then gets four bytes added, instead of every *character* getting three.
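
A minimal sketch of that run layout in Java, to make the byte counts
concrete. It assumes each run is stored as one length byte (1..255), the
three high bytes of the code point (most significant first), and then one
low byte per character. The names RleString, encode and decode are made up
for illustration; nothing here is an existing library API, and decode
trusts its input rather than validating it.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of the run-length layout described above.
final class RleString {

    // Encode a String as runs: [length][high3][high2][high1][low bytes...].
    static byte[] encode(String s) {
        int[] cps = s.codePoints().toArray();   // one int per code point
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < cps.length) {
            int high = cps[i] >>> 8;            // the shared "high three bytes"
            int j = i;
            // Extend the run while the high bytes match, up to 255 characters.
            while (j < cps.length && j - i < 255 && (cps[j] >>> 8) == high) {
                j++;
            }
            out.write(j - i);                   // run length, fits in one byte
            out.write(high >>> 16);             // write(int) keeps the low 8 bits
            out.write(high >>> 8);
            out.write(high);
            for (int k = i; k < j; k++) {
                out.write(cps[k]);              // low byte of each character
            }
            i = j;
        }
        return out.toByteArray();
    }

    // Decode the layout produced by encode; assumes well-formed input.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int p = 0;
        while (p < data.length) {
            int len = data[p] & 0xFF;
            int high = ((data[p + 1] & 0xFF) << 16)
                     | ((data[p + 2] & 0xFF) << 8)
                     |  (data[p + 3] & 0xFF);
            p += 4;
            for (int k = 0; k < len; k++) {
                sb.appendCodePoint((high << 8) | (data[p++] & 0xFF));
            }
        }
        return sb.toString();
    }
}
```

With this layout, encode("hello") takes 9 bytes (4 for the run header plus
5 low bytes), and a string that alternates between scripts on every
character pays the 4-byte header each time. Note that finding the n-th
character means walking the run headers from the start, so random access
is not constant-time in this sketch.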



