| From | Ken Wesson <kwesson@gmail.com> |
|---|---|
| Subject | Re: Efficient unicode string implementation was: Re: Why No Supplemental Characters In Character Literals? |
| Newsgroups | comp.lang.java.programmer |
| References | <iig4k2$sus$1@lust.ihug.co.nz> <iig6j2$dul$2@news.albasani.net> <iig84e$uqu$1@lust.ihug.co.nz> <4d4c2019$0$23753$14726298@news.sunsite.dk> <iihbuo$cqo$1@news.eternal-september.org> <iihhdo$emc$1@news.eternal-september.org> <alpine.DEB.1.10.1102042036190.11442@urchin.earth.li> |
| MIME-Version | 1.0 |
| Content-Type | text/plain; charset=UTF-8 |
| Content-Transfer-Encoding | 8bit |
| NNTP-Posting-Host | $$-cwgml$lsc2q.news.x-privat.org |
| Message-ID | <4d4cc33b$1@news.x-privat.org> |
| Date | 5 Feb 2011 04:25:47 +0100 |
| Organization | X-Privat.Org NNTP Server - http://www.x-privat.org |
| Lines | 64 |
| X-Authenticated-User | $$o-16a0wpsuhxkoyemw |
| X-Complaints-To | abuse@x-privat.org |
| Path | csiph.com!eeepc.pasdenom.info!news.pasdenom.info!news.dougwise.org!gegeweb.org!newsfeed.x-privat.org!x-privat.org!not-for-mail |
| Xref | csiph.com comp.lang.java.programmer:25975 |
On Fri, 04 Feb 2011 21:30:57 +0000, Tom Anderson wrote:

> On Fri, 4 Feb 2011, Joshua Cranmer wrote:
>
>>> "Arne Vajhøj" <arne@vajhoej.dk> wrote in message
>>>
>>>> But since codepoints above U+FFFF was added after the String class
>>>> was defined, then the options on how to handle it were pretty
>>>> limited.
>>
>> Extending to 24 bits is problematic because 24 bits opens you up to
>> unaligned memory access on most, if not all, platforms, so you'd have
>> to go fully up to 32 bits (this is what the codePoint methods in String
>> et al. do). But considering the sheer amount of Strings in memory,
>> going to 32-bit memory storage for Strings now doubles the size of that
>> data... and can increase memory consumption in some cases by 30-40%.
>
> This is something i ponder quite a lot.
>
> It's essential that computers be able to represent characters from any
> living human script. The astral planes include some such characters,
> notably in the CJK extensions, without which it is impossible to write
> some people's names correctly. The necessity of supporting more than
> 2**16 codepoints is simply beyond question.
>
> The problem is how to do it efficiently.
>
> Going to strings of 24- or 32-bit characters would indeed be prohibitive
> in its effect in memory. But isn't 16-bit already an eye-watering waste?
> Most characters currently sitting in RAM around the world are, i would
> wager, in the ASCII range: the great majority of characters in almost
> any text in a latin script will be ASCII, in that they won't have
> diacritics [1] (and most text is still in latin script), and almost all
> characters in non-natural-language text (HTML and XML markup,
> configuration files, filesystem paths) will be ASCII. A sizeable
> fraction of non-latin text is still encodable in one byte per character,
> using a national character set. Forcing all users of programs written in
> Java (or any other platform which uses UCS-2 encoding) to spend two
> bytes on each of those characters to ease the lives of the minority of
> users who store a lot of CJK text seems wildly regressive.
>
> I am, however, at a loss to suggest a practical alternative!
>
> A question to the house, then: has anyone ever invented a data structure
> for strings which allows space-efficient storage for strings in
> different scripts, but also allows time-efficient implementation of the
> common string operations?
>
> Upthread, Joshua mentions the idea of using UTF-8 strings, and cacheing
> codepoint-to-bytepoint mappings. That's certainly an approach that would
> work, although i worry about the performance effect of generating so
> many writes, the difficulty of making it correct in multithreaded
> systems, and the dependency on a good cache hit rate to make it pay off.
>
> Anyone else?

I vote a hybrid-RLE approach: runs with the same high three bytes have a
length, the high three bytes, and then all the low bytes of the run. For
plain ASCII text that will mean <length of string> 0 0 0 <ASCII string>.
A lot of other language texts will have long runs with a fixed pattern of
high bytes, or long runs of 0 0 0 and the odd accented character.

Limit run length to 255 so the length is always one byte. So every run
gets four bytes added, instead of every *character* getting three.
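
A minimal sketch of the run layout described above, in Java, may make the byte accounting easier to check. It is an illustration under assumptions, not a tested string implementation: the class name `HybridRle` and the methods `encode` and `decode` are hypothetical, each code point is treated as a 32-bit value whose top three bytes form the run key, and every run is written as one length byte, three high bytes, then one low byte per character.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of the hybrid-RLE layout; not an existing API.
final class HybridRle {
    // Run length must fit in one unsigned byte.
    private static final int MAX_RUN = 255;

    /** Encodes s as runs of [length][high 3 bytes][one low byte per code point]. */
    static byte[] encode(String s) {
        int[] cps = s.codePoints().toArray();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < cps.length) {
            int high = cps[i] >>> 8;          // the shared high three bytes
            int j = i;
            while (j < cps.length && j - i < MAX_RUN && (cps[j] >>> 8) == high) {
                j++;
            }
            out.write(j - i);                 // run length, 1..255
            out.write(high >>> 16);           // write(int) keeps only the low 8 bits
            out.write(high >>> 8);
            out.write(high);
            for (int k = i; k < j; k++) {
                out.write(cps[k]);            // low byte of each code point
            }
            i = j;
        }
        return out.toByteArray();
    }

    /** Reverses encode(); assumes the input was produced by encode(). */
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int p = 0;
        while (p < data.length) {
            int len = data[p] & 0xFF;
            int high = ((data[p + 1] & 0xFF) << 16)
                     | ((data[p + 2] & 0xFF) << 8)
                     |  (data[p + 3] & 0xFF);
            p += 4;
            for (int k = 0; k < len; k++) {
                sb.appendCodePoint((high << 8) | (data[p++] & 0xFF));
            }
        }
        return sb.toString();
    }
}
```

With this layout a pure-ASCII string of up to 255 characters costs its length plus four bytes, while text that alternates between blocks pays four bytes of overhead at every switch. The sketch does nothing about indexed access: finding the n-th character here means walking the run headers, so a practical version would presumably need some index over the runs.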