
Re: Efficient unicode string implementation was: Re: Why No Supplemental Characters In Character Literals?

Date	2011-02-04 18:41 -0500
From	Arne Vajhøj <arne@vajhoej.dk>
Newsgroups	comp.lang.java.programmer
Subject	Re: Efficient unicode string implementation was: Re: Why No Supplemental Characters In Character Literals?
References	(3 earlier) <4d4c2019$0$23753$14726298@news.sunsite.dk> <iihbuo$cqo$1@news.eternal-september.org> <iihhdo$emc$1@news.eternal-september.org> <alpine.DEB.1.10.1102042036190.11442@urchin.earth.li> <4p1pk6dv6fg1firm1hvvh3jqaga6l69rib@4ax.com>
Message-ID	<4d4c8ea6$0$23758$14726298@news.sunsite.dk> (permalink)
Organization	SunSITE.dk - Supporting Open source


On 04-02-2011 18:22, Roedy Green wrote:
> On Fri, 4 Feb 2011 21:30:57 +0000, Tom Anderson <twic@urchin.earth.li>
> wrote, quoted or indirectly quoted someone who said :
>
>> I am, however, at a loss to suggest a practical alternative!
>
> What might happen is strings are nominally 32-bit.
>
> You could probably come up with a very rapid compression scheme,
> similar to UTF-8 but with a bit more compression, that could be
> applied to strings at garbage collection time if they have not been
> referenced since the last GC sweep.
>
> Strings are immutable.  This admits some other flavours of
> "compression".
>
> If the high three bytes of every character in the string are 0, store
> the string UNCOMPRESSED, as a string of bytes.  All the indexOf indexing
> arithmetic works identically.  This behaviour is hidden inside the
> JVM. The String class knows nothing about it. It is an implementation
> detail of 32-bit strings.
>
> If the high two bytes of every character are 0, store the string
> uncompressed as a string of unsigned shorts.
>
> If there are any one bits in the high two bytes of any character, store
> the string as a string of unsigned ints.
>
> Strings are what you gobble up your RAM with.  If we start supporting
> 32-bit chars, we have to do something to compensate for the doubling
> of RAM use.
>
>
> Short lived strings would still be 32-bit.  They would only be
> converted to the other forms if they have been sitting around for a
> while.  Interned strings would be immediately converted to canonical
> form.
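
The three-width storage rule quoted above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not
JVM code: the class name CompactString and all of its methods are
hypothetical, and the width is chosen once at construction rather than
at garbage collection time as the quoted post proposes.

```java
/** Hypothetical sketch: an immutable code point sequence stored at the narrowest fixed width. */
final class CompactString {
    private final byte[] narrow;  // used when every code point is <= 0xFF
    private final char[] medium;  // used when every code point is <= 0xFFFF
    private final int[] wide;     // used for anything larger
    private final int length;

    CompactString(int[] codePoints) {
        // Code points are non-negative, so the maximum decides the width.
        int max = 0;
        for (int cp : codePoints) {
            max = Math.max(max, cp);
        }
        length = codePoints.length;
        byte[] n = null;
        char[] m = null;
        int[] w = null;
        if (max <= 0xFF) {
            n = new byte[length];
            for (int i = 0; i < length; i++) n[i] = (byte) codePoints[i];
        } else if (max <= 0xFFFF) {
            m = new char[length];
            for (int i = 0; i < length; i++) m[i] = (char) codePoints[i];
        } else {
            w = codePoints.clone();
        }
        narrow = n;
        medium = m;
        wide = w;
    }

    int bytesPerCodePoint() {
        return narrow != null ? 1 : medium != null ? 2 : 4;
    }

    int length() {
        return length;
    }

    /** Constant-time access: every element has the same width, so index arithmetic is unchanged. */
    int codePointAt(int index) {
        if (narrow != null) return narrow[index] & 0xFF;  // mask to undo sign extension
        if (medium != null) return medium[index];
        return wide[index];
    }

    /** Linear search that returns a logical index, identical for all three widths. */
    int indexOf(int codePoint) {
        for (int i = 0; i < length; i++) {
            if (codePointAt(i) == codePoint) return i;
        }
        return -1;
    }
}
```

Because each representation is fixed-width, indexing stays O(1) in all
three cases; only the element size differs. The cost is that a single
wide character forces the whole string into the 32-bit form.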

indexOf works fine with compression, but substring and charAt become
rather expensive.

Arne
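
The point of the reply can be illustrated with a small sketch that uses
plain UTF-8 as a stand-in for the "similar to UTF-8" compression
scheme, which the thread does not specify. The class Utf8Text and its
methods are hypothetical names introduced for illustration. Searching
is a single forward pass over the bytes, but reaching the n-th
character means walking every sequence before it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: text held in a variable-length (UTF-8) encoding. */
final class Utf8Text {
    private final byte[] utf8;

    Utf8Text(String s) {
        utf8 = s.getBytes(StandardCharsets.UTF_8);
    }

    /** Code point index of the first occurrence of needle, or -1. One forward pass. */
    int indexOf(Utf8Text needle) {
        byte[] n = needle.utf8;
        int cpIndex = 0;  // number of code points that start before byte i
        for (int i = 0; i + n.length <= utf8.length; i++) {
            if (matchesAt(i, n)) return cpIndex;
            // Count only lead bytes; continuation bytes look like 10xxxxxx.
            if ((utf8[i] & 0xC0) != 0x80) cpIndex++;
        }
        return -1;
    }

    private boolean matchesAt(int offset, byte[] n) {
        for (int j = 0; j < n.length; j++) {
            if (utf8[offset + j] != n[j]) return false;
        }
        return true;
    }

    /** Code point at a logical index: must skip every earlier sequence, so O(index). */
    int codePointAt(int index) {
        int pos = 0;
        for (int k = 0; k < index; k++) {
            pos += sequenceLength(utf8[pos]);
        }
        int len = sequenceLength(utf8[pos]);
        int lead = utf8[pos] & 0xFF;
        // Strip the length marker bits from the lead byte.
        int cp = (len == 1) ? lead : lead & (0xFF >> (len + 1));
        for (int j = 1; j < len; j++) {
            cp = (cp << 6) | (utf8[pos + j] & 0x3F);
        }
        return cp;
    }

    /** Length in bytes of the sequence that starts with this lead byte (well-formed input assumed). */
    private static int sequenceLength(byte leadByte) {
        int b = leadByte & 0xFF;
        if (b < 0x80) return 1;
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        return 4;
    }
}
```

A substring by character index has the same problem as codePointAt
here: both endpoints must be located by scanning before any bytes can
be copied. The sketch assumes well-formed input and does no bounds or
validity checking.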

