Path: csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <kt3kpt$nc4$1@ger.gmane.org>
References: <mailman.4618.1373613834.3114.python-list@python.org> <571a6dfe-fd66-42cf-92fc-8b97cbe6e9e4@googlegroups.com> <51DFDE65.5040001@Gmail.com> <CAN1F8qUFP3uX57HhiiUPaYqO3h_HiT8Q_YD=vCYky3EAWsdE7Q@mail.gmail.com> <mailman.4666.1373670835.3114.python-list@python.org> <4f1067f6-bc99-42ad-9166-37fb228b90e8@googlegroups.com> <mailman.5094.1374759404.3114.python-list@python.org> <51f14395$0$29971$c3e8da3$5496439d@news.astraweb.com> <mailman.5106.1374766576.3114.python-list@python.org> <51f15e03$0$29971$c3e8da3$5496439d@news.astraweb.com> <mailman.5127.1374808181.3114.python-list@python.org> <8203e802-9dc5-44c5-9547-6e1947ee224b@googlegroups.com> <mailman.5160.1374890711.3114.python-list@python.org> <f4bb2528-930e-4c0a-820e-66f00ac2b5b6@googlegroups.com> <51F53E4F.8080104@gmail.com> <kt3kpt$nc4$1@ger.gmane.org>
Date: Sun, 28 Jul 2013 19:03:48 +0100
Subject: Re: FSR and unicode compliance - was Re: RE Module Performance
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5195.1375034638.3114.python-list@python.org>
Lines: 18
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:51391

On Sun, Jul 28, 2013 at 6:36 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> I posted about a week ago, in response to Chris A., a method by which lookup
> for UTF-16 can be made O(log2 k), or perhaps more accurately,
> O(1+log2(k+1)), where k is the number of non-BMP chars in the string.
>

Which is an optimization choice that favours strings containing very
few non-BMP characters. To justify the extra complexity of out-of-band
storage, you would need to be working with almost exclusively the BMP.
That would drastically improve jmf's microbenchmarks, which work with
exactly that kind of string, but it would penalize strings that consist
almost exclusively of higher-codepoint characters. Whether it is
worthwhile, then, would have to be settled by a major survey of string
usage: are there enough strings that are mostly BMP with a few SMP
characters? Bear in mind that pure-BMP strings are already handled
better by PEP 393, so this is only of value when those mixed strings
actually occur.
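
To make the trade-off concrete, here is a rough sketch of what such an
indexed UTF-16 string might look like. It is not Terry's actual
implementation; the class name IndexedUTF16 and the attributes units and
astral are made up for illustration, and the side table is assumed to be
a plain sorted list of the code point indices that hold non-BMP
characters. Lookup does one binary search over that table, so it costs
O(log k) for k non-BMP characters, and the table itself grows linearly
with k.

```python
from bisect import bisect_left


class IndexedUTF16:
    """Hypothetical sketch: UTF-16 code units plus an out-of-band index."""

    def __init__(self, text):
        self.units = []   # UTF-16 code units, surrogate pairs included
        self.astral = []  # ascending code point indices of non-BMP chars
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                # Encode as a surrogate pair and remember where it was.
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))
                self.units.append(0xDC00 + (cp & 0x3FF))
                self.astral.append(i)
            else:
                self.units.append(cp)

    def __len__(self):
        # Every non-BMP character occupies one extra code unit.
        return len(self.units) - len(self.astral)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # Count the non-BMP characters strictly before index i; each one
        # shifts the code unit offset by one. This is the O(log k) step.
        k = bisect_left(self.astral, i)
        u = i + k
        hi = self.units[u]
        if k < len(self.astral) and self.astral[k] == i:
            # Index i is itself non-BMP: decode the surrogate pair.
            lo = self.units[u + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(hi)
```

With a string such as "a\U0001F600b\U00010000c", astral ends up as
[1, 3] and indexing returns the same characters as the built-in str. For
a string made up almost entirely of non-BMP characters, astral has nearly
one entry per character, which is the case that gets penalized.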

ChrisA