From: Chris Angelico
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: FSR and unicode compliance - was Re: RE Module Performance
Date: Sun, 28 Jul 2013 19:03:48 +0100

On Sun, Jul 28, 2013 at 6:36 PM, Terry Reedy wrote:
> I posted about a week ago, in response to Chris A., a method by which lookup
> for UTF-16 can be made O(log2 k), or perhaps more accurately,
> O(1+log2(k+1)), where k is the number of non-BMP chars in the string.
>

Which is an optimization choice that favours strings containing very
few non-BMP characters. To justify the extra complexity of out-of-band
storage, you would need to be working with almost exclusively the BMP.
That would drastically improve jmf's microbenchmarks which do exactly
that, but it would penalize strings that are almost exclusively
higher-codepoint characters. Its quality, then, would be based on a
major survey of string usage: are there enough strings with
mostly-BMP-but-a-few-SMP? Bearing in mind that pure BMP is handled
better by PEP 393, so this is only of value when there are actually
those mixed strings.

ChrisA
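
For anyone who hasn't seen the earlier post, here is a rough sketch of what an out-of-band index for UTF-16 could look like. It is only an illustration of the general idea, not Terry's actual code: the class name IndexedUTF16 and its attributes are made up for this example. It keeps the 16-bit code units in an array plus a sorted side table of the code-point positions that hold non-BMP characters, and a lookup does one binary search over that table, so the cost grows with log k rather than with the length of the string.

```python
from array import array
from bisect import bisect_left


class IndexedUTF16:
    """Hypothetical UTF-16 string with a side table of non-BMP positions."""

    def __init__(self, text):
        self.units = array("H")  # 16-bit code units
        self.astral = []         # code-point indices of non-BMP chars, ascending
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                cp -= 0x10000
                self.units.append(0xD800 | (cp >> 10))    # high surrogate
                self.units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
                self.astral.append(i)                     # out-of-band entry
            else:
                self.units.append(cp)
        self.length = len(text)

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Every non-BMP char before position i takes one extra code unit,
        # so count them with a binary search over the k table entries.
        k_before = bisect_left(self.astral, i)
        offset = i + k_before
        hi = self.units[offset]
        # The side table, not the unit value, decides whether this is a pair.
        if k_before < len(self.astral) and self.astral[k_before] == i:
            lo = self.units[offset + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(hi)


s = IndexedUTF16("a\U0001F600b\U00010000c")
chars = [s[i] for i in range(len(s))]
```

With k == 0 the table is empty and the lookup is a plain array index. With a string made almost entirely of SMP characters, the table holds nearly one entry per character on top of the two code units each character already needs, which is the penalty described above.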