Path: csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!194.109.133.86.MISMATCH!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Fri, 12 Jul 2013 13:58:29 -0400
From: Devyn Collier Johnson <devyncjohnson@gmail.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130623 Thunderbird/17.0.7
MIME-Version: 1.0
To: Chris Angelico <rosuav@gmail.com>, Python Mailing List <python-list@python.org>
Subject: Re: RE Module Performance
References: <mailman.4618.1373613834.3114.python-list@python.org> <571a6dfe-fd66-42cf-92fc-8b97cbe6e9e4@googlegroups.com> <51DFDE65.5040001@Gmail.com> <CAPTjJmpqtq_RVTqeSfABa_nRKKEJ3ZrZ2uGw5Y2xt+B0nbt=zQ@mail.gmail.com>
In-Reply-To: <CAPTjJmpqtq_RVTqeSfABa_nRKKEJ3ZrZ2uGw5Y2xt+B0nbt=zQ@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.4656.1373657874.3114.python-list@python.org>
Lines: 143
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:50552


On 07/12/2013 12:21 PM, Chris Angelico wrote:
> On Fri, Jul 12, 2013 at 8:45 PM, Devyn Collier Johnson
> <devyncjohnson@gmail.com> wrote:
>> Could you explain what you mean? What and where is the new Flexible String
>> Representation?
> (You're top-posting again. Please put your text underneath what you're
> responding to - it helps maintain flow and structure.)
>
> Python versions up to and including 3.2 came in two varieties: narrow
> builds (commonly found on Windows) and wide builds (commonly found on
> Linux). Narrow builds internally represented Unicode strings in
> UTF-16, while wide builds used UTF-32. This is a problem, because it
> means that taking a program from one to another actually changes its
> behaviour:
>
> Python 2.6.4 (r264:75706, Dec  7 2009, 18:45:15)
> [GCC 4.4.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> len(u"\U00012345")
> 1
>
> Python 2.7.4 (default, Apr  6 2013, 19:54:46) [MSC v.1500 32 bit
> (Intel)] on win32
>>>> len(u"\U00012345")
> 2
>
> In fact, the narrow builds are flat-out buggy, because you can put
> something in as a single character that simply isn't a single
> character. You can then pull that out as two characters and make a
> huge mess of things:
>
>>>> s=u"\U00012345"
>>>> s[0]
> u'\ud808'
>>>> s[1]
> u'\udf45'
>
> *Any* string indexing will be broken if there is a single character
>> U+FFFF ahead of it in the string.
> Now, this problem is not unique to Python. Heaps of other languages
> have the same issue, the same buggy behaviour, the same compromises.
> What's special about Python is that it actually managed to come back
> from that problem. (Google's V8 JavaScript engine, for instance, is
> stuck with it, because the ECMAScript specification demands UTF-16. I
> asked on an ECMAScript list and was told "can't change that, it'd
> break code". So it's staying buggy.)
>
> There are a number of languages that take the Texan RAM-guzzling
> approach of storing all strings in UTF-32; Python (since version 3.3)
> is among a *very* small number of languages that store strings in
> multiple different ways according to their content. That's described
> in PEP 393 [1], titled "Flexible String Representation". It details a
> means whereby a Python string will be represented in, effectively,
> UTF-32 with some of the leading zero bytes elided. Or if you prefer,
> in either Latin-1, UCS-2, or UCS-4, whichever's the smallest it can
> fit in. The difference between a string stored one-byte-per-character
> and a string stored four-bytes-per-character is almost invisible to a
> Python script; you can find out by checking the string's memory usage,
> but otherwise you don't need to worry about it.
>
> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
>>>> sys.getsizeof("asdfasdfasdfasd")
> 40
>>>> sys.getsizeof("asdfasdfasdfasdf")
> 41
>
> Adding another character adds another 1 byte. (There's quite a bit of
> overhead for small strings - GC headers and such - but it gets dwarfed
> by the actual content after a while.)
>
>>>> sys.getsizeof("\u1000sdfasdfasdfasd")
> 68
>>>> sys.getsizeof("\u1000sdfasdfasdfasdf")
> 70
>
> Two bytes to add another character.
>
>>>> sys.getsizeof("\U00010001sdfasdfasdfasd")
> 100
>>>> sys.getsizeof("\U00010001sdfasdfasdfasdf")
> 104
>
> Four bytes. It uses only what it needs.
>
> Strings in Python are immutable, so there's no need to worry about
> up-grading or down-grading a string; there are a few optimizations
> that can't be done, but they're fairly trivial. Look, I'll pull a jmf
> and find a microbenchmark that makes 3.3 look worse:
>
> 2.7.4:
>>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
> [0.8175005482540385, 0.789617954237201, 0.8152240019332098]
>>>> timeit.repeat('a=u"A"*100; a+=u"a"')
> [0.8088905154146744, 0.8123691698246631, 0.8172558244134365]
>
> 3.3.0:
>>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
> [0.9623714745976031, 0.970628669281723, 0.9696310564468149]
>>>> timeit.repeat('a=u"A"*100; a+=u"a"')
> [0.7017891938739922, 0.7024725209339522, 0.6989539173082449]
>
> See? It's clearly worse on the newer Python! But actually, this is an
> extremely unusual situation, and 3.3 outperforms 2.7 on the more
> common case (where the two strings are of the same width).
>
> Python's PEP 393 strings are following the same sort of model as the
> native string type in a semantically-similar but
> syntactically-different language, Pike. In Pike (also free software,
> like Python), the string type can be indexed character by character,
> and each character can be anything in the Unicode range; and just as
> in Python 3.3, memory usage goes up by just one byte if every
> character in the string fits inside 8 bits. So it's not as if this is
> an untested notion; Pike has been running like this for years (I don't
> know how long it's had this functionality, but it won't be more than
> 18 years as Unicode didn't have multiple planes until 1996), and
> performance has been *just fine* for all that time. Pike tends to be
> run on servers, so memory usage and computation speed translate fairly
> directly into TPS. And there are some sizeable commercial entities
> using and developing Pike, so if the flexible string representation
> had turned out to be a flop, someone would have put in the coding time
> to rewrite it by now.
>
> And yet, despite all these excellent reasons for moving to this way of
> doing strings, jmf still sees his microbenchmarks as more important,
> and so he jumps in on threads like this to whine about how Python 3.3
> is somehow US-centric because it more efficiently handles the entire
> Unicode range. I'd really like to take some highlights from Python and
> Pike and start recommending that other languages take up the ideas,
> but to be honest, I hesitate to inflict jmf on them all. ECMAScript
> may have the right idea after all - stay with UTF-16 and avoid
> answering jmf's stupid objections every week.
>
> [1] http://www.python.org/dev/peps/pep-0393/
>
> ChrisA
Thanks for the thorough response. I learned a lot. You should write 
articles on Python.
I plan to spend some time optimizing the re.py module for Unix systems. 
I would love to amp up my programs that use that module.

Devyn Collier Johnson