Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #50552
| Date | 2013-07-12 13:58 -0400 |
|---|---|
| From | Devyn Collier Johnson <devyncjohnson@gmail.com> |
| Subject | Re: RE Module Performance |
| References | <mailman.4618.1373613834.3114.python-list@python.org> <571a6dfe-fd66-42cf-92fc-8b97cbe6e9e4@googlegroups.com> <51DFDE65.5040001@Gmail.com> <CAPTjJmpqtq_RVTqeSfABa_nRKKEJ3ZrZ2uGw5Y2xt+B0nbt=zQ@mail.gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.4656.1373657874.3114.python-list@python.org> (permalink) |
On 07/12/2013 12:21 PM, Chris Angelico wrote:
> On Fri, Jul 12, 2013 at 8:45 PM, Devyn Collier Johnson
> <devyncjohnson@gmail.com> wrote:
>> Could you explain what you mean? What and where is the new Flexible String
>> Representation?
> (You're top-posting again. Please put your text underneath what you're
> responding to - it helps maintain flow and structure.)
>
> Python versions up to and including 3.2 came in two varieties: narrow
> builds (commonly found on Windows) and wide builds (commonly found on
> Linux). Narrow builds internally represented Unicode strings in
> UTF-16, while wide builds used UTF-32. This is a problem, because it
> means that taking a program from one to another actually changes its
> behaviour:
>
> Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
> [GCC 4.4.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> len(u"\U00012345")
> 1
>
> Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit
> (Intel)] on win32
>>>> len(u"\U00012345")
> 2
>
> In fact, the narrow builds are flat-out buggy, because you can put
> something in as a single character that simply isn't a single
> character. You can then pull that out as two characters and make a
> huge mess of things:
>
>>>> s=u"\U00012345"
>>>> s[0]
> u'\ud808'
>>>> s[1]
> u'\udf45'
>
> *Any* string indexing will be broken if there is a single character
>> U+FFFF ahead of it in the string.
> Now, this problem is not unique to Python. Heaps of other languages
> have the same issue, the same buggy behaviour, the same compromises.
> What's special about Python is that it actually managed to come back
> from that problem. (Google's V8 JavaScript engine, for instance, is
> stuck with it, because the ECMAScript specification demands UTF-16. I
> asked on an ECMAScript list and was told "can't change that, it'd
> break code". So it's staying buggy.)
>
> There are a number of languages that take the Texan RAM-guzzling
> approach of storing all strings in UTF-32; Python (since version 3.3)
> is among a *very* small number of languages that store strings in
> multiple different ways according to their content. That's described
> in PEP 393 [1], titled "Flexible String Representation". It details a
> means whereby a Python string will be represented in, effectively,
> UTF-32 with some of the leading zero bytes elided. Or if you prefer,
> in either Latin-1, UCS-2, or UCS-4, whichever's the smallest it can
> fit in. The difference between a string stored one-byte-per-character
> and a string stored four-bytes-per-character is almost invisible to a
> Python script; you can find out by checking the string's memory usage,
> but otherwise you don't need to worry about it.
>
> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
>>>> sys.getsizeof("asdfasdfasdfasd")
> 40
>>>> sys.getsizeof("asdfasdfasdfasdf")
> 41
>
> Adding another character adds another 1 byte. (There's quite a bit of
> overhead for small strings - GC headers and such - but it gets dwarfed
> by the actual content after a while.)
>
>>>> sys.getsizeof("\u1000sdfasdfasdfasd")
> 68
>>>> sys.getsizeof("\u1000sdfasdfasdfasdf")
> 70
>
> Two bytes to add another character.
>
>>>> sys.getsizeof("\U00010001sdfasdfasdfasd")
> 100
>>>> sys.getsizeof("\U00010001sdfasdfasdfasdf")
> 104
>
> Four bytes. It uses only what it needs.
>
> Strings in Python are immutable, so there's no need to worry about
> up-grading or down-grading a string; there are a few optimizations
> that can't be done, but they're fairly trivial. Look, I'll pull a jmf
> and find a microbenchmark that makes 3.3 look worse:
>
> 2.7.4:
>>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
> [0.8175005482540385, 0.789617954237201, 0.8152240019332098]
>>>> timeit.repeat('a=u"A"*100; a+=u"a"')
> [0.8088905154146744, 0.8123691698246631, 0.8172558244134365]
>
> 3.3.0:
>>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
> [0.9623714745976031, 0.970628669281723, 0.9696310564468149]
>>>> timeit.repeat('a=u"A"*100; a+=u"a"')
> [0.7017891938739922, 0.7024725209339522, 0.6989539173082449]
>
> See? It's clearly worse on the newer Python! But actually, this is an
> extremely unusual situation, and 3.3 outperforms 2.7 on the more
> common case (where the two strings are of the same width).
>
> Python's PEP 393 strings are following the same sort of model as the
> native string type in a semantically-similar but
> syntactically-different language, Pike. In Pike (also free software,
> like Python), the string type can be indexed character by character,
> and each character can be anything in the Unicode range; and just as
> in Python 3.3, memory usage goes up by just one byte if every
> character in the string fits inside 8 bits. So it's not as if this is
> an untested notion; Pike has been running like this for years (I don't
> know how long it's had this functionality, but it won't be more than
> 18 years as Unicode didn't have multiple planes until 1996), and
> performance has been *just fine* for all that time. Pike tends to be
> run on servers, so memory usage and computation speed translate fairly
> directly into TPS. And there are some sizeable commercial entities
> using and developing Pike, so if the flexible string representation
> had turned out to be a flop, someone would have put in the coding time
> to rewrite it by now.
>
> And yet, despite all these excellent reasons for moving to this way of
> doing strings, jmf still sees his microbenchmarks as more important,
> and so he jumps in on threads like this to whine about how Python 3.3
> is somehow US-centric because it more efficiently handles the entire
> Unicode range. I'd really like to take some highlights from Python and
> Pike and start recommending that other languages take up the ideas,
> but to be honest, I hesitate to inflict jmf on them all. ECMAScript
> may have the right idea after all - stay with UTF-16 and avoid
> answering jmf's stupid objections every week.
>
> [1] http://www.python.org/dev/peps/pep-0393/
>
> ChrisA
Thanks for the thorough response. I learned a lot. You should write
articles on Python.
I plan to spend some time optimizing the re.py module for Unix systems.
I would love to amp up my programs that use that module.
Devyn Collier Johnson
Back to comp.lang.python | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-11 19:44 -0400
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-12 02:23 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:27 +1000
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 10:39 +0100
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:40 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 06:45 -0400
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 16:59 +0100
Re: RE Module Performance Peter Otten <__peter__@web.de> - 2013-07-12 18:15 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-13 02:21 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 13:58 -0400
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 05:37 +0000
Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-14 11:17 -0700
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 06:06 -0400
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-15 12:36 +0000
Dihedral Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 08:52 -0400
Re: Dihedral Joel Goldstick <joel.goldstick@gmail.com> - 2013-07-15 09:03 -0400
Re: Dihedral Wayne Werner <wayne@waynewerner.com> - 2013-07-15 17:43 -0500
Re: Dihedral Fábio Santos <fabiosantosart@gmail.com> - 2013-07-15 23:54 +0100
Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-16 08:59 +1000
Re: Dihedral Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-16 16:06 +1000
Re: Dihedral Stefan Behnel <stefan_ml@behnel.de> - 2013-07-24 20:08 +0200
Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:23 +1000
Re: Dihedral Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-24 20:15 -0400
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-13 08:16 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-12 17:13 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-24 06:40 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-24 23:48 +1000
Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:17 -0400
Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:19 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:34 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:02 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:39 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 08:47 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 02:27 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:14 +1000
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 12:07 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 05:18 +1000
RE: RE Module Performance "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2013-07-25 19:30 +0000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:06 -0600
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 09:00 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 05:56 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:56 +1000
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 13:52 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:15 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:15 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:58 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 09:22 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:07 +1000
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 18:09 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 08:19 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 16:59 -0600
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 09:24 +1000
Re: RE Module Performance Serhiy Storchaka <storchaka@gmail.com> - 2013-07-25 08:49 +0300
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 15:58 +1000
Re: RE Module Performance Jeremy Sanders <jeremy@jeremysanders.net> - 2013-07-25 14:36 +0100
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 15:26 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 01:36 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 17:18 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 03:27 +1000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:45 -0500
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-26 02:48 +0000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 21:20 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:36 -0700
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 08:46 -0700
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 06:28 +0000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 03:37 +0000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-26 22:12 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 05:04 +0000
Re: RE Module Performance Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-27 12:13 -0400
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:19 -0700
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:09 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:21 -0700
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-26 20:05 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-27 11:21 -0700
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-27 21:53 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 11:13 -0700
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:04 +0100
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:30 -0700
Re: RE Module Performance Lele Gaifax <lele@metapensiero.it> - 2013-07-28 22:45 +0200
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 22:01 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 07:01 -0700
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 16:38 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 15:45 +0100
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 17:13 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 18:39 +0200
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 18:14 +0100
Re: RE Module Performance Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-31 13:09 +1000
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-31 03:27 +1000
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-30 18:40 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 20:19 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 12:09 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 21:04 +0100
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:54 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-31 05:45 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 08:17 +0100
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 13:15 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 21:41 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:11 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 01:32 -0700
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:59 +0200
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:44 -0600
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-30 17:05 -0400
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:30 -0600
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 09:23 +0200
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:27 -0600
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 10:45 +0200
FSR and unicode compliance - was Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-28 09:52 -0600
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:23 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:44 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 21:55 +0200
Re: FSR and unicode compliance - was Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-28 20:52 +0000
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 04:43 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 12:57 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 05:56 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 07:20 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 15:49 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 09:31 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Heiko Wundram <modelnine@modelnine.org> - 2013-07-29 14:06 +0200
Re: FSR and unicode compliance - was Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-29 08:43 -0400
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 18:03 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 13:36 -0400
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 06:36 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:03 +0100
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 19:19 +0100
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:29 +0100
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 15:06 -0400
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 23:14 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 20:51 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 00:07 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-26 22:38 +0200
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-25 09:44 -0400
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:53 -0500
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-13 00:16 +0100
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-14 05:34 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-16 06:30 -0400
Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-18 13:17 -0700
csiph-web