From: Devyn Collier Johnson
Date: Fri, 12 Jul 2013 13:58:29 -0400
To: Chris Angelico, Python Mailing List
Subject: Re: RE Module Performance
Newsgroups: comp.lang.python

On 07/12/2013 12:21 PM, Chris Angelico wrote:
> On Fri, Jul 12, 2013 at 8:45 PM, Devyn Collier Johnson wrote:
>> Could you explain what you mean? What and where is the new Flexible String
>> Representation?
> (You're top-posting again. Please put your text underneath what you're
> responding to - it helps maintain flow and structure.)
>
> Python versions up to and including 3.2 came in two varieties: narrow
> builds (commonly found on Windows) and wide builds (commonly found on
> Linux). Narrow builds internally represented Unicode strings in
> UTF-16, while wide builds used UTF-32. This is a problem, because it
> means that taking a program from one to another actually changes its
> behaviour:
>
> Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
> [GCC 4.4.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> len(u"\U00012345")
> 1
>
> Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit
> (Intel)] on win32
> >>> len(u"\U00012345")
> 2
>
> In fact, the narrow builds are flat-out buggy, because you can put
> something in as a single character that simply isn't a single
> character.
> You can then pull that out as two characters and make a huge mess of
> things:
>
> >>> s=u"\U00012345"
> >>> s[0]
> u'\ud808'
> >>> s[1]
> u'\udf45'
>
> *Any* string indexing will be broken if there is a single
> character >U+FFFF ahead of it in the string.
>
> Now, this problem is not unique to Python. Heaps of other languages
> have the same issue, the same buggy behaviour, the same compromises.
> What's special about Python is that it actually managed to come back
> from that problem. (Google's V8 JavaScript engine, for instance, is
> stuck with it, because the ECMAScript specification demands UTF-16. I
> asked on an ECMAScript list and was told "can't change that, it'd
> break code". So it's staying buggy.)
>
> There are a number of languages that take the Texan RAM-guzzling
> approach of storing all strings in UTF-32; Python (since version 3.3)
> is among a *very* small number of languages that store strings in
> multiple different ways according to their content. That's described
> in PEP 393 [1], titled "Flexible String Representation". It details a
> means whereby a Python string will be represented in, effectively,
> UTF-32 with some of the leading zero bytes elided. Or if you prefer,
> in either Latin-1, UCS-2, or UCS-4, whichever's the smallest it can
> fit in. The difference between a string stored one-byte-per-character
> and a string stored four-bytes-per-character is almost invisible to a
> Python script; you can find out by checking the string's memory usage,
> but otherwise you don't need to worry about it.
>
> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
> >>> sys.getsizeof("asdfasdfasdfasd")
> 40
> >>> sys.getsizeof("asdfasdfasdfasdf")
> 41
>
> Adding another character adds another 1 byte. (There's quite a bit of
> overhead for small strings - GC headers and such - but it gets dwarfed
> by the actual content after a while.)
>
> >>> sys.getsizeof("\u1000sdfasdfasdfasd")
> 68
> >>> sys.getsizeof("\u1000sdfasdfasdfasdf")
> 70
>
> Two bytes to add another character.
>
> >>> sys.getsizeof("\U00010001sdfasdfasdfasd")
> 100
> >>> sys.getsizeof("\U00010001sdfasdfasdfasdf")
> 104
>
> Four bytes. It uses only what it needs.
>
> Strings in Python are immutable, so there's no need to worry about
> up-grading or down-grading a string; there are a few optimizations
> that can't be done, but they're fairly trivial. Look, I'll pull a jmf
> and find a microbenchmark that makes 3.3 look worse:
>
> 2.7.4:
> >>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
> [0.8175005482540385, 0.789617954237201, 0.8152240019332098]
> >>> timeit.repeat('a=u"A"*100; a+=u"a"')
> [0.8088905154146744, 0.8123691698246631, 0.8172558244134365]
>
> 3.3.0:
> >>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
> [0.9623714745976031, 0.970628669281723, 0.9696310564468149]
> >>> timeit.repeat('a=u"A"*100; a+=u"a"')
> [0.7017891938739922, 0.7024725209339522, 0.6989539173082449]
>
> See? It's clearly worse on the newer Python! But actually, this is an
> extremely unusual situation, and 3.3 outperforms 2.7 on the more
> common case (where the two strings are of the same width).
>
> Python's PEP 393 strings are following the same sort of model as the
> native string type in a semantically-similar but
> syntactically-different language, Pike. In Pike (also free software,
> like Python), the string type can be indexed character by character,
> and each character can be anything in the Unicode range; and just as
> in Python 3.3, memory usage goes up by just one byte if every
> character in the string fits inside 8 bits. So it's not as if this is
> an untested notion; Pike has been running like this for years (I don't
> know how long it's had this functionality, but it won't be more than
> 18 years as Unicode didn't have multiple planes until 1996), and
> performance has been *just fine* for all that time.
> Pike tends to be run on servers, so memory usage and computation
> speed translate fairly directly into TPS. And there are some sizeable
> commercial entities using and developing Pike, so if the flexible
> string representation had turned out to be a flop, someone would have
> put in the coding time to rewrite it by now.
>
> And yet, despite all these excellent reasons for moving to this way of
> doing strings, jmf still sees his microbenchmarks as more important,
> and so he jumps in on threads like this to whine about how Python 3.3
> is somehow US-centric because it more efficiently handles the entire
> Unicode range. I'd really like to take some highlights from Python and
> Pike and start recommending that other languages take up the ideas,
> but to be honest, I hesitate to inflict jmf on them all. ECMAScript
> may have the right idea after all - stay with UTF-16 and avoid
> answering jmf's stupid objections every week.
>
> [1] http://www.python.org/dev/peps/pep-0393/
>
> ChrisA

Thanks for the thorough response. I learned a lot. You should write
articles on Python. I plan to spend some time optimizing the re.py
module for Unix systems. I would love to amp up my programs that use
that module.

Devyn Collier Johnson
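
For anyone who wants to check the quoted observations on a current interpreter, here is a minimal sketch. It assumes CPython 3.3 or later (sys.getsizeof is implementation-specific and may not behave this way elsewhere). The helper names bytes_per_char and utf16_code_units are made up for this illustration and are not part of any library. The first helper measures the marginal cost of one more character by differencing two sizes; the second shows the UTF-16 code units that a narrow build exposed as separate "characters".

```python
import sys


def bytes_per_char(ch, n=16):
    """Marginal storage cost, in bytes, of one more copy of ch.

    Hypothetical helper: it subtracts the size of an n-character string
    from the size of an (n + 1)-character string, so the fixed object
    header cancels out.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


def utf16_code_units(s):
    """Return the 16-bit code units of s when encoded as UTF-16.

    Hypothetical helper: these are the units a narrow build indexed,
    so a character above U+FFFF shows up as two of them.
    """
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print(bytes_per_char("a"))            # ASCII text
print(bytes_per_char("\u1000"))       # a BMP character beyond Latin-1
print(bytes_per_char("\U00010001"))   # a character outside the BMP
print([hex(u) for u in utf16_code_units("\U00012345")])
print(len("\U00012345"))              # one character on 3.3 and later
```

Differencing two sizes, rather than reading one absolute value, keeps the result independent of the per-object overhead, which differs between 32-bit and 64-bit builds and between Python versions.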