References: <23a42297-9262-4ace-87ad-138999b1ddd6@z3g2000vbg.googlegroups.com> <2992273.neLn1eVAPo@PointedEars.de>
Date: Thu, 14 Mar 2013 11:19:11 +1100
Subject: Re: String performance regression from python 3.2 to 3.3
From: Chris Angelico
To: python-list@python.org
Newsgroups: comp.lang.python

On Thu, Mar 14, 2013 at 4:42 AM, Thomas 'PointedEars' Lahn wrote:
> Chris Angelico wrote:
>
>> On Wed, Mar 13, 2013 at 9:11 PM, rusi wrote:
>>> Uhhh..
>>> Making the subject line useful for all readers
>>
>> I should have read this one before replying in the other thread.
>>
>> jmf, I'd like to see evidence that there has been a performance
>> regression compared against a wide build of Python 3.2. You still have
>> never answered this fundamental, that the narrow builds of Python are
>> *BUGGY* in the same way that JavaScript/ECMAScript is.
>
> Interesting. From my work I was under the impression that I knew ECMAScript
> and its implementations fairly well, yet I have never heard of this before.
>
> What do you mean by “narrow build” and “wide build” and what exactly is the
> bug “narrow builds” of Python 3.2 have in common with JavaScript/ECMAScript?
> To which implementation of ECMAScript are you referring – or are you
> referring to the Specification as such?

The ECMAScript spec says that strings are stored and represented in
UTF-16. Python versions up to 3.2 came in two varieties: narrow, which
included (I believe) the Windows builds available on python.org, and
wide, which was (again, I think) the default Linux config. The problem
predates Python 3 and its default string being Unicode - the Py2
unicode type has the same issue:

Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit
(Intel)] on win32
>>> u"\U00012345"
u'\U00012345'
>>> len(_)
2

Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\U00012345"
u'\U00012345'
>>> len(_)
1

The first is the Python from the msi installer, the second is the
default system Python on Ubuntu 10.10. The exact same code does
different things on different platforms, and on the Windows (narrow)
build, it's possible to split a surrogate pair:

>>> u"\U00012345"[0]
u'\ud808'
>>> u"\U00012345"[1]
u'\udf45'

You can see the same thing in JavaScript too. Here's a little demo I
just knocked together:
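
The JavaScript demo itself is not reproduced here. As a rough stand-in
(not the original code), here is a minimal Python 3 sketch that lists
the UTF-16 code units of a string, which is roughly what indexing on a
narrow build or charCodeAt in JavaScript would show. The helper names
utf16_code_units and show are made up for illustration.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def show(text):
    """Print one line per UTF-16 code unit: its index and hex value."""
    for index, unit in enumerate(utf16_code_units(text)):
        print("%d: %04x" % (index, unit))


# One BMP character followed by one astral character (U+12345).
show("a\U00012345")
# 0: 0061
# 1: d808
# 2: df45
```

The astral character takes two indices, matching the u'\ud808' and
u'\udf45' halves in the narrow-build session above. Note that a string
already containing a lone surrogate will not encode this way; the
sketch assumes well-formed input.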

Give it an ASCII string and you'll see, as expected, one index (based
on string indexing or charCodeAt, same thing) for each character. Same
if it's all BMP. But put an astral character in and you'll see
00.00.d8.00/24 (oh wait, CIDR notation doesn't work in Unicode) come
up. I raised this issue on the Google V8 list and on the ECMAScript
list es-discuss@mozilla.org, and was basically told that since
JavaScript has been buggy for so long, there's no chance of ever
making it bug-free:

https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html

Fortunately for Python, there are version numbers, and policies that
permit bugs to actually get fixed. (Which is why, for instance, Debian
Squeeze still ships Python 2.6 rather than upgrading to 2.7 - in case
some script is broken by that change. Can't do that with web
browsers.) As of Python 3.3, all Pythons function the same way: it's
semantically a "wide build" (UTF-32), but with a memory usage
optimization. That's how it needs to be.

ChrisA
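
To illustrate the last point, a minimal sketch assuming CPython 3.3 or
later: length and indexing count code points on every platform, while
the memory footprint depends on the widest character in the string.
The function name describe is hypothetical, and the exact byte counts
from sys.getsizeof vary between versions, so none are shown.

```python
import sys


def describe(sample):
    """Return (length in code points, size in bytes) for a string."""
    return len(sample), sys.getsizeof(sample)


astral = "\U00012345"
print(len(astral))          # 1: one code point, on any platform
print(astral[0] == astral)  # True: indexing never splits the character

# Same length, different storage width per character:
# ASCII fits in 1 byte each, U+0101 needs 2, U+12345 needs 4.
for sample in ("a" * 1000, "\u0101" * 1000, "\U00012345" * 1000):
    length, size = describe(sample)
    print("U+%04X x %d -> %d bytes" % (ord(sample[0]), length, size))
```

All three strings report a length of 1000; only the byte size grows.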