From: Chris Angelico
To: python-list@python.org
Subject: Re: String performance regression from python 3.2 to 3.3
Date: Sat, 16 Mar 2013 14:59:47 +1100
In-Reply-To: <2202673.rtQqbKup0V@PointedEars.de>
References: <23a42297-9262-4ace-87ad-138999b1ddd6@z3g2000vbg.googlegroups.com> <2992273.neLn1eVAPo@PointedEars.de> <2202673.rtQqbKup0V@PointedEars.de>
Newsgroups: comp.lang.python

On Sat, Mar 16, 2013 at 1:44 PM, Thomas 'PointedEars' Lahn wrote:
> Chris Angelico wrote:
>> The ECMAScript spec says that strings are stored and represented in
>> UTF-16.
>
> No, it does not (which Edition?). It says in Edition 5.1:

Okay, I was sloppy in my terminology. A language will seldom, if ever,
specify the actual storage. But it does specify a representation (to
the script) of UTF-16, and I seriously cannot imagine any reason for an
implementation to store a string in any other way, given that string
indexing is specifically based on UTF-16:

> | The length of a
> | String is the number of elements (i.e., 16-bit values) within it.
> |
> | […]
> | When a String contains actual textual data, each element is considered to
> | be a single UTF-16 code unit. Whether or not this is the actual storage
> | format of a String, the characters within a String are numbered by
> | their initial code unit element position as though they were represented
> | using UTF-16.

So, yes, it could be stored in some other way, but in terms of what I
was saying (comparing against Python 3.2 and 3.3), it's still a
specification that doesn't allow for the change that Python did. If
narrow builds are all you compare against (as jmf does), then Python
3.2 is exactly like ECMAScript, and Python 3.3 isn't.

>> You can see the same thing in Javascript too. Here's a little demo I
>> just knocked together:
>>
>> value="Show">
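
As a side illustration of the numbering rule in the spec excerpt above
(this is not the JavaScript demo being quoted), here is a minimal
Python sketch. It assumes Python 3.3 or later, where iterating a str
yields whole code points. The helper names utf16_units and
char_positions are made up for this example.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def char_positions(s):
    """Pair each character with the index of its initial code unit,
    i.e. the position an ECMAScript-style string would report."""
    positions = []
    index = 0
    for ch in s:
        positions.append((index, ch))
        # Code points beyond the BMP take two code units (a surrogate pair).
        index += 2 if ord(ch) > 0xFFFF else 1
    return positions


astral = "a\U0001D11Eb"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(len(astral))                             # 3 code points
print([hex(u) for u in utf16_units(astral)])   # ['0x61', '0xd834', '0xdd1e', '0x62']
print([i for i, _ in char_positions(astral)])  # [0, 1, 3]
```

With a BMP-only string the two views agree. The astral character takes
two code units, so "b" sits at code unit index 3 even though it is the
third character.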
> What an awful piece of code.

Ehh, it's designed to be short, not beautiful. Got any serious
criticisms of it? It demonstrates what I'm talking about without being
a page of code.

>> Give it an ASCII string
>
> You mean a string of Unicode characters that can also be represented with
> the US-ASCII encoding. There are no "ASCII strings" in conforming
> ECMAScript implementations. And a string of Unicode characters with code
> points within the BMP will suffice already.

You can get a string of ASCII characters and paste them into the entry
field. They'll be turned into Unicode characters before the script
sees them. But yes, okay, my terminology was a bit sloppy.

>> and you'll see, as expected, one index (based on string indexing or
>> charCodeAt, same thing) for each character. Same if it's all BMP. But put
>> an astral character in and you'll see 00.00.d8.00/24 (oh wait, CIDR
>> notation doesn't work in Unicode) come up. I raised this issue on the
>> Google V8 list and on the ECMAScript list es-discuss@mozilla.org, and was
>> basically told that since JavaScript has been buggy for so long, there's
>> no chance of ever making it bug-free:
>>
>> https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html
>
> You misunderstand, and I am not buying Rick's answer. The problem is not
> that String values are defined as units of 16 bits. The problem is that the
> length of a primitive String value in ECMAScript, and the position of a
> character, is defined in terms of 16-bit units instead of characters. There
> is no bug, because ECMAScript specifies that Unicode characters beyond the
> Basic Multilingual Plane (BMP) need not be supported:

So what you're saying is that an ES implementation is allowed to be
even buggier than I described, and that's somehow a justification?

> People have found ways to make this work in ECMAScript implementations.
> For example, it is possible to scan a normalized string for lead surrogates:

And it's possible to write a fully conforming Unicode handler in C,
using char[] and relying on (say) UTF-8 encoding. That has nothing to
do with the language actually providing facilities.

> But yes, there should be native support for Unicode characters with code
> points beyond the BMP, and evidently that does _not_ require a second
> language; just a few tweaks to the algorithms.

No, it requires either a complete change of the language, or the
acceptance that O(1) operations can now become O(n) on the length of
the string (if the string is left in UTF-16 but indexed in Unicode),
or the creation of a new user-space data type (which then has to be
converted any time it's given to any standard library function).

>> Fortunately for Python, there are version numbers, and policies that
>> permit bugs to actually get fixed. (Which is why, for instance, Debian
>> Squeeze still ships Python 2.6 rather than upgrading to 2.7 - in case
>> some script is broken by that change.
>
> Debian already ships Python 3.1 in Stable, disproving your argument.

Separate branch. Debian stable ships one from each branch; Debian
unstable does, too (2.7.3 and 3.2.3). Same argument applies to each,
though - even Debian unstable hasn't yet introduced Python 3.3, in
case it breaks stuff. Argument not disproved.

>> Can't do that with web browsers.)
>
> Yes, you could. It has been done before.

Not easily. Assuming you can't make one a perfect super/subset of the
other (as with "use strict"), it needs to be done as a completely
separate language. Now, maybe it's time the