From: Ian Kelly
Newsgroups: comp.lang.python
Subject: Re: Py 3.3, unicode / upper()
Date: Wed, 19 Dec 2012 11:27:38 -0700

On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico wrote:
> You may not be familiar with jmf. He's one of our resident trolls, and
> he has a bee in his bonnet about PEP 393 strings, on the basis that
> they take up more space in memory than a narrow build of Python 3.2
> would, for a string with lots of BMP characters and one non-BMP. In
> 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> pairs* for non-BMP characters. This means that len() counts them
> twice, as does string indexing/slicing. That's a major bug, especially
> as your Python code will do different things on different platforms -
> most Linux builds of 3.2 are "wide" builds, storing characters in four
> bytes each.

From what I've been able to discern, his actual complaint about PEP 393 stems from misguided moral concerns.
With PEP 393, strings that can be fully represented in Latin-1 are stored at one byte per character, which is half the space (ignoring fixed overhead) of an equally long string containing at least one non-Latin-1 BMP character, since that needs two bytes per character. jmf thinks this optimization is unfair to non-English users and immoral; he wants Latin-1 strings to be treated exactly like non-Latin-1 strings. (I don't think he actually cares about non-BMP strings at all; if narrow-build Unicode is good enough for him, then it must be good enough for everybody.)

Unfortunately for him, the Latin-1 optimization is rather trivial in the wider context of PEP 393, and simply removing that part alone clearly wouldn't be doing anybody any favors. So for him to get what he wants, the entire PEP has to go. It's rather like trying to solve the problem of wealth disparity by forcing everyone to dump their excess wealth into the ocean.
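
For anyone who wants to see the storage difference for themselves, here is a rough sketch, assuming CPython 3.3 or later (other implementations may store strings differently). The helper name bytes_per_char is made up for this illustration. It measures two strings of different lengths and subtracts, so the fixed per-object overhead cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of `ch`.

    Comparing a string of length 2*n with one of length n cancels the
    fixed object overhead, leaving just the per-character cost.
    Assumes CPython 3.3+ (PEP 393 flexible string representation).
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

print(bytes_per_char("a"))           # ASCII: 1.0 on CPython 3.3+
print(bytes_per_char("\xe9"))        # Latin-1 (e-acute): 1.0
print(bytes_per_char("\u20ac"))      # BMP, outside Latin-1 (euro sign): 2.0
print(bytes_per_char("\U0001F600"))  # non-BMP: 4.0

# A single wider character widens the storage of the whole string.
narrow = "a" * 1001
widened = "a" * 1000 + "\u20ac"
print(sys.getsizeof(widened) > sys.getsizeof(narrow))  # True

# With PEP 393 a non-BMP character counts once, on every platform.
print(len("\U0001F600"))  # 1
```

On a 3.2 narrow build the last line would have printed 2, which is the surrogate-pair behaviour Chris describes above.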