From: wxjmfauth@gmail.com
Newsgroups: comp.lang.python
Subject: Re: Py 3.3, unicode / upper()
Date: Wed, 19 Dec 2012 13:18:05 -0800 (PST)

On Wednesday, 19 December 2012 19:27:38 UTC+1, Ian wrote:
> On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico wrote:
> > You may not be familiar with jmf. He's one of our resident trolls, and
> > he has a bee in his bonnet about PEP 393 strings, on the basis that
> > they take up more space in memory than a narrow build of Python 3.2
> > would, for a string with lots of BMP characters and one non-BMP. In
> > 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> > pairs* for non-BMP characters. This means that len() counts them
> > twice, as does string indexing/slicing. That's a major bug, especially
> > as your Python code will do different things on different platforms -
> > most Linux builds of 3.2 are "wide" builds, storing characters in four
> > bytes each.
>
> From what I've been able to discern, his actual complaint about PEP
> 393 stems from misguided moral concerns.
> With PEP-393, strings that
> can be fully represented in Latin-1 can be stored in half the space
> (ignoring fixed overhead) compared to strings containing at least one
> non-Latin-1 character. jmf thinks this optimization is unfair to
> non-English users and immoral; he wants Latin-1 strings to be treated
> exactly like non-Latin-1 strings (I don't think he actually cares
> about non-BMP strings at all; if narrow-build Unicode is good enough
> for him, then it must be good enough for everybody). Unfortunately
> for him, the Latin-1 optimization is rather trivial in the wider
> context of PEP-393, and simply removing that part alone clearly
> wouldn't be doing anybody any favors. So for him to get what he
> wants, the entire PEP has to go.
>
> It's rather like trying to solve the problem of wealth disparity by
> forcing everyone to dump their excess wealth into the ocean.

----

Latin-1 (ISO-8859-1)? Are you sure?

>>> import sys
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ab')
27
>>> sys.getsizeof('aé')
39

Time to go to bed. More complete answer tomorrow.

jmf
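
Editorial sketch, not part of the original post: the quoted claim is about per-character storage "ignoring fixed overhead", whereas a single sys.getsizeof() call reports header and character data together. One way to separate the two, assuming CPython 3.3 or later (other implementations may store strings differently), is to difference the sizes of two strings built from the same character so that the fixed header cancels. The helper name bytes_per_char is hypothetical and chosen only for this illustration.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for strings made only of `ch`.

    Both strings use the same internal representation, so subtracting
    their sizes cancels the fixed object header and leaves only the
    cost of the n extra characters. Assumes CPython 3.3+ (PEP 393).
    """
    shorter = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // n


# One sample character from each range discussed in the thread.
samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),        # e with acute accent
    ("BMP", "\u20ac"),          # euro sign
    ("non-BMP", "\U0001F600"),  # an emoji outside the BMP
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On a PEP 393 build this should report 1 byte per character for the ASCII and Latin-1 samples, 2 for the BMP sample, and 4 for the non-BMP sample. Any remaining gap between two short strings of equal length then comes from the fixed header, which this measurement deliberately excludes.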