Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Newsgroups: comp.lang.python
Date: Wed, 19 Dec 2012 13:18:05 -0800 (PST)
In-Reply-To: <mailman.1068.1355941696.29569.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=178.198.163.217; posting-account=ung4FAoAAAC46zhHJ0Nsnuox7M5gDvs_
References: <2adb4a25-8ea3-441f-b8c0-ee6c87e4b19f@googlegroups.com> <kaslsb$iue$1@news.albasani.net> <CAPTjJmrLAe0i9rW6sCYkYBvpiPk2O=FHB0PgSq1dqNqh9Y7Zqg@mail.gmail.com> <mailman.1068.1355941696.29569.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: Py 3.3, unicode / upper()
From: wxjmfauth@gmail.com
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: Python <python-list@python.org>
Precedence: list
Message-ID: <mailman.1073.1355951888.29569.python-list@python.org>
Lines: 72
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:35158

On Wednesday, 19 December 2012 19:27:38 UTC+1, Ian wrote:
> On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <rosuav@gmail.com> wrote:
> > You may not be familiar with jmf. He's one of our resident trolls, and
> > he has a bee in his bonnet about PEP 393 strings, on the basis that
> > they take up more space in memory than a narrow build of Python 3.2
> > would, for a string with lots of BMP characters and one non-BMP. In
> > 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> > pairs* for non-BMP characters. This means that len() counts them
> > twice, as does string indexing/slicing. That's a major bug, especially
> > as your Python code will do different things on different platforms -
> > most Linux builds of 3.2 are "wide" builds, storing characters in four
> > bytes each.
>
> From what I've been able to discern, his actual complaint about PEP
> 393 stems from misguided moral concerns.  With PEP-393, strings that
> can be fully represented in Latin-1 can be stored in half the space
> (ignoring fixed overhead) compared to strings containing at least one
> non-Latin-1 character.  jmf thinks this optimization is unfair to
> non-English users and immoral; he wants Latin-1 strings to be treated
> exactly like non-Latin-1 strings (I don't think he actually cares
> about non-BMP strings at all; if narrow-build Unicode is good enough
> for him, then it must be good enough for everybody).  Unfortunately
> for him, the Latin-1 optimization is rather trivial in the wider
> context of PEP-393, and simply removing that part alone clearly
> wouldn't be doing anybody any favors.  So for him to get what he
> wants, the entire PEP has to go.
>
> It's rather like trying to solve the problem of wealth disparity by
> forcing everyone to dump their excess wealth into the ocean.

----

Latin-1 (ISO-8859-1)? Are you sure?

>>> import sys
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ab')
27
>>> sys.getsizeof('aé')
39
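
The figures above include the fixed per-object header, which the quoted text explicitly sets aside ("ignoring fixed overhead"). As a rough sketch only, one way to isolate the per-character cost is to difference the sizes of two strings of the same character: the header cancels out. The helper name bytes_per_char and the sample characters are made up for illustration, and the approach assumes CPython 3.3 or later; other implementations may report different sizes.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for a string made only of `ch`.

    Hypothetical helper: differencing two sizes cancels the fixed
    object header, leaving only the per-character cost.
    Assumes CPython 3.3+ (PEP 393 string representation).
    """
    short = sys.getsizeof(ch * n)
    long_ = sys.getsizeof(ch * (2 * n))
    return (long_ - short) // n

# Sample characters chosen for illustration: one per storage class.
samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),         # é
    ("BMP", "\u20ac"),           # euro sign
    ("non-BMP", "\U0001F600"),   # an emoji outside the BMP
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

The absolute numbers from sys.getsizeof depend on the build (32-bit or 64-bit) and on the Python version, so only the differences are meaningful here.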

Time to go to bed. More complete answer tomorrow.

jmf