Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.mixmin.net!de-l.enfer-du-nord.net!feeder1.enfer-du-nord.net!newsfeed.eweka.nl!eweka.nl!feeder3.eweka.nl!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <503088b7$0$29978$c3e8da3$5496439d@news.astraweb.com>
References: <f801e06f-f7b2-4aca-b352-66856a939746@googlegroups.com> <308df2af-abe7-4043-b199-0a39f440e0ab@googlegroups.com> <502f8a2a$0$29978$c3e8da3$5496439d@news.astraweb.com> <d575737d-c1e3-47db-9c7b-10fe0300cba7@googlegroups.com> <mailman.3457.1345305136.4697.python-list@python.org> <503088b7$0$29978$c3e8da3$5496439d@news.astraweb.com>
From: Ian Kelly <ian.g.kelly@gmail.com>
Date: Sun, 19 Aug 2012 11:50:12 -0600
Subject: Re: How do I display unicode value stored in a string variable using ord()
To: "Steven D'Aprano" <steve+comp.lang.python@pearwood.info>
Content-Type: text/plain; charset=ISO-8859-1
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3513.1345398650.4697.python-list@python.org>
Lines: 53
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:27407

On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
>> There is some additional benefit for Latin-1 users, but this has nothing
>> to do with Python.  If Python is going to have the option of a 1-byte
>> representation (and as long as we have the flexible representation, I
>> can see no reason not to),
>
> The PEP explicitly states that it only uses a 1-byte format for ASCII
> strings, not Latin-1:

I think you misunderstand the PEP then, because that is empirically false.

Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC
v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329

The constructed string contains all 256 Latin-1 characters, so if
non-ASCII Latin-1 strings had to be stored in the 2-byte format, the
size would be at least 512 bytes plus the object header.  It is only
329, so I think it must be using the 1-byte representation.
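
A rough way to check this without relying on one absolute size is to
compare two strings of different lengths built from the same character,
so that the fixed object header cancels out.  The sketch below assumes
CPython 3.3 or later; bytes_per_char is a name I made up for
illustration, and sys.getsizeof results are an implementation detail.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of `ch`.

    Subtracting the sizes of two strings of different lengths cancels
    the fixed object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),        # non-ASCII, but below U+0100
    ("BMP", "\u20ac"),          # above U+00FF
    ("astral", "\U0001F600"),   # above U+FFFF
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

If the flexible representation works the way I describe, the first two
lines should report the same per-character cost, and only the BMP and
astral samples should cost more.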


> "ASCII-only Unicode strings will again use only one byte per character"

This says nothing one way or the other about non-ASCII Latin-1 strings.

> "If the maximum character is less than 128, they use the PyASCIIObject
> structure"

Note that this only describes the structure of "compact" string
objects, which I have to admit I do not fully understand from the PEP.
The wording suggests that ASCII strings use only the PyASCIIObject
structure, not the derived structures.  It then says that for compact ASCII
strings "the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data."  But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.

> and:
>
> "The data and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient)."

This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory.  It does not imply
that the 1-byte representation is not used for Latin-1, only that it
cannot also share memory with the utf8 pointer.
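
One way to see the difference between the two cases from Python is to
compare the fixed overhead of an ASCII string with that of a non-ASCII
Latin-1 string of the same length.  This is only a sketch, assuming
CPython 3.3 or later; the variable names are mine, and the exact numbers
vary by version and platform.

```python
import sys

ascii_s = "a" * 100        # pure ASCII
latin1_s = "\xe9" * 100    # non-ASCII, but still within Latin-1

# Both strings store one byte per character, so subtracting the length
# leaves the fixed per-object overhead (header plus terminating null).
ascii_overhead = sys.getsizeof(ascii_s) - len(ascii_s)
latin1_overhead = sys.getsizeof(latin1_s) - len(latin1_s)

print("ASCII overhead:  ", ascii_overhead)
print("Latin-1 overhead:", latin1_overhead)
```

A larger overhead for the Latin-1 string would be consistent with it
carrying the extra utf8 fields while still storing its characters in
the 1-byte representation.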