To: python-list@python.org
From: Peter Otten <__peter__@web.de>
Subject: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
Date: Sun, 19 Aug 2012 09:43:13 +0200
Newsgroups: comp.lang.python
Message-ID: <mailman.3485.1345362201.4697.python-list@python.org>

Steven D'Aprano wrote:

> On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
> 
>> "a" will be stored as 1 byte/codepoint.
>> 
>> Adding "é", it will still be stored as 1 byte/codepoint.
> 
> Wrong. It will be 2 bytes, just like it already is in Python 3.2.
> 
> I don't know where people are getting this myth that PEP 393 uses Latin-1
> internally, it does not. Read the PEP, it explicitly states that 1-byte
> formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun  9 2012, 08:57:51) 
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101)-sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101)-sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101)-sys.getsizeof("€")
200

I infer that

(1) both ASCII and Latin-1 strings require one byte per character, while a 
character outside Latin-1 such as "€" requires two.
(2) non-ASCII Latin-1 strings have a constant overhead of 24 bytes (on a 
64-bit system) over ASCII-only strings.
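
To repeat the measurement without typing each expression by hand, something like the following should work. It is only a sketch: the helper names bytes_per_char and fixed_overhead are made up for this post, and it assumes CPython 3.3 or later, where sys.getsizeof() reports the PEP 393 representation. The fixed overhead differs between versions and platforms, so the script prints it instead of assuming a value.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate storage per code point by growing a string of `ch` by n characters."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n


def fixed_overhead(ch):
    """Size of a one-character string minus the storage for that one character.

    Note: this still includes the trailing NUL that CPython stores after the data.
    """
    return sys.getsizeof(ch) - bytes_per_char(ch)


if __name__ == "__main__":
    # "e" is ASCII, "é" is Latin-1 but not ASCII, "€" lies outside Latin-1.
    for ch in ("e", "é", "€"):
        print(repr(ch),
              "bytes/char:", bytes_per_char(ch),
              "fixed overhead:", fixed_overhead(ch))
```

The difference between the "e" and "é" rows in the last column is the constant overhead from (2); the bytes/char column corresponds to (1).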