From: Ian Kelly
Date: Sun, 19 Aug 2012 11:50:12 -0600
Subject: Re: How do I display unicode value stored in a string variable using ord()
To: "Steven D'Aprano"
Cc: python-list@python.org
Newsgroups: comp.lang.python

On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
>> There is some additional benefit for Latin-1 users, but this has nothing
>> to do with
>> Python.  If Python is going to have the option of a 1-byte
>> representation (and as long as we have the flexible representation, I
>> can see no reason not to),
>
> The PEP explicitly states that it only uses a 1-byte format for ASCII
> strings, not Latin-1:

I think you misunderstand the PEP then, because that is empirically false.

Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC
v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329

The constructed string contains all 256 Latin-1 characters, so if
Latin-1 strings must be stored in the 2-byte format, then the size
should be at least 512 bytes.  It is not, so I think it must be using
the 1-byte encoding.

> "ASCII-only Unicode strings will again use only one byte per character"

This says nothing one way or the other about non-ASCII Latin-1 strings.

> "If the maximum character is less than 128, they use the PyASCIIObject
> structure"

Note that this only describes the structure of "compact" string
objects, which I have to admit I do not fully understand from the PEP.
The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures.  It then says that for compact ASCII
strings "the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data."  But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.

> and:
>
> "The data and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient)."

This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory.
It does not imply that the 1-byte representation is not used for
Latin-1, only that a Latin-1 string's data cannot also share memory
with the utf8 pointer, since Latin-1 and UTF-8 encode characters
above 127 differently.
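
For anyone who wants to check the per-character width more directly, here
is a minimal sketch, assuming CPython 3.3 or later (where PEP 393 is in
effect).  The helper name bytes_per_char and the sample characters are my
own choices for illustration, not anything from the PEP.  It differences
the sizes of two strings of the same character so that the fixed object
header, which varies between versions and platforms, cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes.

    Subtracting the two sizes cancels the fixed object header, leaving
    only the per-character cost.  Hypothetical helper; assumes CPython
    3.3+ with the PEP 393 flexible string representation.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# One sample character from each range the thread is arguing about.
samples = {
    "ASCII": "a",             # U+0061, below 128
    "Latin-1": "\xe9",        # U+00E9, non-ASCII but below 256
    "BMP": "\u20ac",          # U+20AC, needs 2 bytes
    "astral": "\U0001F600",   # U+1F600, needs 4 bytes
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(name, width)
```

On a PEP 393 build I would expect this to report 1 for both the ASCII and
the Latin-1 sample, 2 for the BMP sample and 4 for the astral one, which
matches the 329-byte result above.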