From: Ian Kelly
Date: Fri, 31 Aug 2012 09:13:40 -0600
Subject: Re: Flexible string representation, unicode, typography, ...
Newsgroups: comp.lang.python

On Fri, Aug 31, 2012 at 6:32 AM, Steven D'Aprano wrote:
> That's one thing that I'm unclear about -- under what circumstances will
> a string be in compact versus non-compact form?

I understand it to be entirely dependent on which API is used to
construct. The legacy API generates legacy strings, and the new API
generates compact strings. From the comments in unicodeobject.h:

/* ASCII-only strings created through PyUnicode_New use the PyASCIIObject
   structure.
   state.ascii and state.compact are set, and the data immediately follow
   the structure. utf8_length and wstr_length can be found in the length
   field; the utf8 pointer is equal to the data pointer. */
...
   Legacy strings are created by PyUnicode_FromUnicode() and
   PyUnicode_FromStringAndSize(NULL, size) functions. They become ready
   when PyUnicode_READY() is called.
...
/* Non-ASCII strings allocated through PyUnicode_New use the
   PyCompactUnicodeObject structure. state.compact is set, and the data
   immediately follow the structure. */

Since I'm not sure that this is clear, note that compact vs. legacy
does not describe which character width is used (except that
PyASCIIObject strings are always 1 byte wide). Legacy and compact
strings can each use the 1, 2, or 4 byte representations. "Compact"
merely denotes that the character data is stored inline with the
struct (as opposed to being stored somewhere else and pointed at by
the struct), not the relative size of the string data. Again from the
comments:

   Compact strings use only one memory block (structure + characters),
   whereas legacy strings use one block for the structure and one block
   for characters.

Cheers,
Ian
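
Addendum: the point that character width is separate from compact vs. legacy can be checked roughly from Python itself. The sketch below is an illustration, not something taken from unicodeobject.h. It assumes CPython 3.3 or later, where sys.getsizeof reports the string's own allocation, and bytes_per_char and fixed_overhead are hypothetical helper names made up for this example. Python code cannot see whether a string is compact or legacy, so the sketch only demonstrates the 1, 2, and 4 byte widths and the smaller header of ASCII-only strings.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate storage per character by comparing two string sizes.

    Hypothetical helper: the fixed header cancels out in the
    difference, leaving only the per-character cost. The result is
    specific to CPython's sys.getsizeof.
    """
    short = ch * n
    long_ = ch * (2 * n)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n


def fixed_overhead(ch, n=100):
    """Estimate the bytes of a string that are not character data.

    Hypothetical helper: total size minus the estimated character
    storage (this includes the struct and the trailing terminator).
    """
    s = ch * n
    return sys.getsizeof(s) - n * bytes_per_char(ch, n)


if __name__ == "__main__":
    samples = [
        ("ASCII", "a"),
        ("Latin-1", "\xe9"),
        ("BMP", "\u20ac"),
        ("astral", "\U0001F600"),
    ]
    for label, ch in samples:
        print(label, bytes_per_char(ch), "byte(s) per character,",
              fixed_overhead(ch), "bytes of overhead")
```

On such a build, the per-character figure should follow the widest code point in the string, and the ASCII-only string should show the smallest overhead, consistent with PyASCIIObject being the smaller structure.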