From: Ian Kelly
Date: Fri, 29 Mar 2013 00:22:08 -0600
Subject: Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]
Newsgroups: comp.lang.python

On Fri, Mar 29, 2013 at 12:11 AM, Ian Kelly wrote:
> From the PEP:
>
> """
> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
> representation. It is thus identical to the existing
> _PyUnicode_AsString, which is removed. The function will compute the
> utf8 representation when first called. Since this representation will
> consume memory until the string object is released, applications
> should use the existing PyUnicode_AsUTF8String where possible (which
> generates a new string object every time). APIs that implicitly
> converts a string to a char* (such as the ParseTuple functions) will
> use PyUnicode_AsUTF8 to compute a conversion.
> """
>
> So the utf8 representation is not populated when the string is
> created, but when a utf8 representation is requested, and only when
> requested by the API that returns a char*, not by the API that returns
> a bytes object.

Since the PEP specifically mentions ParseTuple string conversion, I
think that is probably the motivation for caching it. A string that is
passed into a C function (one that uses any of the UTF-8 char* format
specifiers) is perhaps likely to be passed into that function again at
some point, so the UTF-8 representation is kept around to avoid the
need to recompute it on each call.
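
A rough way to observe this from Python is sketched below. It assumes CPython 3.3 or later and that ctypes.pythonapi exposes both C functions; the variable names are only illustrative. Pointer identity across calls is an implementation detail of the cache rather than a documented guarantee, so treat this as an experiment, not a specification.

```python
import ctypes

api = ctypes.pythonapi

# const char *PyUnicode_AsUTF8(PyObject *unicode)
# Returned as a raw address so two calls can be compared.
api.PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8.restype = ctypes.c_void_p

# PyObject *PyUnicode_AsUTF8String(PyObject *unicode)
# Returns a new bytes object (a new reference) on each call.
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8String.restype = ctypes.py_object

# Non-ASCII text, so the UTF-8 form differs from the internal storage.
text = "na\u00efve caf\u00e9 \u2603"

# The char* API: the first call computes the UTF-8 form and stores it
# on the string object; the second call should hand back that buffer.
ptr_first = api.PyUnicode_AsUTF8(text)
ptr_second = api.PyUnicode_AsUTF8(text)
same_buffer = ptr_first == ptr_second

# The cached buffer is NUL-terminated, so string_at can read it back.
cached_bytes = ctypes.string_at(ptr_first)

# The bytes API: each call encodes again and builds a separate object.
bytes_first = api.PyUnicode_AsUTF8String(text)
bytes_second = api.PyUnicode_AsUTF8String(text)
distinct_objects = bytes_first is not bytes_second

print("char* calls share one buffer:", same_buffer)
print("bytes calls give distinct objects:", distinct_objects)
```

The cached buffer lives as long as the string object does, which is the memory cost the PEP warns about; the bytes objects are freed independently once nothing references them.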