From: Ian Kelly
Date: Fri, 29 Mar 2013 00:11:37 -0600
Subject: Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]
To: Python
Newsgroups: comp.lang.python

On Thu, Mar 28, 2013 at 8:37 PM, Steven D'Aprano wrote:
>>> I also wonder why the implementation bothers keeping a UTF-8
>>> representation. That sounds like premature optimization to me. Surely
>>> you only need it when writing to a file with UTF-8 encoding? For most
>>> strings, that will never happen.
>>
>> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
>> of content will go out in the same encoding it came in in, so it makes
>> sense to hang onto it where possible.
>
> Not to me. That almost doubles the size of the string, on the off-chance
> that you'll need the UTF-8 encoding. Which for many uses, you don't, and
> even if you do, it seems like premature optimization to keep it around
> just in case. Encoding to UTF-8 will be fast for small N, and for large
> N, why carry around (potentially) multiple megabytes of duplicated data
> just in case the encoded version is needed some time?

From the PEP:

"""
A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible (which
generates a new string object every time). APIs that implicitly
converts a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.
"""

So the utf8 representation is not populated when the string is
created, but when a utf8 representation is requested, and only when
requested by the API that returns a char*, not by the API that returns
a bytes object.
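
For anyone who wants to see this for themselves, here is a rough sketch that pokes at the two C functions through ctypes. It assumes CPython 3.3 or later (ctypes.pythonapi is CPython-specific), and it assumes that sys.getsizeof counts the cached UTF-8 buffer, which is an implementation detail rather than anything the PEP promises. The helper name utf8_cache_sizes is made up for this post.

```python
import ctypes
import sys

# CPython-only: ctypes.pythonapi exposes the interpreter's own C API.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
_as_utf8.argtypes = [ctypes.py_object]
_as_utf8.restype = ctypes.c_char_p           # char* (ctypes copies it into bytes)

_as_utf8_string = ctypes.pythonapi.PyUnicode_AsUTF8String
_as_utf8_string.argtypes = [ctypes.py_object]
_as_utf8_string.restype = ctypes.py_object   # a new bytes object on every call


def utf8_cache_sizes(n):
    """Return sys.getsizeof(s) when fresh, after the bytes API, after the char* API.

    Hypothetical helper for illustration only.
    """
    # Build the string at run time; U+00E9 takes two bytes in UTF-8.
    s = chr(0xE9) * n
    before = sys.getsizeof(s)

    encoded = _as_utf8_string(s)     # the API that returns a bytes object
    after_bytes_api = sys.getsizeof(s)

    raw = _as_utf8(s)                # the API that returns a char*
    after_charptr_api = sys.getsizeof(s)

    # Both routes must agree on the encoded data.
    assert encoded == raw == s.encode("utf-8")
    return before, after_bytes_api, after_charptr_api


if __name__ == "__main__":
    before, after_bytes, after_charptr = utf8_cache_sizes(1000)
    print("fresh string:                 ", before)
    print("after PyUnicode_AsUTF8String: ", after_bytes)
    print("after PyUnicode_AsUTF8:       ", after_charptr)
```

If the assumptions hold, the first two numbers should match and only the third should be larger. I used a non-ASCII string on purpose: for pure ASCII strings the UTF-8 form is the same as the stored data, so I would not expect any growth there.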