Newsgroups: comp.lang.python
From: Chris Angelico
To: python-list@python.org
Date: Fri, 29 Mar 2013 11:54:41 +1100
Subject: Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano wrote:
> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
> strings.
> It's only strings in the SMPs that could need surrogate pairs,
> and they don't need them in Python's implementation since it's a full
> 32-bit implementation. So where do the surrogate pairs come into this?

PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...

> I also wonder why the implementation bothers keeping a UTF-8
> representation. That sounds like premature optimization to me. Surely you
> only need it when writing to a file with UTF-8 encoding? For most
> strings, that will never happen.

... the UTF-8 version. It'll keep it if it has it, and not else. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?

Minor nitpick, btw:
> (in which cast wstr_length differs form length)
Should be "in which case" and "from". Who has the power to correct
typos in PEPs?

ChrisA
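
For illustration, here is a minimal Python sketch of the two points discussed above. It is not taken from PEP 393 and does not touch the real C struct. The 16-bit wchar_t form is only simulated by encoding to UTF-16, and the helper names utf16_code_units and c_string_view are made up for this example. The second half rests on an assumption: since the quoted text lists a separate utf8_length field, the length presumably does not have to be recovered from the terminator, so an embedded \0 would only trip up C code that relies on the terminator alone.

```python
import ctypes


def utf16_code_units(s):
    """Count 16-bit code units, as a 16-bit wchar_t form would need.

    Hypothetical helper: it encodes to UTF-16 (little-endian, no BOM)
    and divides the byte count by two.
    """
    return len(s.encode("utf-16-le")) // 2


def c_string_view(data):
    """Return what terminator-based C code would see in a byte buffer.

    Hypothetical helper: ctypes.c_char_p(...).value reads up to the
    first NUL byte, like strlen() would.
    """
    return ctypes.c_char_p(data).value


# One character outside the BMP, between two ASCII letters.
astral = "a\U0001F600b"
print(len(astral))               # code points (Python 3.3+)
print(utf16_code_units(astral))  # code units, surrogate pair included

# A string with an embedded NUL character.
embedded = "a\0b"
utf8 = embedded.encode("utf-8")
print(len(utf8))            # explicit length still covers every byte
print(c_string_view(utf8))  # terminator-based view stops early
```

With a BMP-only string the two counts in the first half agree, which matches the PEP's statement that wstr_length differs from length only when surrogate pairs are present.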