From: Terry Reedy
Newsgroups: comp.lang.python
Subject: Re: Flexible string representation, unicode, typography, ...
Date: Thu, 30 Aug 2012 16:44:32 -0400

On 8/30/2012 12:00 PM, Steven D'Aprano wrote:
> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
>
>> In article <503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com>,
>> Steven D'Aprano wrote:
>>
>>> The only thing which is innovative here is that instead of the Python
>>> compiler declaring that "all strings will be stored in UCS-2", the
>>> compiler chooses an implementation for each string as needed. So some
>>> strings will be stored internally as UCS-4, some as UCS-2, and some as
>>> ASCII (which is a standard, but not the Unicode consortium's standard).
>>
>> Is the implementation smart enough to know that x == y is always False
>> if x and y are using different internal representations?

Yes, after checking lengths, and in the same circumstances, x != y is
True. From
http://hg.python.org/cpython/file/ab6ab44921b2/Objects/unicodeobject.c

    PyObject *
    PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
    {
        int result;

        if (PyUnicode_Check(left) && PyUnicode_Check(right)) {
            PyObject *v;
            if (PyUnicode_READY(left) == -1 ||
                PyUnicode_READY(right) == -1)
                return NULL;
            if (PyUnicode_GET_LENGTH(left) != PyUnicode_GET_LENGTH(right) ||
                PyUnicode_KIND(left) != PyUnicode_KIND(right)) {
                if (op == Py_EQ) {
                    Py_INCREF(Py_False);
                    return Py_False;
                }
                if (op == Py_NE) {
                    Py_INCREF(Py_True);
                    return Py_True;
                }
            }
    ...

KIND is 1, 2, or 4 bytes/char.

'a in s' is also False if the chars of a are wider than the chars of s.

If s is all ASCII, s.encode('ascii') or s.encode('utf-8') is a fast,
constant time operation, as I showed earlier in this discussion. This is
one thing that is much faster in 3.3.

Such things can be tested by timing with different lengths of strings,
where the initial string creation is done in setup code rather than in
the repeated operation code.

> But x and y are not necessarily always False just because they have
> different representations. There may be circumstances where two strings
> have different internal representations even though their content is the
> same, so it's an unsafe optimization to automatically treat them as
> unequal.

I am sure that str objects are always in canonical form once visible to
Python code. Note that unready (non-canonical) objects are rejected by
the rich comparison function.
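
A rough Python sketch of the two points above, for anyone who wants to
poke at this from the interpreter. It is not taken from the CPython
sources. The helpers kind(), may_be_equal() and encode_time() are
hypothetical names made up for illustration, and kind() only assumes the
PEP 393 rule that a str is stored in the narrowest of the 1, 2 or 4
byte forms that fits its largest code point. Under that assumption,
equal content implies equal kind, which is why a kind mismatch can
safely short-circuit the comparison. The timing part builds the string
in the setup code so that only encode() is measured; whether the cost
really is independent of length is what the measurement is meant to
check.

```python
import timeit


def kind(s):
    """Hypothetical helper: bytes per char PEP 393 would choose for s."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


def may_be_equal(x, y):
    """Cheap pre-check in the spirit of the C shortcut: same length, same kind."""
    return len(x) == len(y) and kind(x) == kind(y)


def encode_time(n, number=1000):
    """Time s.encode('ascii') for an all-ASCII string of length n.

    The string is created in the setup code, so only encode() is timed.
    """
    return timeit.timeit("s.encode('ascii')",
                         setup="s = 'a' * %d" % n,
                         number=number)


# Same length, but kinds 1 and 2: the pre-check alone rules out equality.
x, y = "abc", "ab\u20ac"
print(may_be_equal(x, y), x == y)

# A wider needle cannot occur in a narrower haystack.
print(kind("\U0001F600") > kind("abc"), "\U0001F600" in "abc")

# Compare encode() timings across string lengths.
for n in (10, 1000, 100000):
    print(n, encode_time(n))
```

If the per-call time stays roughly flat as n grows, the operation is
effectively constant time on that build; if it grows with n, it is not.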

> My expectation is that the initial implementation of PEP 393 will be
> relatively unoptimized,

The initial implementation was a year ago. At least three people have
expended considerable effort improving it since, so that the slowdown
mentioned in the PEP has mostly disappeared. The things that are still
slower are somewhat balanced by things that are faster.

-- 
Terry Jan Reedy