Date: Thu, 27 Sep 2012 03:04:50 +1000
Subject: Re: Article on the future of Python
From: Chris Angelico
To: python-list@python.org
Newsgroups: comp.lang.python

On Thu, Sep 27, 2012 at 2:52 AM, Paul Rubin wrote:
> Chris Angelico writes:
>> When you compare against a wide build, semantics of 3.2 and 3.3 are
>> identical, and then - and ONLY then - can you sanely compare
>> performance. And 3.3 stacks up much better.
>
> I like to have seen real world benchmarks against a pure UTF-8
> implementation. That means O(n) access to the n'th character of a
> string which could theoretically slow some programs down terribly, but I
> wonder how often that actually matters in ways that can't easily be
> worked around.

That's pretty much what we have with the PHP parts of our web site.
We've decreed that everything should be UTF-8 byte streams (actually,
it took some major campaigning from me to get rid of the underlying
thinking that "binary-safe" and "UTF-8" and "characters" and so on
were all equivalent), but there are very few places where we actually
index strings in PHP.
There's a small amount of parsing, but it's all done by splitting on
particular strings - if you search for 0x0A in a UTF-8 byte stream and
split at that index, it's the same as searching for U+000A in a
Unicode string and splitting there - and all of our structural
elements fit inside ASCII.

The few times we actually care about character length (e.g. limiting
user-specified rule names to N characters), we don't much care about
performance, because they're unusual checks.

So, I don't actually have any stats for you, because it's really easy
to just not index strings at all.

ChrisA
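
A minimal sketch of the two points above, in Python rather than PHP. The sample text, the rule name and the helper `rule_name_ok` are all made up for illustration and are not taken from the site's actual code. It checks that splitting UTF-8 bytes on 0x0A gives the same pieces as splitting the decoded string on U+000A, and shows a length check that counts code points rather than bytes:

```python
# Sample text with non-ASCII characters around the newlines
# (escapes are used so the example does not depend on file encoding).
text = "na\u00efve caf\u00e9\n\u03c0 \u2248 3.14159\n\u65e5\u672c\u8a9e"
data = text.encode("utf-8")

# Split the raw UTF-8 bytes on 0x0A, then decode each piece.
from_bytes = [part.decode("utf-8") for part in data.split(b"\n")]

# Split the Unicode string on U+000A.
from_text = text.split("\n")

# The pieces match: in UTF-8 every byte of a multi-byte sequence is
# >= 0x80, so an ASCII byte such as 0x0A never appears inside one.
assert from_bytes == from_text


def rule_name_ok(raw: bytes, limit: int) -> bool:
    """Hypothetical check: is the UTF-8 encoded name at most `limit`
    code points long? Decoding is O(n), which is fine for a rare check."""
    return len(raw.decode("utf-8")) <= limit


# Byte length and code point length differ once non-ASCII appears.
name = "r\u00e8gle-d'\u00e9t\u00e9".encode("utf-8")
byte_len = len(name)
char_len = len(name.decode("utf-8"))
print(byte_len, char_len)
```

Note that "characters" here means code points; combining sequences would need further handling if the limit is meant to count what a user sees.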