From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Unicode and Python - how often do you index strings?
Date: Wed, 4 Jun 2014 12:13:45 +1000

On Wed, Jun 4, 2014 at 11:18 AM, Roy Smith wrote:
> In article ,
> Chris Angelico wrote:
>
>> A current discussion regarding Python's Unicode support centres (or
>> centers, depending on how close you are to the cent[er]{2} of the
>> universe)
>
> Um, you mean cent(er|re), don't you? The
> pattern you wrote also matches centee and centrr.

Maybe there's someone who spells it that way! Let's not be excluding
people. That'd be rude.

>> around one critical question: Is string indexing common?
>
> Not in our code. I've got 80008 non-blank lines of Python (2.7) source
> handy. I tried a few heuristics to find patterns which might be string
> indexing.
>
> $ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'
>
> and then looked them over manually. I see this pattern a bunch of times
> (in a single-use script):
>
> data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

Slicing is a form of indexing too, although in this case (slicing from
the front) it could be implemented on top of UTF-8 without much
problem.

> withhyphen = number if '-' in number else (number[:-2] + '-' +
> number[-2:]) # big assumption here

This *definitely* counts; if strings were represented internally in
UTF-8, this would involve two scans (although a smart implementation
could probably count backward rather than forward).

By the way, any time you slice up to the third from the end, you win
two extra awesome points, just for putting [:-3] into your code and
having it mean something. But I digress.

> Anyway, there's a bunch more, but the bottom line is that in our code,
> indexing into a string (at least explicitly in application source code)
> is a pretty rare thing.

Thanks. Of course, the pattern you searched for is looking only for
literals; it's a bit harder to find cases where the index (or slice
position) comes from a variable or expression, and those situations
are also rather harder to optimize (the MD5 prefix is clearly better
scanned from the front, the number tail is clearly better scanned from
the back - but with a variable?).

ChrisA
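
A minimal sketch of the front-versus-back scanning point above, assuming
Python 3 and a purely hypothetical UTF-8 internal representation. The
helper names are made up for illustration, and this is not how CPython
actually stores strings. It relies only on the fact that UTF-8
continuation bytes have the bit pattern 10xxxxxx, so code point
boundaries can be counted from either end of the buffer.

```python
def is_continuation(byte):
    """True if byte is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return byte & 0xC0 == 0x80


def offset_from_front(data, n):
    """Byte offset at which code point number n (0-based) starts.

    Scans forward from the start; n equal to the number of code points
    yields len(data). Hypothetical helper, for illustration only.
    """
    seen = 0
    for pos, byte in enumerate(data):
        if not is_continuation(byte):
            if seen == n:
                return pos
            seen += 1
    if seen == n:
        return len(data)
    raise IndexError("fewer than n code points")


def offset_from_back(data, n):
    """Byte offset at which the last n code points start.

    Scans backward from the end, so a short tail stays cheap even on a
    long string. Hypothetical helper, for illustration only.
    """
    if n == 0:
        return len(data)
    seen = 0
    for pos in range(len(data) - 1, -1, -1):
        if not is_continuation(data[pos]):
            seen += 1
            if seen == n:
                return pos
    raise IndexError("fewer than n code points")


# A short prefix, like the [:4] on the MD5 digest, is one forward scan.
text = "naïve-café"
data = text.encode("utf-8")
prefix = data[:offset_from_front(data, 4)].decode("utf-8")

# A short tail, like number[-2:], is one backward scan.
tail = data[offset_from_back(data, 2):].decode("utf-8")
```

With a literal index the better direction can be picked up front. With
a variable index it has to be decided at run time, for example by
comparing the index against some estimate of the string's length.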