From: Chris Angelico
Date: Wed, 4 Jun 2014 20:44:09 +1000
Subject: Re: Unicode and Python - how often do you index strings?
Newsgroups: comp.lang.python

On Wed, Jun 4, 2014 at 8:10 PM, Peter Otten <__peter__@web.de> wrote:
> The indices used for slicing typically don't come out of nowhere. A simple
> example would be
>
> def strip_prefix(text, prefix):
>     if text.startswith(prefix):
>         text = text[len(prefix):]
>     return text
>
> If both prefix and text use UTF-8 internally the byte offset is already
> known. The question is then how we can preserve that information.

Almost completely useless. First off, it solves only the problem of
operating on the string at exactly the point where you just got an
index; and secondly, you don't always get that index from a string
method. Suppose, for instance, that you iterate over a string thus:

    for i, ch in enumerate(string):
        if ch == '{':
            start = i
        elif ch == '}':
            return string[start:i+1]

Okay, so this could be done by searching, but for something more
complicated, I can imagine it being better to enumerate. (But "I can
imagine" is much weaker than "Here's code that we use in production",
which is why I asked the question.)

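For anyone who wants to run the loop above, here is a minimal sketch that wraps it in a function, alongside the "by searching" alternative mentioned in the same paragraph. The names first_braced and first_braced_by_search are hypothetical, and the handling of a stray '}' before any '{' (it is skipped) is an assumption rather than something the thread specifies.

```python
def first_braced(string):
    """Return the first '{...}' span found by enumerating, or None."""
    start = None
    for i, ch in enumerate(string):
        if ch == '{':
            start = i  # remember the most recent opening brace
        elif ch == '}' and start is not None:
            # i + 1 turns the exclusive slice end into an inclusive one
            return string[start:i + 1]
    return None


def first_braced_by_search(string):
    """Same result using str.find and str.rfind instead of enumerate."""
    first_open = string.find('{')
    if first_open == -1:
        return None
    end = string.find('}', first_open)
    if end == -1:
        return None
    # The enumerate version keeps the *latest* '{' seen before the '}'.
    start = string.rfind('{', first_open, end)
    return string[start:end + 1]
```

Both versions index by code point, so they behave the same on non-ASCII text such as 'é{ü}'; whether the index is cheap to use is exactly the internal-representation question being discussed.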
Incidentally, the above code highlights the first problem too. With
direct indexing, you can ask for inclusive or exclusive slicing by
adding or subtracting one from the index. If you do that with a
byte-position-retaining special integer, you lose the byte position.

ChrisA
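
The point about losing the byte position can be illustrated with a small sketch. Everything here is hypothetical: ByteIndex and find_with_offset are made-up names standing in for the "byte-position-retaining special integer" idea, not an existing or proposed API. The only thing relied on is that arithmetic on an int subclass returns a plain int unless the subclass overrides the operators.

```python
class ByteIndex(int):
    """Hypothetical code-point index that also carries a UTF-8 byte offset."""

    def __new__(cls, char_index, byte_offset):
        self = super().__new__(cls, char_index)
        self.byte_offset = byte_offset
        return self


def find_with_offset(text, sub):
    """Like str.find, but return a ByteIndex on success (hypothetical helper)."""
    i = text.find(sub)
    if i == -1:
        return -1
    # Compute the byte offset the slow way, purely for illustration.
    return ByteIndex(i, len(text[:i].encode('utf-8')))


text = 'naïve {x}'
start = find_with_offset(text, '{')
end = find_with_offset(text, '}')

# Adding one for an inclusive slice goes through int.__add__,
# which returns a plain int: the byte offset is gone.
inclusive_end = end + 1
has_offset = hasattr(inclusive_end, 'byte_offset')
```

Here start and end still carry byte_offset, but inclusive_end is an ordinary int, so a slice such as text[start:inclusive_end] would have to fall back to whatever the non-annotated path costs. Overriding the arithmetic operators would not obviously help, since the subclass cannot know the byte width of the next character without consulting the string.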