


Re: Unicode and Python - how often do you index strings?

To python-list@python.org
From Dave Angel <davea@davea.name>
Subject Re: Unicode and Python - how often do you index strings?
Date Wed, 4 Jun 2014 11:50:49 -0500 (CDT)
Newsgroups comp.lang.python
Message-ID <mailman.10708.1401900618.18130.python-list@python.org>



Chris Angelico <rosuav@gmail.com> Wrote in message:
> On Wed, Jun 4, 2014 at 8:10 PM, Peter Otten <__peter__@web.de> wrote:
>> The indices used for slicing typically don't come out of nowhere. A simple
>> example would be
>>
>> def strip_prefix(text, prefix):
>>     if text.startswith(prefix):
>>         text = text[len(prefix):]
>>     return text
>>
>> If both prefix and text use UTF-8 internally the byte offset is already
>> known. The question is then how we can preserve that information.
> 
> Almost completely useless. First off, it solves only the problem of
> operating on the string at exactly some point where you just got an
> index; and secondly, you don't always get that index from a string
> method. Suppose, for instance, that you iterate over a string thus:
> 
> for i, ch in enumerate(string):
>     if ch=='{': start = i
>     elif ch=='}': return string[start:i+1]
> 
> Okay, so this could be done by searching, but for something more
> complicated, I can imagine it being better to enumerate. (But "I can
> imagine" is much weaker than "Here's code that we use in production",
> which is why I asked the question.)
> 
> Incidentally, the above code highlights the first problem too. With
> direct indexing, you can ask for inclusive or exclusive slicing by
> adding or subtracting one from the index. If you do that with a
> byte-position-retaining special integer, you lose the byte position.
> 
> ChrisA
> 

A string could carry two extra fields that hold the character
index and the byte offset of the most recent substring reference.
Even though the string is immutable, nothing prevents it from
having mutable internal elements whose only externally visible
effect is on performance.

So a loop that subscripts a string with successive indices would
tend to be fast even if written in a naive way.
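
A minimal sketch of that idea, for illustration only. The class
name CachedUtf8Str and the bytes_scanned counter are hypothetical,
not anything CPython does. It keeps one mutable (index, offset)
pair and walks forward from it when the next lookup lies at or
beyond the cached position.

```python
class CachedUtf8Str:
    """Hypothetical UTF-8 backed string that remembers its last lookup."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        # Mutable cache: (character index, byte offset) of the last lookup.
        self._last = (0, 0)
        self.bytes_scanned = 0  # instrumentation for this sketch only

    def _offset_of(self, index):
        if index < 0:
            raise IndexError("negative indices not supported in this sketch")
        char_i, byte_i = self._last
        if index < char_i:
            # Simplification: restart from the front instead of scanning back.
            char_i, byte_i = 0, 0
        data = self._data
        while char_i < index:
            if byte_i >= len(data):
                raise IndexError("string index out of range")
            byte_i += 1
            self.bytes_scanned += 1
            # Skip UTF-8 continuation bytes (bit pattern 10xxxxxx).
            while byte_i < len(data) and (data[byte_i] & 0xC0) == 0x80:
                byte_i += 1
                self.bytes_scanned += 1
            char_i += 1
        self._last = (char_i, byte_i)
        return byte_i

    def __getitem__(self, index):
        start = self._offset_of(index)
        data = self._data
        if start >= len(data):
            raise IndexError("string index out of range")
        end = start + 1
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end += 1
        return data[start:end].decode("utf-8")


text = "naïve café ☃"
s = CachedUtf8Str(text)
# A naive subscript loop: each lookup resumes from the previous one.
print("".join(s[i] for i in range(len(text))))
```

With the cache, the whole loop scans each byte at most once, where
an uncached lookup would rescan from the start of the string on
every subscript.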

It's also conceivable to build an array of such pairs for strings
over a threshold size. So if you had a megabyte string, there
might be 100 evenly spaced pairs, calculated when the string
object is first created.
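
Roughly what I have in mind, as a sketch under assumptions: the
class name CheckpointedUtf8Str and the spacing parameter are made
up for illustration, and it uses a fixed spacing between
checkpoints rather than a fixed count of 100 and skips the size
threshold, just to keep the code short.

```python
class CheckpointedUtf8Str:
    """Hypothetical UTF-8 string with evenly spaced index/offset pairs."""

    def __init__(self, text, spacing=4):
        self._data = text.encode("utf-8")
        self._spacing = spacing
        # _checkpoints[k] is the byte offset of character k * spacing.
        # The character index is implied by the position in the list.
        self._checkpoints = []
        char_i = 0
        for byte_i, b in enumerate(self._data):
            if (b & 0xC0) != 0x80:  # not a continuation byte: new character
                if char_i % spacing == 0:
                    self._checkpoints.append(byte_i)
                char_i += 1
        self._length = char_i

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError("string index out of range")
        # Jump to the nearest checkpoint at or before index, then walk.
        k, remaining = divmod(index, self._spacing)
        byte_i = self._checkpoints[k]
        data = self._data
        for _ in range(remaining):
            byte_i += 1
            while (data[byte_i] & 0xC0) == 0x80:
                byte_i += 1
        end = byte_i + 1
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end += 1
        return data[byte_i:end].decode("utf-8")


s = CheckpointedUtf8Str("añb☃c😀d" * 5, spacing=4)
print(s[17], len(s))
```

Any lookup then walks at most spacing - 1 characters, whatever
order the indices arrive in, at the cost of one extra pass when
the string is created.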

And naturally there can be a flag indicating that the particular
string is pure ASCII, in which case the character index and the
byte offset are the same and no lookup is needed.
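
A sketch of the flag, assuming a hypothetical FlaggedUtf8Str
class. The non-ASCII branch just decodes the whole buffer as a
stand-in for whatever slower lookup the real thing would use.

```python
class FlaggedUtf8Str:
    """Hypothetical UTF-8 string with a pure-ASCII fast path."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        # Computed once at creation. bytes.isascii() needs Python 3.7+.
        self.is_ascii = self._data.isascii()

    def __getitem__(self, index):
        if self.is_ascii:
            # One byte per character: the index is the byte offset.
            return chr(self._data[index])
        # Placeholder slow path for this sketch: decode, then index.
        return self._data.decode("utf-8")[index]


print(FlaggedUtf8Str("hello")[1], FlaggedUtf8Str("héllo")[1])
```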

Clearly the single cached pair breaks down if the code alternates
between two references at different offsets, but I think that
would be exceedingly rare.
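
To put a rough number on the breakdown, here is a toy cost model,
not a measurement. The function scan_cost is hypothetical and
assumes a one-entry cache that walks forward from the cached index
or restarts from the front when asked for an earlier one.

```python
def scan_cost(accesses):
    """Count characters walked by a one-entry cache for a list of indices."""
    cached = 0  # character index of the most recent reference
    cost = 0
    for index in accesses:
        if index >= cached:
            cost += index - cached  # walk forward from the cached position
        else:
            cost += index           # restart from the front of the string
        cached = index
    return cost


n = 1000
sequential = list(range(n))
# Alternating references, e.g. a palindrome check comparing s[i], s[n-1-i].
alternating = [x for i in range(n // 2) for x in (i, n - 1 - i)]

print(scan_cost(sequential), scan_cost(alternating))
```

In this model the sequential loop costs about one step per access,
while the alternating pattern pays for a long walk on nearly every
access, which is the case the single pair cannot help with.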

-- 
DaveA


