To: python-list@python.org
From: Peter Otten <__peter__@web.de>
Subject: Re: Unicode and Python - how often do you index strings?
Date: Wed, 04 Jun 2014 12:10:41 +0200
Organization: None
References: <CAPTjJmr4iHdaCy61w2rz-oL6FcarRzzTeEU44Fxn2Z=gS0fh-Q@mail.gmail.com> <lmmkvk$73h$1@ger.gmane.org>
Newsgroups: comp.lang.python
Message-ID: <mailman.10695.1401876662.18130.python-list@python.org>

Mark Lawrence wrote:

> On 04/06/2014 01:39, Chris Angelico wrote:
>> A current discussion regarding Python's Unicode support centres (or
>> centers, depending on how close you are to the cent[er]{2} of the
>> universe) around one critical question: Is string indexing common?
>>
>> Python strings can be indexed with integers to produce characters
>> (strings of length 1). They can also be iterated over from beginning
>> to end. Lots of operations can be built on either one of those two
>> primitives; the question is, how much can NOT be implemented
>> efficiently over iteration, and MUST use indexing? Theories are great,
>> but solid use-cases are better - ideally, examples from actual
>> production code (actual code optional).
>>
>> I know the collective experience of python-list can't fail to bring up
>> a few solid examples here :)
>>
>> Thanks in advance, all!!
>>
>> ChrisA
>>
> 
> Single characters quite often, iteration rarely if ever, slicing all the
> time, but does that last one count?

The indices used for slicing typically don't come out of nowhere. A simple 
example would be

def strip_prefix(text, prefix):
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text

If both prefix and text use UTF-8 internally, the byte offset is already
known: it is the length of the prefix in bytes. The question is then how we
can preserve that information.
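
To illustrate the point, here is a minimal sketch that emulates a UTF-8
internal representation with plain bytes objects. strip_prefix_utf8() is a
hypothetical name, not an existing API; the slice position is simply the
prefix's length in bytes, so no characters have to be counted:

```python
def strip_prefix_utf8(text: bytes, prefix: bytes) -> bytes:
    """Strip prefix from UTF-8 encoded text using byte offsets only."""
    if text.startswith(prefix):
        # len(prefix) is a byte count here. Because prefix is itself valid
        # UTF-8, the cut always lands on a character boundary of text.
        text = text[len(prefix):]
    return text


text = "αλφα123".encode("utf-8")
prefix = "αλ".encode("utf-8")  # 2 characters, 4 bytes
result = strip_prefix_utf8(text, prefix).decode("utf-8")
print(result)  # φα123
```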

The first approach that comes to mind is an int subtype:

>>> for i, c in enumerate("123αλφα"):
...     print(i, byteoffset(i), c)
... 
0 0 1
1 1 2
2 2 3
3 3 α
4 5 λ
5 7 φ
6 9 α

This would work in the strip_prefix() example, but would lead to data
corruption in most other cases unless the index is limited to a specific
string -- in which case it would no longer work with strip_prefix(), where
the offset is derived from prefix and then applied to text.
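
A rough sketch of what such an int subtype might look like, and of how it
goes wrong. ByteIndex and enumerate_utf8() are made-up names for
illustration only; the second half applies an offset taken from one string
to the bytes of a different string:

```python
class ByteIndex(int):
    """Character index that also carries a UTF-8 byte offset (hypothetical)."""

    def __new__(cls, index, byteoffset):
        self = super().__new__(cls, index)
        self.byteoffset = byteoffset
        return self


def enumerate_utf8(text):
    """Yield (ByteIndex, character) pairs, tracking the running byte offset."""
    offset = 0
    for i, c in enumerate(text):
        yield ByteIndex(i, offset), c
        offset += len(c.encode("utf-8"))


rows = [(int(i), i.byteoffset, c) for i, c in enumerate_utf8("123αλφα")]
for row in rows:
    print(*row)

# Take the index of "λ" (character 4, byte 5) from the first string ...
idx = next(i for i, c in enumerate_utf8("123αλφα") if c == "λ")

# ... and misuse its byte offset on an unrelated string. Byte 5 of "αβγδε"
# falls in the middle of "γ", so the slice is not valid UTF-8 any more.
other = "αβγδε".encode("utf-8")
try:
    other[idx.byteoffset:].decode("utf-8")
    corrupted = False
except UnicodeDecodeError:
    corrupted = True
print(corrupted)  # True
```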

So a new interface would be needed. My second try, an object with two byte 
offsets linked to a specific string:

>>> span("foobar").startswith("oob")
>>> p = span("foobar").startswith("foo")
>>> p.replace("baz")
'bazbar'
>>> p.before()
''
>>> p.after()
'bar'
>>> span("foo bar baz").find("bar").replace("spam")
'foo spam baz'

I have no idea if that could work out...
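
For what it's worth, a toy version of the span idea can be put together in
pure Python. This is only a sketch under my own assumptions: the Span class,
its private attributes, and the choice to emulate the internal UTF-8 buffer
with a bytes object are all invented here, and a real implementation would
live inside the str type and would not need to decode anything:

```python
class Span:
    """Two byte offsets tied to one UTF-8 encoded string (hypothetical)."""

    def __init__(self, data: bytes, start: int, stop: int):
        self._data = data    # the full string, as UTF-8 bytes
        self._start = start  # byte offset where the span begins
        self._stop = stop    # byte offset just past the span

    def startswith(self, prefix: str):
        """Return a Span covering prefix, or None if there is no match."""
        p = prefix.encode("utf-8")
        if self._data.startswith(p, self._start, self._stop):
            return Span(self._data, self._start, self._start + len(p))
        return None

    def find(self, sub: str):
        """Return a Span covering the first occurrence of sub, or None."""
        s = sub.encode("utf-8")
        pos = self._data.find(s, self._start, self._stop)
        if pos < 0:
            return None
        return Span(self._data, pos, pos + len(s))

    def before(self) -> str:
        return self._data[:self._start].decode("utf-8")

    def after(self) -> str:
        return self._data[self._stop:].decode("utf-8")

    def replace(self, new: str) -> str:
        """Return the whole string with the spanned part replaced by new."""
        return self.before() + new + self.after()


def span(text: str) -> Span:
    """Wrap text in a Span that covers the entire string."""
    data = text.encode("utf-8")
    return Span(data, 0, len(data))


p = span("foobar").startswith("foo")
print(p.replace("baz"))                                     # bazbar
print(span("foo bar baz").find("bar").replace("spam"))      # foo spam baz
print(span("123αλφα").find("λφ").after())                   # α
```

Because the offsets never leave the object that owns the string, they cannot
be applied to the wrong string by accident, which was the problem with the
int subtype.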