Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <5154590C.9030902@mrabarnett.plus.com>
References: <mailman.3703.1364248275.2939.python-list@python.org> <a52fbe9d-db14-4ed2-bb49-adfb4b56f973@k4g2000yqn.googlegroups.com> <mailman.3771.1364324590.2939.python-list@python.org> <0b779c80-4f50-4716-8c30-47755c15f304@m12g2000yqp.googlegroups.com> <kit1kg$g2u$1@ger.gmane.org> <nad-98F0A4.17004226032013@news.gmane.org> <kitdqr$4m4$2@ger.gmane.org> <nad-8CB9C0.18315026032013@news.gmane.org> <mailman.3805.1364385073.2939.python-list@python.org> <5153a12d$0$29998$c3e8da3$5496439d@news.astraweb.com> <mailman.3845.1364441182.2939.python-list@python.org> <d2cc443a-e049-42ed-abc6-66b5ea600fe7@j1g2000pbq.googlegroups.com> <mailman.3860.1364451682.2939.python-list@python.org> <987c4bd9-0e5e-4387-9c78-1075a77d3c47@c6g2000yqh.googlegroups.com> <mailman.3863.1364463394.2939.python-list@python.org> <rOednY4OeOjbqcnMnZ2dnUVZ_oWdnZ2d@westnet.com.au> <5154590C.9030902@mrabarnett.plus.com>
Date: Fri, 29 Mar 2013 02:07:45 +1100
Subject: Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3883.1364483268.2939.python-list@python.org>
Lines: 47
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:42142

On Fri, Mar 29, 2013 at 1:51 AM, MRAB <python@mrabarnett.plus.com> wrote:
> On 28/03/2013 12:11, Neil Hodgson wrote:
>>
>> Ian Foote:
>>
>>> Specifically, indexing a variable-length encoding like utf-8 is not
>>> as efficient as indexing a fixed-length encoding.
>>
>>
>> Many common string operations do not require indexing by character
>> which reduces the impact of this inefficiency. UTF-8 seems like a
>> reasonable choice for an internal representation to me. One benefit
>> of UTF-8 over Python's flexible representation is that it is, on
>> average, more compact over a wide set of samples.
>>
> Implementing the regex module (http://pypi.python.org/pypi/regex) would
> have been more difficult if the internal representation had been UTF-8,
> because of the need to decode, and the implementation would also have
> been slower for that reason.

In fact, nearly ALL string parsing operations would need to be done
differently. The only method that I can think of that wouldn't be
impacted is a linear state-machine parser - something that could be
written inside a "for character in string" loop.

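# Sample input (made up for this example); any str will do.
string = "Hello, <b>world</b>!"

text = []  # characters seen outside of tags

def initial(c):
    # Outside a tag: '<' opens one, anything else is kept.
    global state
    if c == '<': state = tag
    else: text.append(c)

def tag(c):
    # Inside a tag: drop everything up to the closing '>'.
    global state
    if c == '>': state = initial

state = initial
for character in string:
    state(character)

print(''.join(text))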


I'm pretty sure this will run in O(N) time, even with UTF-8 strings,
because it only ever steps forward one character at a time and never
indexes into the string. But it's an *extremely* simple parser.
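
To make the indexing point concrete, here is a rough sketch of what the same parser might look like over raw UTF-8 bytes. This is not how CPython stores strings; iter_utf8, char_at and strip_tags are hypothetical helpers written for this illustration, and the decoder is hand-rolled only to show where the cost goes. The forward pass touches each byte once, whereas looking up the Nth character has to walk from the start.

```python
def iter_utf8(data: bytes):
    """Yield characters from UTF-8 bytes in one forward pass (hypothetical helper)."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # The lead byte alone tells us how many bytes this character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        yield data[i:i + width].decode("utf-8")
        i += width


def char_at(data: bytes, index: int) -> str:
    """Character lookup by position: has to scan from the start, so O(index)."""
    for pos, ch in enumerate(iter_utf8(data)):
        if pos == index:
            return ch
    raise IndexError(index)


def strip_tags(data: bytes) -> str:
    """The same two-state parser as above, driven by the forward iterator: O(N)."""
    text = []
    in_tag = False
    for ch in iter_utf8(data):
        if in_tag:
            if ch == ">":
                in_tag = False
        elif ch == "<":
            in_tag = True
        else:
            text.append(ch)
    return "".join(text)


# Made-up sample mixing 1-, 2-, 3- and 4-byte characters.
sample = "na\u00efve <b>\u65e5\u672c\u8a9e</b> \U0001f600".encode("utf-8")
stripped = strip_tags(sample)
third = char_at(sample, 2)
```

Anything that wants to jump around by character position (slicing, backtracking to a saved character offset, and so on) would end up calling something like char_at repeatedly, and that is where a variable-width representation starts to hurt.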

ChrisA