Path: csiph.com!usenet.pasdenom.info!goblin2!goblin.stu.neva.ru!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <l4olqp$jdn$1@ger.gmane.org>
References: <mailman.1604.1382818293.18130.python-list@python.org> <526c412a$0$29972$c3e8da3$5496439d@news.astraweb.com> <mailman.1628.1382838024.18130.python-list@python.org> <pan.2013.10.27.03.21.57.202000@nowhere.com> <d205042e-29cd-49df-9f6e-600e123f8483@googlegroups.com> <526f4612$0$6512$c3e8da3$5496439d@news.astraweb.com> <63fa9fcd-6445-41ee-8873-e1ee046e2031@googlegroups.com> <l4olqp$jdn$1@ger.gmane.org>
Date: Wed, 30 Oct 2013 13:17:21 +1100
Subject: Re: trying to strip out non ascii.. or rather convert non ascii
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1787.1383099445.18130.python-list@python.org>
Lines: 40
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:57995

On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> You've stated above that logically unicode is badly handled by the fsr.  You
> then provide a trivial timing example.  WTF???

His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be faster in some
areas rather than having even timings all along. But the FSR actually
has some distinct benefits even in the areas he's citing - watch this:

>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.3625614428649451
>>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
0.6753936603674484
>>> timeit.timeit("a = 'hundred'; 'ģ' in a")
0.25663261671525106
>>> timeit.timeit("a = 'hundreĳ'; 'ģ' in a")
0.3582399439035271

The first two examples are his examples done on my computer, so you
can see how all four figures compare. Note how testing for the
presence of a non-Latin1 character in an 8-bit string is very fast.
Same goes for testing for a non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:

>>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
10.083378194714726
>>> timeit.timeit("a = 'hundreĳ'*1000; 'x' in a")
18.656413035735
>>> timeit.timeit("a = 'hundreĳ'*1000; 'ģ' in a")
18.436268855399135
>>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
2.8308718007456264
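
Why that last figure can be so low: if the character being searched for
needs a wider representation than anything in the string, it can't
possibly be there, so no scan is needed. Here's a rough pure-Python model
of that idea. It is only a sketch, not CPython's actual code; the names
storage_width and contains_with_shortcut are made up for illustration, and
the 1/2/4-byte widths are the ones PEP 393 describes.

```python
def storage_width(s):
    """Bytes per code point that a PEP 393 style string would need: 1, 2 or 4."""
    # Hypothetical helper: the width is decided by the highest code point present.
    top = max(map(ord, s), default=0)
    if top < 0x100:        # fits in Latin-1
        return 1
    if top < 0x10000:      # fits in the BMP
        return 2
    return 4               # needs the full 32 bits


def contains_with_shortcut(haystack, needle):
    """Model of a width-based early exit before the real search."""
    # A needle that needs wider storage than the haystack cannot occur in it,
    # so the answer is known without looking at a single character.
    if storage_width(needle) > storage_width(haystack):
        return False
    return needle in haystack


print(storage_width('hundred'))                            # 8-bit string
print(storage_width('hundre\u0133'))                       # 16-bit string
print(contains_with_shortcut('hundred' * 1000, '\u0123'))  # early exit
```

In pure Python this model is of course slower than the built-in "in",
since storage_width itself walks the string; a real implementation would
already have the width stored with the string object.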

Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!

ChrisA