Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!goblin2!goblin.stu.neva.ru!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Newsgroups: comp.lang.python
Date: Tue, 21 Aug 2012 10:16:06 -0700 (PDT)
In-Reply-To: <mailman.3594.1345535550.4697.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=81.62.84.157; posting-account=ung4FAoAAAC46zhHJ0Nsnuox7M5gDvs_
References: <f801e06f-f7b2-4aca-b352-66856a939746@googlegroups.com> <308df2af-abe7-4043-b199-0a39f440e0ab@googlegroups.com> <502f8a2a$0$29978$c3e8da3$5496439d@news.astraweb.com> <7xehn4vyya.fsf@ruckus.brouhaha.com> <5030832d$0$29978$c3e8da3$5496439d@news.astraweb.com> <7x8vdbmho6.fsf@ruckus.brouhaha.com> <mailman.3511.1345397678.4697.python-list@python.org> <7xfw7ilqnd.fsf@ruckus.brouhaha.com> <50314968$0$29978$c3e8da3$5496439d@news.astraweb.com> <7xwr0ua1pw.fsf@ruckus.brouhaha.com> <mailman.3547.1345451405.4697.python-list@python.org> <f0878bc4-539b-4570-a138-ea413e6d9fbb@googlegroups.com> <mailman.3594.1345535550.4697.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: Abuse of subject, was Re: Abuse of Big Oh notation
From: wxjmfauth@gmail.com
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Cc: python-list@python.org
Precedence: list
Message-ID: <mailman.3609.1345569376.4697.python-list@python.org>
Lines: 125
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:27579

Le mardi 21 ao=FBt 2012 09:52:09 UTC+2, Peter Otten a =E9crit=A0:
> wxjmfauth@gmail.com wrote:
>=20
>=20
>=20
> > By chance and luckily, first attempt.
>=20
> =20
>=20
> > c:\python32\python -m timeit "('=80'*100+'=80'*100).replace('=80'
>=20
> > , '=9C')"
>=20
> > 1000000 loops, best of 3: 1.48 usec per loop
>=20
> > c:\python33\python -m timeit "('=80'*100+'=80'*100).replace('=80'
>=20
> > , '=9C')"
>=20
> > 100000 loops, best of 3: 7.62 usec per loop
>=20
>=20
>=20
> OK, that is roughly factor 5. Let's see what I get:
>=20
>=20
>=20
> $ python3.2 -m timeit '("=80"*100+"=80"*100).replace("=80", "=9C")'
>=20
> 100000 loops, best of 3: 1.8 usec per loop
>=20
> $ python3.3 -m timeit '("=80"*100+"=80"*100).replace("=80", "=9C")'
>=20
> 10000 loops, best of 3: 9.11 usec per loop
>=20
>=20
>=20
> That is factor 5, too. So I can replicate your measurement on an AMD64 Li=
nux=20
>=20
> system with self-built 3.3 versus system 3.2.
>=20
>=20
>=20
> > Note
>=20
> > The used characters are not members of the latin-1 coding
>=20
> > scheme (btw an *unusable* coding).
>=20
> > They are however charaters in cp1252 and mac-roman.
>=20
>=20
>=20
> You seem to imply that the slowdown is connected to the inability of lati=
n-1=20
>=20
> to encode "=9C" and "=80" (to take the examples relevant to the above=20
>=20
> microbench). So let's repeat with latin-1 characters:
>=20
>=20
>=20
> $ python3.2 -m timeit '("=E4"*100+"=E4"*100).replace("=E4", "=DF")'
>=20
> 100000 loops, best of 3: 1.76 usec per loop
>=20
> $ python3.3 -m timeit '("=E4"*100+"=E4"*100).replace("=E4", "=DF")'
>=20
> 10000 loops, best of 3: 10.3 usec per loop
>=20
>=20
>=20
> Hm, the slowdown is even a tad bigger. So we can safely dismiss your theo=
ry=20
>=20
> that an unfortunate choice of the 8 bit encoding is causing it. Do you=20
>=20
> agree?

- I do not care too much about the numbers. It's
an attempt to show the principles.

- The fact, considering latin-1 as a bad coding,
lies on the point that is is simply unsuable
for some scripts / languages. It has mainly to do
with source/text files coding. This is not really
the point here.

- Now, the technical aspect. This "coding" (latin-1)
may be considered somehow as the pseudo-coding covering
the unicode code points range 128..255. Unfortunatelly,
this "coding" is not very optimal (or can be see as) when
you work with a full range of Unicode, but is is fine
when one works only in pure latin-1, with only 256
characters.
This range 128..255 is always the critical part
(all codings considered). And probably represents
the most used characters.

I hope, it was not too confused.

I have no proof for my theory. With my experience on that
field, I highly suspect this as the bottleneck.

Some os as before.

Py 3.2.3
>>> timeit.repeat("('=80'*100+'=80'*100).replace('=80', '=9C')")
[1.5384088242603358, 1.532421642233382, 1.5327445924545433]
>>> timeit.repeat("('=E4'*100+'=E4'*100).replace('=E4', '=DF')")
[1.561762063667686, 1.5443503206462594, 1.5458670051605168]


3.3.0b2
>>> timeit.repeat("('=80'*100+'=80'*100).replace('=80', '=9C')")
[7.701523104134512, 7.720358191179441, 7.614549852683501]>>>=20
>>> timeit.repeat("('=E4'*100+'=E4'*100).replace('=E4', '=DF')")
[4.887939423990709, 4.868787294350611, 4.865697999795991]

Quite mysterious!

In any way it is a regression.

jmf