Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!news.mixmin.net!feeder.erje.net!newsfeed.xs4all.nl!newsfeed5.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Newsgroups: comp.lang.python
Date: Wed, 29 Aug 2012 04:40:46 -0700 (PDT)
In-Reply-To: <mailman.3920.1346213765.4697.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=62.203.125.238; posting-account=ung4FAoAAAC46zhHJ0Nsnuox7M5gDvs_
References: <mailman.3784.1345854291.4697.python-list@python.org> <1cb3f062-eb45-4b0c-977b-76afb099923c@googlegroups.com> <k1a40u$r47$2@ger.gmane.org> <mailman.3793.1345888006.4697.python-list@python.org> <f6266544-d67c-4589-a3ed-c14428ead237@googlegroups.com> <mailman.3816.1345933655.4697.python-list@python.org> <mailman.3831.1345964382.4697.python-list@python.org> <503a0d51$0$6574$c3e8da3$5496439d@news.astraweb.com> <mailman.3841.1345995646.4697.python-list@python.org> <503a8361$0$6574$c3e8da3$5496439d@news.astraweb.com> <mailman.3853.1346014938.4697.python-list@python.org> <2e92da71-fbd2-467f-9088-1c79fa7bcf69@googlegroups.com> <UIOdnTQtcNTRlKHNnZ2dnUVZ_vednZ2d@westnet.com.au> <a15ab72d-996e-4aff-a70b-440b7baa6d68@j9g2000pbg.googlegroups.com> <mailman.3920.1346213765.4697.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: Flexible string representation, unicode, typography, ...
From: wxjmfauth@gmail.com
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: Python <python-list@python.org>
Precedence: list
Message-ID: <mailman.3927.1346240457.4697.python-list@python.org>
Lines: 217
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:28056

On Wednesday, 29 August 2012 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
> > In summary:
> > 1. The problem is not on jmf's computer
> > 2. It is not windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation that is complexity O(n), in more
> > detail it is going to be O(a + bn)
> > If only a is worse going 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
>
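
To make the a-versus-b question above concrete, here is a minimal sketch of how one could separate the two terms: time the same operation at several sizes n and fit a straight line t(n) = a + b*n by least squares. The names fit_linear_cost and least_squares and the sample operation are illustrative assumptions, not something taken from the thread, and the numbers will depend on the machine and the Python build.

```python
import timeit


def least_squares(xs, ys):
    """Return (a, b) minimising sum((a + b*x - y)**2) over the samples."""
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_linear_cost(op, sizes, repeat=5, number=200):
    """Time op(n) for each n in sizes and fit t(n) ~ a + b*n."""
    times = []
    for n in sizes:
        # Best of `repeat` runs, converted to seconds per single call.
        best = min(timeit.repeat(lambda: op(n), repeat=repeat, number=number))
        times.append(best / number)
    return least_squares(sizes, times)


if __name__ == "__main__":
    # Hypothetical O(n) operation: build a string of n non-Latin-1 characters.
    a, b = fit_linear_cost(lambda n: "\u20ac" * n,
                           [1000, 2000, 4000, 8000, 16000])
    print("a = %.3g s, b = %.3g s/char" % (a, b))
```

Running the same script under 3.2 and under 3.3 and comparing the two (a, b) pairs would indicate which term moved.
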
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with.  In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web.  Even though this
> program is still mostly string processing, Python 3.3 wins.  Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing.  Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins.  And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
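
The pastebin source is not reproduced in this message. Purely as an illustration of what is being timed, a module of that shape might look like the sketch below. Only the function names wc, wc_str and wc_split come from the commands quoted above; the bodies are my own assumption and are probably not Ian's code.

```python
import io
from collections import Counter


def wc_split(text):
    """Count words by splitting the whole text on whitespace at once."""
    return Counter(text.split())


def wc_str(text):
    """Count words by reading the text line by line through a StringIO.

    Assumed variant: the original may use StringIO differently.
    """
    counts = Counter()
    for line in io.StringIO(text):
        counts.update(line.split())
    return counts


def wc(filename):
    """Read a UTF-8 file on every call, then count its words."""
    with open(filename, "r", encoding="utf-8") as f:
        return wc_str(f.read())


if __name__ == "__main__":
    sample = "le c\u0153ur a ses raisons\nque la raison ne conna\u00eet point"
    print(wc_split(sample).most_common(3))
```

Both counting functions return the same counts for the same text; they differ only in how many intermediate strings they create along the way, which is the point of comparison in the timings.
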
> > So doing some arm-chair thinking (I don't know the code and difficulty
> > involved):
> >
> > Clearly there are 3 string-engines in the python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to give the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
>
> Quite difficult.  Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs.  It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option.  Choice isn't always a good thing.

Forget Python and all these benchmarks. The problem
is on another level: coding schemes, typography,
usage of characters, ...

For a given coding scheme, all code points/characters are
equivalent. It is impossible to treat a sub-range of a coding
scheme specially without shaking the scheme itself.

If a coding scheme is not satisfactory, the only
valid solution is to create a new coding scheme, as with cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!

Unicode (utf***), being just another coding scheme, is
no exception to this rule.

This "Flexible String Representation" fails. Not only
is it unable to stick with one coding scheme, it is
a mixture of coding schemes, the worst of all possible
implementations.
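
To make concrete what is meant here by a mixture, the following is a minimal sketch, assuming CPython 3.3 or later, where the storage width of a string depends on its widest code point. The helper bytes_per_char is a name I introduce for illustration; sys.getsizeof is the standard library call. On such a build I would expect 1, 1, 2 and 4 bytes per character for the four samples; other implementations may report something else.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by differencing two string sizes.

    Subtracting the size of an n-character string from that of a
    2n-character string cancels the fixed object header.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n


samples = [
    ("ASCII  U+0061", "a"),
    ("Latin1 U+00E9", "\u00e9"),
    ("BMP    U+20AC", "\u20ac"),
    ("astral U+1D11E", "\U0001d11e"),
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```
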

jmf