From: wxjmfauth@gmail.com
Newsgroups: comp.lang.python
Subject: Re: Flexible string representation, unicode, typography, ...
Date: Wed, 29 Aug 2012 04:40:46 -0700 (PDT)

On Wednesday, 29 August 2012 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi wrote:
> > In summary:
> > 1. The problem is not on jmf's computer
> > 2. It is not Windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation that is complexity O(n), in more
> > detail it is going to be O(a + bn)
> > If only a is worse going 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
>
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with.  In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web.  Even though this
> program is still mostly string processing, Python 3.3 wins.  Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing.
> Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins.  And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
> > So doing some arm-chair thinking (I don't know the code and difficulty
> > involved):
> >
> > Clearly there are 3 string-engines in the python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to give the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
>
> Quite difficult.
> Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs.  It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option.  Choice isn't always a good thing.

Forget Python and all these benchmarks. The problem is on another
level: coding schemes, typography, usage of characters, ...

For a given coding scheme, all code points/characters are equivalent.
Expecting to handle a sub-range in a coding scheme without shaking
that coding scheme is impossible.

If a coding scheme does not give satisfaction, the only valid solution
is to create a new coding scheme: cp1252, mac-roman, EBCDIC, ... or the
interesting "TeX" case, where the "internal" coding depends on the
fonts!

Unicode (utf***), as just another coding scheme, does not escape
this rule.

This "Flexible String Representation" fails. Not only is it unable to
stick with a coding scheme, it is a mixture of coding schemes, the
worst of all possible implementations.

jmf
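
For readers who want to see what "flexible" refers to here, the following is a minimal sketch, assuming CPython 3.3 or later, where (per PEP 393) each string is stored with a fixed width of 1, 2 or 4 bytes per code point, chosen from the widest character in that string. The helper name bytes_per_char is made up for this illustration, and sys.getsizeof reports CPython's own accounting, so other implementations may give different numbers.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per code point for a string made only of `ch`.

    Hypothetical helper for illustration only. It compares two freshly
    built strings of the same kind, so the fixed object header cancels
    out and only the per-character storage remains.
    """
    short = ch * 2
    long_ = ch * (n + 2)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // n


if __name__ == "__main__":
    samples = [
        ("ASCII", "a"),
        ("Latin-1", "\xe9"),        # e with acute accent
        ("BMP", "\u20ac"),          # euro sign
        ("astral", "\U0001d11e"),   # musical symbol G clef
    ]
    for label, ch in samples:
        # On CPython 3.3+ the widths are expected to grow from 1 to 4.
        print(label, bytes_per_char(ch))
```

The sketch only shows the storage widths. Whether that amounts to a mixture of coding schemes, as argued above, or to an internal detail behind a single code point abstraction is the point under dispute in this thread.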