From: Mark Lawrence
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: trying to strip out non ascii.. or rather convert non ascii
Date: Wed, 30 Oct 2013 15:25:38 +0000

On 30/10/2013 08:13, wxjmfauth@gmail.com wrote:
> On Wednesday 30 October 2013 03:17:21 UTC+1, Chris Angelico wrote:
>> On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence wrote:
>>> You've stated above that logically unicode is badly handled by the fsr. You
>>> then provide a trivial timing example. WTF???
>>
>> His idea of bad handling is "oh how terrible, ASCII and BMP have
>> optimizations". He hates the idea that it could be better in some
>> areas instead of even timings all along.
>> But the FSR actually has some
>> distinct benefits even in the areas he's citing - watch this:
>>
>> >>> import timeit
>> >>> timeit.timeit("a = 'hundred'; 'x' in a")
>> 0.3625614428649451
>> >>> timeit.timeit("a = 'hundreij'; 'x' in a")
>> 0.6753936603674484
>> >>> timeit.timeit("a = 'hundred'; 'ģ' in a")
>> 0.25663261671525106
>> >>> timeit.timeit("a = 'hundreij'; 'ģ' in a")
>> 0.3582399439035271
>>
>> The first two examples are his examples done on my computer, so you
>> can see how all four figures compare. Note how testing for the
>> presence of a non-Latin1 character in an 8-bit string is very fast.
>> Same goes for testing for a non-BMP character in a 16-bit string. The
>> difference gets even larger if the string is longer:
>>
>> >>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
>> 10.083378194714726
>> >>> timeit.timeit("a = 'hundreij'*1000; 'x' in a")
>> 18.656413035735
>> >>> timeit.timeit("a = 'hundreij'*1000; 'ģ' in a")
>> 18.436268855399135
>> >>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
>> 2.8308718007456264
>>
>> Wow! The FSR speeds up searches immensely! It's obviously the best
>> thing since sliced bread!
>>
>> ChrisA
>
> ---------
>
> It is not obvious to make comparisons with all these
> methods and characters (lookup depending on the position
> in the table, ...). The only thing that can be done and
> observed is the tendency between the subsets the FSR
> artificially creates.
> One can use the best algorithms to adjust bytes, it is
> very hard to escape from the fact that if one manipulates
> two strings with different internal representations, it
> is necessary to find a way to have a "common internal
> coding" prior to manipulations.
> It seems to me that this FSR, with its "negative logic"
> is always attempting to "optimize" with the worst
> case instead of "optimizing" with the best case.
> This kind of effect is shining on the memory side.
> Compare utf-8, which has a memory optimization on
> a per code point basis with the FSR which has an
> optimization based on subsets (One of its purposes).
>
> >>> import sys
> >>> # FSR
> >>> sys.getsizeof( ('a'*1000) + 'z')
> 1026
> >>> sys.getsizeof( ('a'*1000) + '€')
> 2040
> >>> # utf-8
> >>> sys.getsizeof( (('a'*1000) + 'z').encode('utf-8'))
> 1018
> >>> sys.getsizeof( (('a'*1000) + '€').encode('utf-8'))
> 1020
>
> jmf
>

How do these figures compare to the ones quoted here
https://mail.python.org/pipermail/python-dev/2011-September/113714.html ?

--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence
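
For anyone who wants to reproduce the memory comparison quoted above on their own interpreter, here is a rough sketch. It assumes CPython 3.3 or later (the flexible string representation of PEP 393) and relies on sys.getsizeof, whose absolute totals vary between versions and builds, so it only looks at differences and payload sizes. The helper names fsr_width and compare are made up for this illustration and do not come from the thread.

```python
import sys


def fsr_width(sample: str) -> int:
    """Bytes CPython uses per code point for strings built from `sample`.

    Two freshly built repetitions are compared so that the fixed object
    header cancels out and only the per-code-point payload remains.
    Assumption: CPython >= 3.3; other implementations may behave differently.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    grown = sys.getsizeof(sample * 3) - sys.getsizeof(sample * 2)
    return grown // len(sample)


def compare(sample: str) -> dict:
    """Payload sizes (object headers excluded) under the FSR and as UTF-8."""
    width = fsr_width(sample)
    return {
        "code_points": len(sample),
        "fsr_bytes_per_code_point": width,
        "fsr_payload_bytes": width * len(sample),
        "utf8_payload_bytes": len(sample.encode("utf-8")),
    }


if __name__ == "__main__":
    # An ASCII tail, a BMP tail, and a non-BMP tail on the same ASCII body.
    for tail in ("z", "€", "\U0001F600"):
        print(repr(tail), compare("a" * 1000 + tail))
```

With a single '€' appended, the FSR payload goes from 1001 to 2002 bytes because the whole string moves to 2 bytes per code point, while the UTF-8 payload only grows to 1003 bytes. That is the same pattern as the 2040 versus 1020 totals in the quoted session, which also include the object headers.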