From: Chris Angelico
Newsgroups: comp.lang.python
To: python-list@python.org
Date: Wed, 30 Oct 2013 13:17:21 +1100
Subject: Re: trying to strip out non ascii.. or rather convert non ascii

On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence wrote:
> You've stated above that logically unicode is badly handled by the fsr.  You
> then provide a trivial timing example.  WTF???

His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be better in some
areas instead of even timings all along.
But the FSR actually has some distinct benefits even in the areas
he's citing - watch this:

>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.3625614428649451
>>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
0.6753936603674484
>>> timeit.timeit("a = 'hundred'; 'ģ' in a")
0.25663261671525106
>>> timeit.timeit("a = 'hundreĳ'; 'ģ' in a")
0.3582399439035271

The first two examples are his examples done on my computer, so you
can see how all four figures compare. Note how testing for the
presence of a non-Latin1 character in an 8-bit string is very fast.
Same goes for testing for a non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:

>>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
10.083378194714726
>>> timeit.timeit("a = 'hundreĳ'*1000; 'x' in a")
18.656413035735
>>> timeit.timeit("a = 'hundreĳ'*1000; 'ģ' in a")
18.436268855399135
>>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
2.8308718007456264

Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!

ChrisA
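
For anyone who wants to poke at the same effect as a script, here is a rough sketch, assuming CPython 3.3 or later. The helper names bytes_per_char and search_time are made up for illustration, and the strings are built outside the timed code, so the numbers won't line up with the interactive figures above. On CPython the first print should report 1 byte per character for the ASCII string and 2 for the one containing U+0133; the timings depend on your machine.

```python
import sys
import timeit


def bytes_per_char(text):
    """Estimate storage per character by growing the string and measuring the difference."""
    grown = sys.getsizeof(text * 200) - sys.getsizeof(text * 100)
    return grown // (100 * len(text))


def search_time(haystack, needle, number=1000):
    """Time `needle in haystack`, with the strings built outside the timed code."""
    return timeit.timeit(lambda: needle in haystack, number=number)


narrow = "hundred"        # ASCII only
wide = "hundre\u0133"     # ends in U+0133, which is outside Latin-1

print("bytes per char:", bytes_per_char(narrow), "vs", bytes_per_char(wide))

for label, text in (("narrow", narrow * 1000), ("wide", wide * 1000)):
    for needle in ("x", "\u0123"):  # U+0123 is also outside Latin-1
        print(label, ascii(needle), search_time(text, needle))
```

If the narrow-string search for U+0123 comes out far quicker than the others, that is consistent with the interpreter being able to rule out a match without scanning, since a character that wide cannot be stored in an 8-bit string at all.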