

Groups > comp.lang.python > #57656 > unrolled thread

trying to strip out non ascii.. or rather convert non ascii

Started by: bruce <badouglas@gmail.com>
First post: 2013-10-26 16:11 -0400
Last post: 2013-10-30 15:25 +0000
Articles: 2 on this page of 42, 14 participants



Contents

  trying to strip out non ascii.. or rather convert non ascii bruce <badouglas@gmail.com> - 2013-10-26 16:11 -0400
    Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-26 22:24 +0000
      Re: trying to strip out non ascii.. or rather convert non ascii Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-10-26 20:51 -0400
        Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-26 21:11 -0400
          Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-27 02:05 +0000
            Re: trying to strip out non ascii.. or rather convert non ascii Chris Angelico <rosuav@gmail.com> - 2013-10-27 13:15 +1100
          Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-27 09:21 +0000
      Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-26 20:41 -0500
        Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-26 21:54 -0400
          Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-26 21:17 -0500
        Re: trying to strip out non ascii.. or rather convert non ascii Nobody <nobody@nowhere.com> - 2013-10-27 03:21 +0000
          Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-28 07:01 -0700
            Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-28 14:13 +0000
            Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-28 09:23 -0500
              Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-29 05:24 +0000
                Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 01:49 -0700
                  Re: trying to strip out non ascii.. or rather convert non ascii Ned Batchelder <ned@nedbatchelder.com> - 2013-10-30 08:44 -0400
                    Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 09:08 -0700
                      Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 16:24 +0000
                      Re: trying to strip out non ascii.. or rather convert non ascii Ned Batchelder <ned@nedbatchelder.com> - 2013-10-30 13:10 -0400
                      Re: trying to strip out non ascii.. or rather convert non ascii Michael Torrie <torriem@gmail.com> - 2013-10-30 11:54 -0600
                        Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 11:38 -0700
                        Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-30 19:28 -0400
                          Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-31 06:46 -0500
                      Re: trying to strip out non ascii.. or rather convert non ascii Terry Reedy <tjreedy@udel.edu> - 2013-10-30 17:56 -0400
                  Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-31 07:10 +0000
                    Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-31 07:23 +0000
                    Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-31 03:33 -0700
                      Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-01 07:16 +0000
                        Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-11-01 02:00 -0700
                          Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-01 09:18 +0000
            Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-29 05:22 +0000
              Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-29 08:38 -0700
                Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-29 10:52 -0500
                  Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-29 12:16 -0700
                    Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-29 19:54 +0000
                      Re: trying to strip out non ascii.. or rather convert non ascii Piet van Oostrum <piet@vanoostrum.org> - 2013-10-29 21:33 -0400
                        Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 09:19 +0000
                Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-29 15:56 +0000
                Re: trying to strip out non ascii.. or rather convert non ascii Chris Angelico <rosuav@gmail.com> - 2013-10-30 13:17 +1100
                  Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 01:13 -0700
                    Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 15:25 +0000



#58011

From: wxjmfauth@gmail.com
Date: 2013-10-30 01:13 -0700
Message-ID: <679b3de7-1a60-41be-867a-cde162aaba47@googlegroups.com>
In reply to: #57995
On Wednesday, 30 October 2013 03:17:21 UTC+1, Chris Angelico wrote:
> On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> > You've stated above that logically unicode is badly handled by the fsr.  You
> > then provide a trivial timing example.  WTF???
>
> His idea of bad handling is "oh how terrible, ASCII and BMP have
> optimizations". He hates the idea that it could be better in some
> areas instead of even timings all along. But the FSR actually has some
> distinct benefits even in the areas he's citing - watch this:
>
> >>> import timeit
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.3625614428649451
> >>> timeit.timeit("a = 'hundreij'; 'x' in a")
> 0.6753936603674484
> >>> timeit.timeit("a = 'hundred'; 'ģ' in a")
> 0.25663261671525106
> >>> timeit.timeit("a = 'hundreij'; 'ģ' in a")
> 0.3582399439035271
>
> The first two examples are his examples done on my computer, so you
> can see how all four figures compare. Note how testing for the
> presence of a non-Latin1 character in an 8-bit string is very fast.
> Same goes for testing for non-BMP character in a 16-bit string. The
> difference gets even larger if the string is longer:
>
> >>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
> 10.083378194714726
> >>> timeit.timeit("a = 'hundreij'*1000; 'x' in a")
> 18.656413035735
> >>> timeit.timeit("a = 'hundreij'*1000; 'ģ' in a")
> 18.436268855399135
> >>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
> 2.8308718007456264
>
> Wow! The FSR speeds up searches immensely! It's obviously the best
> thing since sliced bread!
>
> ChrisA
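
For anyone who wants to rerun the quoted comparison, here is a minimal sketch, not the exact code from the thread. The helper name time_search and the haystack strings are assumptions chosen for illustration: one haystack is pure ASCII and the other contains a euro sign, so that CPython 3.3 or later (the FSR of PEP 393) stores them with different widths. Absolute timings depend on the machine and the Python version, so none are shown.

```python
import timeit

def time_search(haystack, needle, number=10_000):
    """Return the seconds needed to evaluate `needle in haystack` `number` times."""
    return timeit.timeit(
        "needle in haystack",
        globals={"needle": needle, "haystack": haystack},
        number=number,
    )

# Hypothetical test strings (not the ones used in the thread).
NARROW = "hundred" * 1000        # pure ASCII, one byte per character under the FSR
WIDE = "hundre\u20ac" * 1000     # contains U+20AC, two bytes per character
NEEDLES = ("x", "\u0123")        # an ASCII needle and a non-Latin-1 needle

if __name__ == "__main__":
    for label, haystack in (("narrow", NARROW), ("wide", WIDE)):
        for needle in NEEDLES:
            seconds = time_search(haystack, needle)
            print(label, ascii(needle), round(seconds, 4))
```

Neither needle occurs in either haystack, so every search has to account for the whole string, which is the case the quoted timings exercise.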

---------


It is not straightforward to make comparisons across all these
methods and characters (lookup depending on the position
in the table, ...). The only thing that can be done and
observed is the trend between the subsets the FSR
artificially creates.
One can use the best algorithms to adjust bytes, but it is
very hard to escape the fact that if one manipulates
two strings with different internal representations, it
is necessary to find a "common internal coding" before
manipulating them.
It seems to me that this FSR, with its "negative logic",
is always attempting to "optimize" for the worst
case instead of "optimizing" for the best case.
This kind of effect shows clearly on the memory side.
Compare utf-8, which has a memory optimization on
a per code point basis, with the FSR, which has an
optimization based on subsets (one of its purposes).

>>> import sys
>>> # FSR
>>> sys.getsizeof( ('a'*1000) + 'z')
1026
>>> sys.getsizeof( ('a'*1000) + '€')
2040
>>> # utf-8
>>> sys.getsizeof( (('a'*1000) + 'z').encode('utf-8'))
1018
>>> sys.getsizeof( (('a'*1000) + '€').encode('utf-8'))
1020

jmf
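
The memory comparison above can also be sketched as a script that cancels out the fixed object overhead. This is only an illustration, assuming CPython 3.3 or later, where PEP 393 (the FSR) is in effect. The helper names bytes_per_char and utf8_len are hypothetical and do not come from the thread.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate how many bytes CPython spends per character in a str containing ch.

    The size difference between two strings that differ only by n extra 'a'
    characters is divided by n, so the fixed header size cancels out.
    """
    short = "a" * n + ch
    longer = "a" * (2 * n) + ch
    return (sys.getsizeof(longer) - sys.getsizeof(short)) / n

def utf8_len(ch, n=1000):
    """Length in bytes of n ASCII characters plus ch when encoded as UTF-8."""
    return len(("a" * n + ch).encode("utf-8"))

if __name__ == "__main__":
    # An ASCII, a Latin-1, a BMP and a non-BMP character.
    for ch in ("z", "\u00e9", "\u20ac", "\U0001F600"):
        print(ascii(ch), bytes_per_char(ch), utf8_len(ch))
```

Under the stated assumption, a single wide character raises the per-character cost of the whole string to 2 or 4 bytes, while the UTF-8 length grows only by the encoded size of that one character (1 to 4 bytes). The trade-off is that the fixed-width form allows indexing by code point in constant time.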



#58048

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2013-10-30 15:25 +0000
Message-ID: <mailman.1813.1383146755.18130.python-list@python.org>
In reply to: #58011
On 30/10/2013 08:13, wxjmfauth@gmail.com wrote:
> >>> # FSR
> >>> sys.getsizeof( ('a'*1000) + 'z')
> 1026
> >>> sys.getsizeof( ('a'*1000) + '€')
> 2040
> >>> # utf-8
> >>> sys.getsizeof( (('a'*1000) + 'z').encode('utf-8'))
> 1018
> >>> sys.getsizeof( (('a'*1000) + '€').encode('utf-8'))
> 1020
>
> jmf

How do these figures compare to the ones quoted here:
https://mail.python.org/pipermail/python-dev/2011-September/113714.html ?

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence
