
Re: Flexible string representation, unicode, typography, ...

Started by	Antoine Pitrou <solipsis@pitrou.net>
First post	2012-08-25 00:24 +0000
Last post	2012-08-25 07:23 -0400
Articles	20 on this page of 83 — 18 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
      Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
      Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
      Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
          Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
              Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
              Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
                Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
                    Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                        Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
                          Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
                          Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
                            Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
                            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                              Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
                                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
                                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
                                    Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
                                    Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
                                        Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
                                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
                                    Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
                                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                            Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
                                            Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
                                            Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
                                              Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
                                                Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                        Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
                                            Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
                                            Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
                                          Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
                                          Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
                                          Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
                                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
                                    Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
                                        Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
                                    Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
                                Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
                              Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
                            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
                        Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
      Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
      Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
      Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
      Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400


#28251

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-09-02 03:06 -0600
Message-ID	<mailman.68.1346576808.27098.python-list@python.org>
In reply to	#28245

On Sun, Sep 2, 2012 at 1:36 AM,  <wxjmfauth@gmail.com> wrote:
> I still remember my thoughts when I read the PEP 393
> discussion: "this is not logical", "they do no understand
> typography", "atomic character ???", ...

That would indicate one of two possibilities.  Either:

1) Everybody in the PEP 393 discussion except for you is clueless
about how to implement a Unicode type; or

2) You are clueless about how to implement a Unicode type.

Taking into account Occam's razor, and also that you seem to be unable
or unwilling to offer a solid rationale for those thoughts, I have to
say that I'm currently leaning toward the second possibility.

> Real world exemples.
>
>>>> import libfrancais
>>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
> ...     'noétique', 'noèse', 'noirâtre']
>>>> r = libfrancais.sortfr(li)
>>>> r
> ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
> 'noirâtre']

libfrancais does not appear to be publicly available.  It's not listed
in PyPI, and googling for "python libfrancais" turns up nothing
relevant.

Rewriting the example to use locale.strcoll instead:

>>> li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'French_France')
'French_France.1252'
>>> import functools
>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']

# Python 3.2
>>> import timeit
>>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
[0.5544277025009592, 0.5370117249557325, 0.5551836677925053]

# Python 3.3
>>> import timeit
>>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
[0.1421166788364303, 0.12389078130001963, 0.13184190553613462]

As you can see, Python 3.3 is about 77% faster than Python 3.2 on this
example.  If this was intended to show that the Python 3.3 Unicode
representation is a regression over the Python 3.2 implementation,
then it's a complete failure as an example.


#28292

From	wxjmfauth@gmail.com
Date	2012-09-02 11:58 -0700
Message-ID	<f8dfb1ca-e48d-4a2f-baed-3c28a2f89777@googlegroups.com>
In reply to	#28251

On Sunday, 2 September 2012 11:07:35 UTC+2, Ian wrote:
> On Sun, Sep 2, 2012 at 1:36 AM,  <wxjmfauth@gmail.com> wrote:
> > I still remember my thoughts when I read the PEP 393
> > discussion: "this is not logical", "they do no understand
> > typography", "atomic character ???", ...
>
> That would indicate one of two possibilities.  Either:
>
> 1) Everybody in the PEP 393 discussion except for you is clueless
> about how to implement a Unicode type; or
>
> 2) You are clueless about how to implement a Unicode type.
>
> Taking into account Occam's razor, and also that you seem to be unable
> or unwilling to offer a solid rationale for those thoughts, I have to
> say that I'm currently leaning toward the second possibility.
>
> > Real world exemples.
> >
> >>>> import libfrancais
> >>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
> > ...     'noétique', 'noèse', 'noirâtre']
> >>>> r = libfrancais.sortfr(li)
> >>>> r
> > ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
> > 'noirâtre']
>
> libfrancais does not appear to be publicly available.  It's not listed
> in PyPI, and googling for "python libfrancais" turns up nothing
> relevant.
>
> Rewriting the example to use locale.strcoll instead:
>
> >>> li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
> >>> import locale
> >>> locale.setlocale(locale.LC_ALL, 'French_France')
> 'French_France.1252'
> >>> import functools
> >>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
> ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']
>
> # Python 3.2
> >>> import timeit
> >>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
> [0.5544277025009592, 0.5370117249557325, 0.5551836677925053]
>
> # Python 3.3
> >>> import timeit
> >>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
> [0.1421166788364303, 0.12389078130001963, 0.13184190553613462]
>
> As you can see, Python 3.3 is about 77% faster than Python 3.2 on this
> example.  If this was intended to show that the Python 3.3 Unicode
> representation is a regression over the Python 3.2 implementation,
> then it's a complete failure as an example.


- Unfortunately, I got the opposite, and even much worse, results on my
Windows box, considering that
- libfrancais is one of my modules, and it does a little bit more than
the std sorting tools.

My rationale: very simple.

1) I never heard about something better than sticking with one
of the Unicode coding schemes. (general theory)
2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
only one who noticed problems. Arguing, "it is fast enough", is not
a correct answer.

jmf


#28303

From	Michael Torrie <torriem@gmail.com>
Date	2012-09-02 13:45 -0600
Message-ID	<mailman.106.1346615114.27098.python-list@python.org>
In reply to	#28292

On 09/02/2012 12:58 PM, wxjmfauth@gmail.com wrote:
> My rationale: very simple.
> 
> 1) I never heard about something better than sticking with one
> of the Unicode coding schemes. (general theory)
> 2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
> only one who noticed problems. Arguing, "it is fast enough", is not
> a correct answer.

If this is true, why were you holding up Google Go as an example of
doing it right?  Certainly Google Go doesn't line up with your rationale.
Go has both Strings and Runes.  But strings are UTF-8-encoded byte
strings and Runes are 32-bit integers.  They are not interchangeable
without a costly encoding and decoding process.  Even worse, indexing a
Go string to get a "Rune" involves some very costly decoding that has to
be done starting at the beginning of the string each time.

In the worst case, Python's strings are as slow as Go because Python
does the exact same thing as Go, but chooses between three encodings
instead of just one.  Best case scenario, Python's strings could be much
faster than Go's because indexing through 2 of the 3 encodings is O(1)
because they are constant-width encodings.  If as you say, the latin-1
subset of UTF-8 is used, then UTF-8 indexing is O(1) too, otherwise it's
probably O(n).
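
To make the cost under discussion concrete, here is a minimal Python sketch of looking up the n-th code point in a UTF-8 byte string. It is not Go's implementation, and the helper name utf8_code_point_at is hypothetical. Because each code point occupies one to four bytes, the function has to walk the bytes from the start, so the lookup is O(n) in the index.

```python
def utf8_code_point_at(data: bytes, index: int) -> str:
    """Return the code point at position `index` in UTF-8 encoded `data`.

    Hypothetical helper for illustration only: it scans from the first
    byte, so the cost grows with `index`.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    pos = 0    # byte offset into data
    seen = 0   # code points skipped so far
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells how many bytes this code point occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % pos)
        if seen == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError("code point index out of range")


# 'nœud noël' followed by a character outside the BMP.
text = "n\u0153ud no\u00ebl \U0001F600"
encoded = text.encode("utf-8")
# Same answer as native str indexing, but reached by scanning the bytes.
print(utf8_code_point_at(encoded, 1) == text[1])
```

A fixed-width representation avoids the scan entirely, since the byte offset is just the index times the character width.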


#28310

From	Dave Angel <d@davea.name>
Date	2012-09-02 16:07 -0400
Message-ID	<mailman.108.1346616485.27098.python-list@python.org>
In reply to	#28292

On 09/02/2012 03:45 PM, Michael Torrie wrote:
> <jmfauth snipped>:
> In the worst case, Python's strings are as slow as Go because Python
> does the exact same thing as Go, but chooses between three encodings
> instead of just one. Best case scenario, Python's strings could be
> much faster than Go's because indexing through 2 of the 3 encodings is
> O(1) because they are constant-width encodings. If as you say, the
> latin-1 subset of UTF-8 is used, then UTF-8 indexing is O(1) too,
> otherwise it's probably O(n). 

I'm afraid you have it backwards.  The UTF-8 version of the
latin-1-compatible characters would be variable length.  But my
understanding of the PEP is that the internal one-byte format is simply
the lowest order byte of each code point, after assuring that all code
points in the particular string are less than 256.  That's going to
coincidentally resemble latin-1's encoding, but since it's an internal
form, the resemblance is irrelevant.  Anyway, those one-byte values are
going to be O(1), naturally.

No encoding involved, and no searching nor expanding.
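
The fixed-width storage described above can be observed from Python itself. The following is a rough sketch that assumes CPython 3.3 or later (other implementations may store strings differently); the helper name bytes_per_char is made up for this example. It compares the sizes of two strings of different lengths so that the fixed object header cancels out.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate per-character storage for strings made only of `ch`.

    Hypothetical helper: subtracting the sizes of a 1000-character and
    a 2000-character string cancels the fixed object header.
    """
    short = ch * 1000
    long_ = ch * 2000
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // 1000


samples = [
    ("ASCII", "a"),
    ("Latin-1 range", "\u00e9"),      # e with acute accent
    ("BMP", "\u20ac"),                # euro sign
    ("outside the BMP", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On such a build the reported widths should be 1, 1, 2 and 4 bytes: the widest code point in a string decides the width used for every character in it.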

-- 

DaveA


#28316

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-09-02 16:38 -0400
Message-ID	<mailman.114.1346618335.27098.python-list@python.org>
In reply to	#28292

On 9/2/2012 3:45 PM, Michael Torrie wrote:

> In the worst case, Python's strings are as slow as Go because Python
> does the exact same thing as Go, but chooses between three encodings
> instead of just one.  Best case scenario, Python's strings could be much
> faster than Go's because indexing through 2 of the 3 encodings is O(1)

In CPython 3.3, indexing of str text string objects is always O(1) and 
it always indexes and counts code points rather than code units. It 
was the latter for narrow builds in 3.2 and before. As a result, single 
character (code point) strings had a length of 2 rather than 1 for 
extended plane characters. 3.3 corrects this.
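
A short illustration of the code point versus code unit distinction, assuming Python 3.3 or later. The sample character is an arbitrary choice from outside the Basic Multilingual Plane.

```python
s = "a\U0001F600b"   # 'a', one character outside the BMP, 'b'

# len() and indexing work on code points.
print(len(s))
print(s[1] == "\U0001F600")

# In UTF-16 the same character needs a surrogate pair, i.e. two code
# units, which is what a narrow build of 3.2 exposed through len().
utf16_units = len(s.encode("utf-16-le")) // 2
print(utf16_units)
```

Here len(s) is 3 while the UTF-16 form has 4 code units, which is the discrepancy narrow builds used to show.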

-- 
Terry Jan Reedy


#28332

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-03 01:42 +0000
Message-ID	<50440af0$0$29967$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28292

On Sun, 02 Sep 2012 11:58:08 -0700, wxjmfauth wrote:

> - Unfortunately, I got the opposite, and even much worse, results on my
> Windows box, considering that
> - libfrancais is one of my modules, and it does a little bit more than the
> std sorting tools.

How do we know that the problem isn't in your module?

> My rationale: very simple.
> 
> 1) I never heard about something better than sticking with one of the
> Unicode coding schemes. (general theory)

Your ignorance is not a good reason for abandoning a powerful software 
technique.

> 2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
> only one who noticed problems.

That's nice.

Nobody has yet displayed genuine performance problems, only artificial 
and platform-dependent slowdowns that are insignificant in practice. If 
you can demonstrate genuine problems, people will be interested in fixing 
them.

Let me be frank: nobody gives a damn if, for some rare circumstances, 
some_string.replace(another_string) takes 0.3μs instead of 0.1μs. 
Overall, considering multiple platforms and dozens of different string 
operations, PEP 393 is a big win:

- many operations are faster
- a few operations are a LOT faster
- but a very few operations are sometimes slower
- many strings will use less memory
- sometimes a LOT less memory
- no more distinction between wide and narrow builds
- characters in the supplementary planes are now, for the first 
  time in Python, treated correctly by default

That's six wins versus one loss.

> Arguing, "it is fast enough", is not a correct answer.

It is *exactly* the correct answer.

Nobody is going to revert this just because your script now runs in 5.7ms 
instead of 5.2ms. Who cares?

If you are *seriously* interested in debugging why string code is slower 
for you, you can start by running the full suite of Python string 
benchmarks: see the stringbench benchmark in the Tools directory of 
source installations, or see here:

http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench

-- 
Steven


#28359

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-09-03 18:26 +0300
Message-ID	<mailman.147.1346686000.27098.python-list@python.org>
In reply to	#28332

On 03.09.12 04:42, Steven D'Aprano wrote:
> If you are *seriously* interested in debugging why string code is slower
> for you, you can start by running the full suite of Python string
> benchmarks: see the stringbench benchmark in the Tools directory of
> source installations, or see here:
>
> http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench

http://hg.python.org/cpython/file/default/Tools/stringbench

However, stringbench is not a good tool to measure the effectiveness of 
the new string representation, because it focuses mainly on ASCII strings 
and on comparing strings with bytes.


#28377

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-04 00:53 +0000
Message-ID	<504550ff$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28359

On Mon, 03 Sep 2012 18:26:02 +0300, Serhiy Storchaka wrote:

> On 03.09.12 04:42, Steven D'Aprano wrote:
>> If you are *seriously* interested in debugging why string code is
>> slower for you, you can start by running the full suite of Python
>> string benchmarks: see the stringbench benchmark in the Tools directory
>> of source installations, or see here:
>>
>> http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench
> 
> http://hg.python.org/cpython/file/default/Tools/stringbench
> 
> However, stringbench is not a good tool to measure the effectiveness of
> the new string representation, because it focuses mainly on ASCII strings
> and on comparing strings with bytes.

But it is a good place to start, so you can develop unicode benchmarks.
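
As a possible starting point for such a Unicode benchmark, here is a small sketch built on timeit. It is not part of stringbench; the sample texts, the choice of str.replace as the operation, and the name bench_replace are all assumptions made for illustration. It times the same operation on one string per storage width.

```python
import timeit

# One sample string per PEP 393 storage width (sample text is arbitrary).
SAMPLES = {
    "ascii": "noir " * 200,
    "latin-1": "no\u00ebl " * 200,        # e with diaeresis, below U+0100
    "bmp": "n\u0153ud " * 200,            # oe ligature, above U+00FF
    "astral": "n\U0001F600ud " * 200,     # character outside the BMP
}


def bench_replace(text: str, number: int = 2000) -> float:
    """Best-of-three time, in seconds, for `number` calls to str.replace."""
    timer = timeit.Timer(lambda: text.replace("n", "N"))
    return min(timer.repeat(repeat=3, number=number))


results = {name: bench_replace(text) for name, text in SAMPLES.items()}
for name, seconds in results.items():
    print("%-8s %.6f s" % (name, seconds))
```

Running the same script under 3.2 and 3.3 on the same machine gives a like-for-like comparison; the absolute numbers are platform-dependent and say little on their own.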


-- 
Steven



#28257

From	Peter Otten <__peter__@web.de>
Date	2012-09-02 11:52 +0200
Message-ID	<mailman.74.1346579541.27098.python-list@python.org>
In reply to	#28245

Ian Kelly wrote:

> Rewriting the example to use locale.strcoll instead:
 
>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))

There is also locale.strxfrm() which you can use directly:

sorted(li, key=locale.strxfrm)
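
The two approaches can be compared side by side. This is a sketch only: the locale name fr_FR.UTF-8 is an assumption and may not be installed, in which case the code falls back to whatever collation locale is already active. The function name sort_both_ways is made up for the example.

```python
import functools
import locale

words = ["no\u00ebl", "noir", "n\u0153ud", "noduleux",
         "no\u00e9tique", "no\u00e8se", "noir\u00e2tre"]


def sort_both_ways(items, locale_name="fr_FR.UTF-8"):
    """Sort `items` with strcoll and with strxfrm under one collation locale.

    `locale_name` is an assumption; if it is not available, the current
    collation locale is used instead.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            pass  # fall back to the locale that is already active
        by_strcoll = sorted(items, key=functools.cmp_to_key(locale.strcoll))
        by_strxfrm = sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting
    return by_strcoll, by_strxfrm


by_strcoll, by_strxfrm = sort_both_ways(words)
print(by_strcoll == by_strxfrm)
```

With strxfrm each string is transformed once into a sort key, whereas cmp_to_key(strcoll) calls strcoll for every comparison the sort makes; both should produce the same order.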


#28260

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-02 11:36 +0100
Message-ID	<mailman.76.1346582082.27098.python-list@python.org>
In reply to	#28245

I've found the white paper which gives the technical basis for the 
claims made by jmf so thought I'd better share in order to explain his 
rationale.

http://www.montypython.net/scripts/right-think.php

-- 
Cheers.

Mark Lawrence.


#28267

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-09-02 15:00 +0300
Message-ID	<mailman.83.1346587277.27098.python-list@python.org>
In reply to	#28245

On 02.09.12 12:52, Peter Otten wrote:
> Ian Kelly wrote:
>
>> Rewriting the example to use locale.strcoll instead:
>
>>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
>
> There is also locale.strxfrm() which you can use directly:
>
> sorted(li, key=locale.strxfrm)

Hmm, and with locale.strxfrm, Python 3.3 is 20% slower than 3.2.


#28337

From	wxjmfauth@gmail.com
Date	2012-09-02 22:39 -0700
Message-ID	<b7514131-3162-4c6f-909c-52df5d666992@googlegroups.com>
In reply to	#28267

On Sunday, 2 September 2012 14:01:18 UTC+2, Serhiy Storchaka wrote:
> On 02.09.12 12:52, Peter Otten wrote:
> > Ian Kelly wrote:
> >
> >> Rewriting the example to use locale.strcoll instead:
> >
> >>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
> >
> > There is also locale.strxfrm() which you can use directly:
> >
> > sorted(li, key=locale.strxfrm)
>
> Hmm, and with locale.strxfrm, Python 3.3 is 20% slower than 3.2.

With a memory gain = 0 since my text contains non-latin-1 characters!

jmf


#28339

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-03 07:11 +0100
Message-ID	<mailman.127.1346652593.27098.python-list@python.org>
In reply to	#28337

On 03/09/2012 06:39, wxjmfauth@gmail.com wrote:
> On Sunday, 2 September 2012 at 14:01:18 UTC+2, Serhiy Storchaka wrote:
>> On 02.09.12 12:52, Peter Otten wrote:
>>> Ian Kelly wrote:
>>>
>>>> Rewriting the example to use locale.strcoll instead:
>>>
>>>>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
>>>
>>> There is also locale.strxfrm() which you can use directly:
>>>
>>> sorted(li, key=locale.strxfrm)
>>
>> Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.
>
> With a memory gain = 0 since my text contains non-latin-1 characters!
>
> jmf

This is getting really funny.  Do you make a living writing comedy for 
big film or TV studios?  Your response to Steven D'Aprano's "That's six 
wins versus one loss." should be hilarious.  Or do you not respond to 
fact-based posts?

-- 
Cheers.

Mark Lawrence.

#28340

From	Peter Otten <__peter__@web.de>
Date	2012-09-03 08:15 +0200
Message-ID	<mailman.128.1346652940.27098.python-list@python.org>
In reply to	#28337

wxjmfauth@gmail.com wrote:

> On Sunday, 2 September 2012 at 14:01:18 UTC+2, Serhiy Storchaka wrote:

>> Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.
> 
> With a memory gain = 0 since my text contains non-latin-1 characters!

I can't confirm this. At least users of wide builds will see a decrease in 
memory use:

$ cat strxfrm_getsize.py 
import locale
import sys

print("maxunicode:", sys.maxunicode)
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
words = [
    'noël', 'noir', 'nœud', 'noduleux',
    'noétique', 'noèse', 'noirâtre']
print("total size of original strings:",
      sum(sys.getsizeof(s) for s in words))
print(
    "total size of transformed strings:",
    sum(sys.getsizeof(locale.strxfrm(s)) for s in words))

$ python3.2 strxfrm_getsize.py
maxunicode: 1114111
total size of original strings: 584
total size of transformed strings: 980

$ python3.3 strxfrm_getsize.py
maxunicode: 1114111
total size of original strings: 509
total size of transformed strings: 483

The situation is more complex than you suppose -- you need less dogma and 
more experiments ;)
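
The 3.3 numbers are easier to read with the per-string storage width in view. A minimal sketch, assuming CPython 3.3 or later, where each string is stored with 1, 2 or 4 bytes per character depending on its widest character; char_width is a hypothetical helper that mirrors that rule, not an interpreter API.

```python
import sys


def char_width(s):
    """Bytes per character that CPython 3.3+ is expected to use for s."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1   # ASCII and Latin-1
    if top < 0x10000:
        return 2   # rest of the Basic Multilingual Plane
    return 4       # characters beyond the BMP


words = ['noël', 'noir', 'nœud', 'noduleux',
         'noétique', 'noèse', 'noirâtre']
for w in words:
    print("%-10s width=%d getsizeof=%d" % (w, char_width(w), sys.getsizeof(w)))
```

Of the seven words only 'nœud' contains a character outside Latin-1 (œ is U+0153), so it alone needs the 2-byte form; on a 3.2 wide build every character takes 4 bytes regardless.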

#28344

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-09-03 04:38 -0400
Message-ID	<mailman.134.1346661523.27098.python-list@python.org>
In reply to	#28337

On 9/3/2012 2:15 AM, Peter Otten wrote:
> At least users of wide builds will see a decrease in memory use:

Everyone saves because everyone uses large parts of the stdlib. When 3.3 
starts up in a Windows console, there are 56 modules in sys.modules. With 
Idle, there are over 130. All the identifiers, all the global, local, 
and attribute names are present as ascii-only strings. Now multiply that 
by some reasonable average, keeping in mind that __builtins__ alone has 
148 names.

Former narrow build users gain less space but also gain the elimination 
of buggy behavior.

-- 
Terry Jan Reedy
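
These counts can be checked on any installation. A minimal sketch, with the caveat that the exact numbers vary by Python version, platform, and how the interpreter was started; is_ascii is a hypothetical helper.

```python
import builtins
import sys


def is_ascii(s):
    """True if every character of s is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in s)


print("modules loaded:", len(sys.modules))

names = dir(builtins)
ascii_names = [n for n in names if is_ascii(n)]
print("names in builtins:", len(names))
print("ASCII-only names:", len(ascii_names))

# Total size of the name strings themselves, as reported by this build.
print("bytes used by builtin names:", sum(sys.getsizeof(n) for n in names))
```

Running the same script under 3.2 and 3.3 gives a direct comparison of the last figure for one namespace.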

#28361

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-09-03 18:56 +0300
Message-ID	<mailman.150.1346687830.27098.python-list@python.org>
In reply to	#28337

On 03.09.12 09:15, Peter Otten wrote:
> wxjmfauth@gmail.com wrote:
>> On Sunday, 2 September 2012 at 14:01:18 UTC+2, Serhiy Storchaka wrote:
>
>>> Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.
>>
>> With a memory gain = 0 since my text contains non-latin-1 characters!
>
> I can't confirm this. At least users of wide builds will see a decrease in
> memory use:

And only users of wide builds will see a 20% decrease in speed for this 
data (with longer strings Python 3.3 will outstrip Python 3.2). This 
happens because of the unavoidable UCS2 -> wchar_t and wchar_t -> UCS2 
conversions on platforms with a 4-byte wchar_t. On Windows there should 
be no slowdown.
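
Which case applies can be checked from Python itself. A minimal sketch using ctypes purely for inspection; it is not how the locale module performs the conversion, and the sample string is an assumption.

```python
import ctypes
import sys

# Width of the C wchar_t type on this platform, in bytes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print("platform:", sys.platform)
print("sizeof(wchar_t):", wchar_size)

# A string whose widest character needs 2 bytes per character in 3.3+.
s = "n\u0153ud"

# Copying it into a wchar_t array (plus a terminating NUL) widens every
# character to wchar_size bytes.
buf = ctypes.create_unicode_buffer(s)
print("bytes as wchar_t array:", ctypes.sizeof(buf))
```

Where sizeof(wchar_t) is 2, a 2-byte string already has the layout the C library expects; where it is 4, each character has to be widened on the way in and narrowed on the way out.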

#28268

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-02 13:23 +0100
Message-ID	<mailman.84.1346588596.27098.python-list@python.org>
In reply to	#28245

On 02/09/2012 13:00, Serhiy Storchaka wrote:
> On 02.09.12 12:52, Peter Otten wrote:
>> Ian Kelly wrote:
>>
>>> Rewriting the example to use locale.strcoll instead:
>>
>>>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
>>
>> There is also locale.strxfrm() which you can use directly:
>>
>> sorted(li, key=locale.strxfrm)
>
> Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.
>
>

That's it then I'm giving up with Python.  In future I'll be writing 
everything in machine code to ensure that I get the fastest possible run 
times.

-- 
Cheers.

Mark Lawrence.

#28269

From	Roy Smith <roy@panix.com>
Date	2012-09-02 08:35 -0400
Message-ID	<roy-FC61B4.08351302092012@news.panix.com>
In reply to	#28268

In article <mailman.84.1346588596.27098.python-list@python.org>,
 Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:

> On 02/09/2012 13:00, Serhiy Storchaka wrote:
> > On 02.09.12 12:52, Peter Otten wrote:
> >> Ian Kelly wrote:
> >>
> >>> Rewriting the example to use locale.strcoll instead:
> >>
> >>>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
> >>
> >> There is also locale.strxfrm() which you can use directly:
> >>
> >> sorted(li, key=locale.strxfrm)
> >
> > Hmm, and with locale.strxfrm Python 3.3 is 20% slower than 3.2.
> >
> >
> 
> That's it then I'm giving up with Python.  In future I'll be writing 
> everything in machine code to ensure that I get the fastest possible run 
> times.

Feh.  You software guys are always too willing to sacrifice performance 
for convenience.  If you really want speed, grab yourself a handful of 
chips and a soldering iron.
