
Re: Flexible string representation, unicode, typography, ...

Started by	Antoine Pitrou <solipsis@pitrou.net>
First post	2012-08-25 00:24 +0000
Last post	2012-08-25 07:23 -0400
Articles	20 on this page of 83 — 18 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
      Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
      Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
      Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
          Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
              Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
              Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
                Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
                    Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                        Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
                          Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
                          Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
                            Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
                            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                              Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
                                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
                                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
                                    Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
                                    Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
                                        Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
                                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
                                    Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
                                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                            Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
                                            Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
                                            Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
                                              Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
                                                Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                        Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
                                            Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
                                            Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
                                          Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
                                          Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
                                          Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
                                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
                                    Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
                                        Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
                                    Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
                                Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
                              Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
                            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
                        Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
      Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
      Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
      Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
      Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400


#28054

From	wxjmfauth@gmail.com
Date	2012-08-29 04:38 -0700
Message-ID	<mailman.3926.1346240303.4697.python-list@python.org>
In reply to	#27998

On Monday, 27 August 2012 22:37:03 UTC+2, (unknown) wrote:
> On Monday, 27 August 2012 22:14:07 UTC+2, Ian wrote:
> > On Mon, Aug 27, 2012 at 1:16 PM,  <wxjmfauth@gmail.com> wrote:
> > > - Why int32 and not uint32? No idea, I tried to find an
> > > answer without asking.
> >
> > UCS-4 is technically only a 31-bit encoding. The sign bit is not used,
> > so the choice of int32 vs. uint32 is inconsequential.
> >
> > (In fact, since they made the decision to limit Unicode to the range 0
> > - 0x0010FFFF, one might even point out that the *entire high-order
> > byte* as well as 3 bits of the next byte are irrelevant.  Truly,
> > UTF-32 is not designed for memory efficiency.)
>
> I know all this. The question is more, why not a uint32 knowing
> there are only positive code points. It seems to me more "natural".

Answer found. In short: using negative ints
simplifies internal tasks.
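
As a quick check of the ranges Ian mentions above, here is a minimal sketch (an illustration added for this archive, not code from any post in the thread). It shows that the highest Unicode code point fits in 21 bits, so the sign bit of a 32-bit integer is never needed. The variable names are made up for the example.

```python
# Highest code point allowed by Unicode.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to store it.
bits_needed = MAX_CODE_POINT.bit_length()
print(bits_needed)  # 21

# It therefore fits in a signed 32-bit integer with room to spare.
fits_in_int32 = MAX_CODE_POINT <= 2**31 - 1

# Python itself refuses code points beyond the Unicode range.
try:
    chr(MAX_CODE_POINT + 1)
    rejected = False
except ValueError:
    rejected = True
```

The 21 bits match Ian's remark: 32 bits minus the high-order byte (8) and 3 more bits leaves 21.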


#28007

From	Neil Hodgson <nhodgson@iinet.net.au>
Date	2012-08-28 09:54 +1000
Message-ID	<UIOdnTQtcNTRlKHNnZ2dnUVZ_vednZ2d@westnet.com.au>
In reply to	#27994

wxjmfauth@gmail.com:

> Go "has" the integers int32 and int64. A rune ensure
> the usage of int32. "Text libs" use runes. Go has only
> bytes and runes.

     Go's text libraries use UTF-8 encoded byte strings. Not arrays of 
runes. See, for example,
http://golang.org/pkg/regexp/

    Are you claiming that UTF-8 is the optimum string representation and 
therefore should be used by Python?

    Neil


#28042

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-29 13:59 +1000
Message-ID	<mailman.3919.1346212776.4697.python-list@python.org>
In reply to	#28007

On Wed, Aug 29, 2012 at 12:42 PM, rusi <rustompmody@gmail.com> wrote:
> Clearly there are 3 string-engines in the python 3 world:
> - 3.2 narrow
> - 3.2 wide
> - 3.3 (flexible)
>
> How difficult would it be to give the choice of string engine as a
> command-line flag?
> This would avoid the nuisance of having two binaries -- narrow and
> wide.
> And it would give the python programmer a choice of efficiency
> profiles.

To what benefit?

3.2 narrow is, I would have to say, buggy. It handles everything up to
\uFFFF without problems, but once you have any character beyond that,
your indexing and slicing are wrong.

3.2 wide is fine but memory-inefficient.

3.3 is never worse than 3.2 except for some tiny checks, and will be
more memory-efficient in many cases.

Supporting narrow would require fixing the handling of surrogates.
Potentially a huge job, and you'll end up with ridiculous performance
in many cases.

So what you're really asking for is a command-line option to force all
strings to have their 'kind' set to 11, UCS-4 storage. That would be
doable, I suppose; it wouldn't require many changes (just a quick
check in string creation functions). But what would be the advantage?
Every string requires 4 bytes per character to store; an optimization
has been lost.

ChrisA
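
Chris's point about characters beyond \uFFFF can be illustrated with a small sketch. It assumes Python 3.3 or later and uses an arbitrary non-BMP character; encoding to UTF-16 is used here only as a way to count the code units a narrow build would have stored.

```python
# A character outside the Basic Multilingual Plane (beyond \uFFFF).
s = "a\U00010400b"

# On Python 3.3+ the string has three characters and indexing is correct.
length = len(s)
middle = s[1]

# A narrow build stored UTF-16 code units instead. Encoding to UTF-16
# shows how many units that is: the non-BMP character takes two
# (a surrogate pair), so the count is one more than the length.
utf16_units = len(s.encode("utf-16-le")) // 2
print(length, utf16_units)
```

On a 3.2 narrow build, len() and indexing worked on those code units, which is where the wrong indexes and slices came from.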


#28044

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-28 22:15 -0600
Message-ID	<mailman.3920.1346213765.4697.python-list@python.org>
In reply to	#28007

On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
> In summary:
> 1. The problem is not on jmf's computer
> 2. It is not windows-only
> 3. It is not directly related to latin-1 encodable or not
>
> The only question which is not yet clear is this:
> Given a typical string operation that is complexity O(n), in more
> detail it is going to be O(a + bn)
> If only a is worse going 3.2 to 3.3, it may be a small issue.
> If b is worse by even a tiny amount, it is likely to be a significant
> regression for some use-cases.

As has been pointed out repeatedly already, this is a microbenchmark.
jmf is focusing on one particular area (string construction) where
Python 3.3 happens to be slower than Python 3.2, ignoring the fact
that real code usually does lots of things other than building
strings, many of which are slower to begin with.  In the real-world
benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
Here's a much more realistic benchmark that nonetheless still focuses
on strings: word counting.

Source: http://pastebin.com/RDeDsgPd

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
1000 loops, best of 3: 310 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
1000 loops, best of 3: 302 usec per loop

"unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
of Unicode characters that I pulled off the web.  Even though this
program is still mostly string processing, Python 3.3 wins.  Of
course, that's not really a very good test -- since it reads the file
on every pass, it probably spends more time in I/O than it does in
actual processing.  Let's try it again with prepared string data:

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
10000 loops, best of 3: 87.3 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
10000 loops, best of 3: 84.6 usec per loop

Nope, 3.3 still wins.  And just for the sake of my own curiosity, I
decided to try it again using str.split() instead of a StringIO.
Since str.split() creates more strings, I expect Python 3.2 might
actually win this time.

C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
10000 loops, best of 3: 88 usec per loop

C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
10000 loops, best of 3: 76.5 usec per loop

Interestingly, although Python 3.2 performs the splits in about the
same time as the StringIO operation, Python 3.3 is significantly
*faster* using str.split(), at least on this data set.

> So doing some arm-chair thinking (I don't know the code and difficulty
> involved):
>
> Clearly there are 3 string-engines in the python 3 world:
> - 3.2 narrow
> - 3.2 wide
> - 3.3 (flexible)
>
> How difficult would it be to give the choice of string engine as a
> command-line flag?
> This would avoid the nuisance of having two binaries -- narrow and
> wide.

Quite difficult.  Even if we avoid having two or three separate
binaries, we would still have separate binary representations of the
string structs.  It makes the maintainability of the software go down
instead of up.

> And it would give the python programmer a choice of efficiency
> profiles.

So instead of having just one test for my Unicode-handling code, I'll
now have to run that same test *three times* -- once for each possible
string engine option.  Choice isn't always a good thing.

Cheers,
Ian
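
The wc module from the pastebin link is not reproduced in this thread, so the following is only a hypothetical stand-in for a split-based word count, timed from inside Python with the standard timeit module. The function body and the sample text are assumptions made up for illustration; they are not Ian's code or his unilang8.htm data.

```python
import timeit
from collections import Counter

def wc_split(text):
    """Count word frequencies by splitting on whitespace.

    Hypothetical stand-in: the real wc.wc_split from the pastebin
    link is not shown in this thread.
    """
    return Counter(text.split())

# Small sample mixing ASCII, BMP and non-BMP characters (made-up data,
# not the unilang8.htm file used in the post).
sample = "spam eggs spam \u0436\u0443\u043a \U00010400 eggs spam " * 100

counts = wc_split(sample)

# Time it roughly the way "python -m timeit" does, from inside Python.
elapsed = timeit.timeit(lambda: wc_split(sample), number=1000)
print(counts.most_common(2), elapsed)
```

Running the same script under two interpreter versions gives a comparison in the spirit of the one above, though absolute numbers will depend on the machine and the data.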


#28049

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-29 08:05 +0000
Message-ID	<503dcd35$0$9416$c3e8da3$76491128@news.astraweb.com>
In reply to	#28044

On Tue, 28 Aug 2012 22:15:31 -0600, Ian Kelly wrote:

> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:

>> How difficult would it be to give the choice of string engine as a
>> command-line flag?
>> This would avoid the nuisance of having two binaries -- narrow and
>> wide.
> 
> Quite difficult.  Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs.  It makes the maintainability of the software go down
> instead of up.

In fairness, there are already multiple binary representations of strings 
in Python 3.3:

- ASCII-only strings use a 1-byte format (PyASCIIObject);

- Compact Unicode objects (PyCompactUnicodeObject), which if I'm reading
  correctly, appears to use a non-fixed width UTF-8 format, but are only
  used when the string length and maximum character are known ahead of
  time;

- Legacy string objects (PyUnicodeObject), which are not compact, and
  which may use as their internal format:

    * 1-byte characters for Latin1-compatible strings;

    * 2-byte UCS-2 characters for strings in the Basic Multilingual Plane;

    * 4-byte UCS-4 characters for strings with at least one non-BMP
      character.

http://www.python.org/dev/peps/pep-0393/#specification


By my calculations, that makes *five* different internal formats for 
strings, at least two of which are capable of representing all Unicode 
characters. I don't think it would add that much additional complexity to 
have a runtime option --always-wide-strings to always use the UCS-4 
format. For, you know, crazy people with more memory than sense.

But I don't think there's any point in exposing further runtime options 
to choose the string representation:

- neither the ASCII nor Latin1 representations can store arbitrary
  Unicode chars, so they're out;

- the UTF-8 format is only used under restrictive circumstances, and so
  is (probably?) unsuitable for all strings.

- the UCS-2 format can, by using surrogate pairs, but that's troublesome
  to get right, some might even say buggy.


>> And it would give the python programmer a choice of efficiency
>> profiles.
> 
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option.  Choice isn't always a good thing.

There is that too.


-- 
Steven
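
One way to observe which storage widths an interpreter actually uses, without reading the C structs, is to compare the sizes of two strings of different lengths. This is a rough sketch that assumes CPython 3.3 or later; the helper name bytes_per_char is made up for the example, and other implementations may report different numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string sizes.

    Taking the difference cancels out the fixed object header.
    Results are specific to CPython 3.3 or later.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

ascii_width = bytes_per_char("a")            # ASCII character
latin1_width = bytes_per_char("\xe9")        # Latin-1 character
bmp_width = bytes_per_char("\u20ac")         # BMP character
astral_width = bytes_per_char("\U00010400")  # non-BMP character
print(ascii_width, latin1_width, bmp_width, astral_width)
```

On CPython 3.3 and later this is expected to report 1, 1, 2 and 4 bytes per character, that is, fixed-width storage whose width depends on the widest character in the string.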


#28055

From	wxjmfauth@gmail.com
Date	2012-08-29 04:40 -0700
Message-ID	<62566024-df1d-4948-a27a-45c7820ddc6c@googlegroups.com>
In reply to	#28044

On Wednesday, 29 August 2012 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
> > In summary:
> > 1. The problem is not on jmf's computer
> > 2. It is not windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation that is complexity O(n), in more
> > detail it is going to be O(a + bn)
> > If only a is worse going 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
>
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with.  In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web.  Even though this
> program is still mostly string processing, Python 3.3 wins.  Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing.  Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins.  And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
> > So doing some arm-chair thinking (I don't know the code and difficulty
> > involved):
> >
> > Clearly there are 3 string-engines in the python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to give the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
>
> Quite difficult.  Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs.  It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option.  Choice isn't always a good thing.

Forget Python and all these benchmarks. The problem
is on another level: coding schemes, typography,
usage of characters, ...

For a given coding scheme, all code points/characters are
equivalent. Expecting to handle a sub-range in a coding
scheme without shaking that coding scheme is impossible.

If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme: cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!

Unicode (utf***), as just another coding scheme, does
not escape this rule.

This "Flexible String Representation" fails. Not only
is it unable to stick with one coding scheme, it is
a mixture of coding schemes, the worst of all possible
implementations.

jmf


#28059

From	Dave Angel <d@davea.name>
Date	2012-08-29 08:01 -0400
Message-ID	<mailman.3929.1346241717.4697.python-list@python.org>
In reply to	#28055

On 08/29/2012 07:40 AM, wxjmfauth@gmail.com wrote:
> <snip>

> Forget Python and all these benchmarks. The problem is on an other
> level. Coding schemes, typography, usage of characters, ... For a
> given coding scheme, all code points/characters are equivalent.
> Expecting to handle a sub-range in a coding scheme without shaking
> that coding scheme is impossible. If a coding scheme does not give
> satisfaction, the only valid solution is to create a new coding
> scheme, cp1252, mac-roman, EBCDIC, ... or the interesting "TeX" case,
> where the "internal" coding depends on the fonts! Unicode (utf***), as
> just one another coding scheme, does not escape to this rule. This
> "Flexible String Representation" fails. Not only it is unable to stick
> with a coding scheme, it is a mixing of coding schemes, the worst of
> all possible implementations. jmf 

Nonsense.  The discussion was not about an encoding scheme, but an
internal representation.  That representation does not change the
programmer's interface in any way other than performance (cpu and memory
usage).   Most of the rest of your babble is unsupported opinion.

Plonk.



-- 

DaveA
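
Dave's point that the internal representation does not change the programmer's interface can be checked with a few ordinary string operations. This is a minimal sketch assuming Python 3.3 or later; the sample strings are arbitrary.

```python
ascii_only = "hello"
with_astral = "hello\U0001F600"   # same text plus one non-BMP character

# Same operations, same semantics, whatever the storage width.
same_prefix = with_astral[:5] == ascii_only             # slicing, equality
same_hash = hash(with_astral[:5]) == hash(ascii_only)   # usable as dict keys
lengths = (len(ascii_only), len(with_astral))           # counted in characters
joined = ascii_only + "\U0001F600" == with_astral       # concatenation
print(same_prefix, same_hash, lengths, joined)
```

The two strings may be stored with different widths internally, yet slicing, comparison, hashing and concatenation all behave as if there were a single string type.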


#28067

From	wxjmfauth@gmail.com
Date	2012-08-29 08:43 -0700
Message-ID	<mailman.3938.1346254994.4697.python-list@python.org>
In reply to	#28059

On Wednesday, 29 August 2012 14:01:57 UTC+2, Dave Angel wrote:
> On 08/29/2012 07:40 AM, wxjmfauth@gmail.com wrote:
> > <snip>
>
> > Forget Python and all these benchmarks. The problem is on an other
> > level. Coding schemes, typography, usage of characters, ... For a
> > given coding scheme, all code points/characters are equivalent.
> > Expecting to handle a sub-range in a coding scheme without shaking
> > that coding scheme is impossible. If a coding scheme does not give
> > satisfaction, the only valid solution is to create a new coding
> > scheme, cp1252, mac-roman, EBCDIC, ... or the interesting "TeX" case,
> > where the "internal" coding depends on the fonts! Unicode (utf***), as
> > just one another coding scheme, does not escape to this rule. This
> > "Flexible String Representation" fails. Not only it is unable to stick
> > with a coding scheme, it is a mixing of coding schemes, the worst of
> > all possible implementations. jmf
>
> Nonsense.  The discussion was not about an encoding scheme, but an
> internal representation.  That representation does not change the
> programmer's interface in any way other than performance (cpu and memory
> usage).   Most of the rest of your babble is unsupported opinion.

I can hit the nail a little more.
I have even a better idea and I'm serious.

If "Python" has found a new way to cover the set
of the Unicode characters, why not propose it
to the Unicode consortium?

Unicode already has three schemes covering practically
all cases: memory consumption, maximum flexibility and
an intermediate solution.
It would be too bad not to share it.

What do you think? ;-)

jmf


#28092

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-30 06:55 +0000
Message-ID	<503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com>
In reply to	#28067

On Wed, 29 Aug 2012 08:43:05 -0700, wxjmfauth wrote:

> I can hit the nail a little more.
> I have even a better idea and I'm serious.
> 
> If "Python" has found a new way to cover the set of the Unicode
> characters, why not proposing it to the Unicode consortium?

Because the implementation of the str datatype in a programming language 
has nothing to do with the Unicode consortium. You might as well propose 
it to the International Union of Railway Engineers.

> Unicode has already three schemes covering practically all cases: memory
> consumption, maximum flexibility and an intermediate solution.

And Python's solution uses those: UCS-2, UCS-4, and UTF-8.

The only thing which is innovative here is that instead of the Python 
compiler declaring that "all strings will be stored in UCS-2", the 
compiler chooses an implementation for each string as needed. So some 
strings will be stored internally as UCS-4, some as UCS-2, and some as 
ASCII (which is a standard, but not the Unicode consortium's standard).

(And possibly some as UTF-8? I'm not entirely sure from reading the PEP.)

There's nothing radical here, honest.
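
For anyone who wants to see the per-string choice from the Python side,
here is a minimal sketch. It assumes CPython 3.3 or later, where
sys.getsizeof reflects the PEP 393 layout; bytes_per_char is a name made
up for this illustration, not part of any API.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per code point by comparing two string sizes.

    bytes_per_char is a hypothetical helper for this illustration.
    Assumes CPython 3.3+ (PEP 393); other implementations may differ.
    """
    short = ch * n
    longer = ch * (2 * n)
    # The fixed header cancels out; only the per-character cost remains.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) / n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),
    ("BMP", "\u20ac"),
    ("astral", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On such a build the four samples should come out at 1, 1, 2 and 4 bytes
per character, which suggests the narrowest form covers Latin-1 and not
only ASCII.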

-- 
Steven

#28097

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-30 18:59 +1000
Message-ID	<mailman.3961.1346317170.4697.python-list@python.org>
In reply to	#28092

On Thu, Aug 30, 2012 at 6:51 PM,  <wxjmfauth@gmail.com> wrote:
> Pick up a random text and see the probability this
> text match the most optimized case 1 char / 1 byte,
> practically never.

Only if you talk about a huge document. Try, instead, every string
ever used in a Python script.

Practically always.

But I'm wasting my time saying this again. It's been said by multiple
people multiple times.

ChrisA

#28100

From	Roy Smith <roy@panix.com>
Date	2012-08-30 07:02 -0400
Message-ID	<roy-947BF0.07022430082012@news.panix.com>
In reply to	#28092

In article <503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> The only thing which is innovative here is that instead of the Python 
> compiler declaring that "all strings will be stored in UCS-2", the 
> compiler chooses an implementation for each string as needed. So some 
> strings will be stored internally as UCS-4, some as UCS-2, and some as 
> ASCII (which is a standard, but not the Unicode consortium's standard).

Is the implementation smart enough to know that x == y is always False 
if x and y are using different internal representations?

#28133

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-30 16:00 +0000
Message-ID	<503f8e33$0$30001$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28100

On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:

> In article <503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> 
>> The only thing which is innovative here is that instead of the Python
>> compiler declaring that "all strings will be stored in UCS-2", the
>> compiler chooses an implementation for each string as needed. So some
>> strings will be stored internally as UCS-4, some as UCS-2, and some as
>> ASCII (which is a standard, but not the Unicode consortium's standard).
> 
> Is the implementation smart enough to know that x == y is always False
> if x and y are using different internal representations?

But x == y is not necessarily always False just because they have 
different representations. There may be circumstances where two strings 
have different internal representations even though their content is the 
same, so it's an unsafe optimization to automatically treat them as 
unequal.

The closest existing equivalent here is the relationship between ints and 
longs in Python 2. 42 == 42L even though they have different internal 
representations and take up a different amount of space.

My expectation is that the initial implementation of PEP 393 will be 
relatively unoptimized, and over the next few releases it will get more 
efficient. That's usually the way these things go.

-- 
Steven

#28140

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-08-30 16:44 -0400
Message-ID	<mailman.3985.1346359524.4697.python-list@python.org>
In reply to	#28133

On 8/30/2012 12:00 PM, Steven D'Aprano wrote:
> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
>
>> In article <503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com>,
>>   Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>>
>>> The only thing which is innovative here is that instead of the Python
>>> compiler declaring that "all strings will be stored in UCS-2", the
>>> compiler chooses an implementation for each string as needed. So some
>>> strings will be stored internally as UCS-4, some as UCS-2, and some as
>>> ASCII (which is a standard, but not the Unicode consortium's standard).
>>
>> Is the implementation smart enough to know that x == y is always False
>> if x and y are using different internal representations?

Yes, after checking lengths, and in the same circumstances, x != y is True. From
http://hg.python.org/cpython/file/ab6ab44921b2/Objects/unicodeobject.c

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
     int result;

     if (PyUnicode_Check(left) && PyUnicode_Check(right)) {
         PyObject *v;
         if (PyUnicode_READY(left) == -1 ||
             PyUnicode_READY(right) == -1)
             return NULL;
         if (PyUnicode_GET_LENGTH(left) != PyUnicode_GET_LENGTH(right) ||
             PyUnicode_KIND(left) != PyUnicode_KIND(right)) {
             if (op == Py_EQ) {
                 Py_INCREF(Py_False);
                 return Py_False;
             }
             if (op == Py_NE) {
                 Py_INCREF(Py_True);
                 return Py_True;
             }
         }
...
KIND is 1,2,4 bytes/char

'a in s' is also False if a chars are wider than s chars.

If s is all ascii, s.encode('ascii') or s.encode('utf-8') is a fast, 
constant time operation, as I showed earlier in this discussion. This is 
one thing that is much faster in 3.3.

Such things can be tested by timing with different lengths of strings, 
where the initial string creation is done in setup code rather than in 
the repeated operation code.
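
A rough sketch of that kind of measurement, using the standard timeit
module. time_encode is a name invented for this illustration, and the
lengths and repeat count are arbitrary; it makes no claim about what the
numbers will be on any given build.

```python
import timeit

def time_encode(length, number=1000):
    """Time s.encode('ascii') for an all-ASCII string of the given length.

    time_encode is a hypothetical helper for this illustration.
    """
    # The string is created in setup, so only encode() is being timed.
    setup = "s = 'a' * {}".format(length)
    return timeit.timeit("s.encode('ascii')", setup=setup, number=number)

for n in (1000, 10000, 100000):
    print(n, time_encode(n))
```

Because the string is built in setup, its construction cost stays out of
the timed statement; comparing the three results shows how the cost
scales with length.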

> But x == y is not necessarily always False just because they have
> different representations. There may be circumstances where two strings
> have different internal representations even though their content is the
> same, so it's an unsafe optimization to automatically treat them as
> unequal.

I am sure that str objects are always in canonical form once visible to 
Python code. Note that unready (non-canonical) objects are rejected by 
the rich comparison function.
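
To make the "canonical form" argument concrete, here is a small model in
pure Python. It is only a sketch of the rule as I read PEP 393 (width
chosen from the largest code point), not the C implementation;
canonical_kind and could_be_equal are hypothetical names.

```python
def canonical_kind(s):
    """Bytes per code point a canonical string would use (model only)."""
    if not s:
        return 1
    top = max(map(ord, s))
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4

def could_be_equal(x, y):
    """Model of the shortcut: differing length or kind rules out equality."""
    return len(x) == len(y) and canonical_kind(x) == canonical_kind(y)

narrow = "abc"
# Built from a wide string, then cut back to narrow content.
rebuilt = ("abc" + "\U0001F600")[:3]
print(narrow == rebuilt, canonical_kind(narrow), canonical_kind(rebuilt))
print(could_be_equal("abc", "ab\u20ac"))
```

If the width depends only on the content, two strings with equal content
necessarily get the same width, which is why the shortcut is safe for
canonical strings.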

> My expectation is that the initial implementation of PEP 393 will be
> relatively unoptimized,

The initial implementation was a year ago. At least three people have 
expended considerable effort improving it since, so that the slowdown 
mentioned in the PEP has mostly disappeared. The things that are still 
slower are somewhat balanced by things that are faster.

-- 
Terry Jan Reedy

#28172

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-31 12:32 +0000
Message-ID	<5040aed8$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28140

On Thu, 30 Aug 2012 16:44:32 -0400, Terry Reedy wrote:

> On 8/30/2012 12:00 PM, Steven D'Aprano wrote:
>> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
[...]
>>> Is the implementation smart enough to know that x == y is always False
>>> if x and y are using different internal representations?
> 
> Yes, after checking lengths, and in the same circumstances, x != y is True.
[snip C code]

Thanks Terry for looking that up.

> 'a in s' is also False if a chars are wider than s chars.

Now that's a nice optimization!

[...]
>> But x == y is not necessarily always False just because they have
>> different representations. There may be circumstances where two strings
>> have different internal representations even though their content is
>> the same, so it's an unsafe optimization to automatically treat them as
>> unequal.
> 
> I am sure that str objects are always in canonical form once visible to
> Python code. Note that unready (non-canonical) objects are rejected by
> the rich comparison function.

That's one thing that I'm unclear about -- under what circumstances will 
a string be in compact versus non-compact form? Reading between the 
lines, I guess that a lot of the complexity of the implementation only 
occurs while a string is being built. E.g. if you have Python code like 
this:

''.join(str(x) for x in something)  # a generator expression

Python can't tell how much space to allocate for the string -- it doesn't 
know either the overall length of the string or the width of the 
characters. So I presume that there is string builder code for dealing 
with that, and that it involves resizing blocks of memory.

But if you do this:

''.join([str(x) for x in something])  # a list comprehension

Python could scan the list first, find out the widest char, and allocate 
exactly the amount of space needed for the string. Even in Python 2, 
joining a list comp is much faster than joining a gen expression.
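
The speed difference is easy to check for yourself. A minimal timing
sketch follows; time_join is a made-up helper, and the input size and
repeat count are arbitrary choices. It only measures, it does not show
how join allocates its result.

```python
import timeit

def time_join(stmt, number=200):
    """Time one join statement; time_join is a hypothetical helper."""
    # 'something' is built once in setup, outside the timed statement.
    return timeit.timeit(stmt, setup="something = range(1000)", number=number)

gen_time = time_join("''.join(str(x) for x in something)")
list_time = time_join("''.join([str(x) for x in something])")
print("generator expression:", gen_time)
print("list comprehension:  ", list_time)
```

As far as I know, CPython's str.join first turns any non-list iterable
into a sequence before joining, so the generator version pays for that
extra step.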

-- 
Steven

#28182

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-31 09:13 -0600
Message-ID	<mailman.3.1346426052.27098.python-list@python.org>
In reply to	#28172

On Fri, Aug 31, 2012 at 6:32 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> That's one thing that I'm unclear about -- under what circumstances will
> a string be in compact versus non-compact form?

I understand it to be entirely dependent on which API is used to
construct.  The legacy API generates legacy strings, and the new API
generates compact strings.  From the comments in unicodeobject.h:

    /* ASCII-only strings created through PyUnicode_New use the PyASCIIObject
    structure. state.ascii and state.compact are set, and the data
    immediately follow the structure. utf8_length and wstr_length can be found
    in the length field; the utf8 pointer is equal to the data pointer. */

...

    Legacy strings are created by PyUnicode_FromUnicode() and
    PyUnicode_FromStringAndSize(NULL, size) functions. They become ready
    when PyUnicode_READY() is called.

...

    /* Non-ASCII strings allocated through PyUnicode_New use the
    PyCompactUnicodeObject structure. state.compact is set, and the data
    immediately follow the structure. */

Since I'm not sure that this is clear, note that compact vs. legacy
does not describe which character width is used (except that
PyASCIIObject strings are always 1 byte wide).  Legacy and compact
strings can each use the 1, 2, or 4 byte representations.  "Compact"
merely denotes that the character data is stored inline with the
struct (as opposed to being stored somewhere else and pointed at by
the struct), not the relative size of the string data.  Again from the
comments:

    Compact strings use only one memory block (structure + characters),
    whereas legacy strings use one block for the structure and one block
    for characters.
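
The two compact structures can be glimpsed from Python as well, at least
indirectly. This sketch assumes CPython 3.3 or later and that
sys.getsizeof counts the structure plus the inline character data and a
trailing NUL; fixed_overhead is a hypothetical helper. It cannot create
or detect legacy strings.

```python
import sys

def fixed_overhead(ch, n=1000):
    """Estimate the per-string header size for strings made of ch.

    fixed_overhead is a hypothetical helper; assumes CPython 3.3+.
    """
    short = ch * n
    longer = ch * (2 * n)
    # Per-character width, measured rather than assumed.
    width = (sys.getsizeof(longer) - sys.getsizeof(short)) // n
    # Subtract the character data plus one terminating NUL of that width.
    return sys.getsizeof(short) - width * (n + 1)

print("ASCII header:    ", fixed_overhead("a"))
print("non-ASCII header:", fixed_overhead("\u00e9"))
```

The non-ASCII figure should be the larger one, consistent with
PyCompactUnicodeObject carrying extra fields that PyASCIIObject does not
need.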

Cheers,
Ian

#28173

From	Roy Smith <roy@panix.com>
Date	2012-08-31 08:43 -0400
Message-ID	<roy-08D029.08435531082012@news.panix.com>
In reply to	#28133

In article <503f8e33$0$30001$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
> > Is the implementation smart enough to know that x == y is always False
> > if x and y are using different internal representations?
> 
> [...] There may be circumstances where two strings have different 
> internal representations even though their content is the same

If there is a deterministic algorithm which maps string content to 
representation type, then I don't see how it's possible for two strings 
with different representation types to have the same content.  Could you 
give me an example of when this might happen?

#28181

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-31 14:54 +0000
Message-ID	<5040d032$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28173

On Fri, 31 Aug 2012 08:43:55 -0400, Roy Smith wrote:

> In article <503f8e33$0$30001$c3e8da3$5496439d@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> 
>> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
>> > Is the implementation smart enough to know that x == y is always
>> > False if x and y are using different internal representations?
>> 
>> [...] There may be circumstances where two strings have different
>> internal representations even though their content is the same
> 
> If there is a deterministic algorithm which maps string content to
> representation type, then I don't see how it's possible for two strings
> with different representation types to have the same content.  Could you
> give me an example of when this might happen?

There are deterministic algorithms which can result in the same result 
with two different internal formats. Here's an example from Python 2:

py> sum([1, 2**30, -2**30, 2**30, -2**30])
1
py> sum([1, 2**30, 2**30, -2**30, -2**30])
1L

The internal representation (int versus long) differs even though the sum 
is the same.

A second example: the order of keys in a dict is deterministic but 
unpredictable, as it depends on the history of insertions and deletions 
into the dict. So two dicts could be equal, and yet have radically 
different internal layout.

One final example: list resizing. Here are two lists which are equal but 
have different sizes:

py> import sys
py> a = [0]
py> b = range(10000)
py> del b[1:]
py> a == b
True
py> sys.getsizeof(a)
36
py> sys.getsizeof(b)
48

Is PEP 393 another example of this? I have no idea. Somebody who is more 
familiar with the details of the implementation would be able to answer 
whether or not that is the case. I'm just suggesting that it is possible.

-- 
Steven

#28126

From	Antoine Pitrou <solipsis@pitrou.net>
Date	2012-08-30 15:01 +0000
Message-ID	<mailman.3974.1346338910.4697.python-list@python.org>
In reply to	#28092

<wxjmfauth <at> gmail.com> writes:
> 
> Pick up a random text and see the probability this
> text match the most optimized case 1 char / 1 byte,
> practically never.

Funny that you posted a text which does just that:
http://mail.python.org/pipermail/python-list/2012-August/629554.html

> In a funny way, this is what Python was doing and it
> performs better!

I honestly suggest you shut up until you have a clue.

Regards

Antoine.

#28245

From	wxjmfauth@gmail.com
Date	2012-09-02 00:36 -0700
Message-ID	<2a12ba52-232a-41b7-a906-1ec379bbddd7@googlegroups.com>
In reply to	#28126

On Thursday, 30 August 2012 at 17:01:50 UTC+2, Antoine Pitrou wrote:
> 
> 
> I honestly suggest you shut up until you have a clue.
> 
Sorry, Antoine,

I have not the knowledge to dive in the Python code,
but I know what is a character.

The coding of the characters is a domain per se,
independent from the os, from the computer languages.

Before spending time to implement a new algorithm,
maybe it is better to ask, if there is something
better than the actual schemes.

I still remember my thoughts when I read the PEP 393
discussion: "this is not logical", "they do not understand
typography", "atomic character ???", ...

Real-world examples.

>>> import libfrancais
>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
...     'noétique', 'noèse', 'noirâtre']
>>> r = libfrancais.sortfr(li)
>>> r
['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
'noirâtre']

(cf "Le Petit Robert")

or

The *letters* satisfying the requirements of the
"Imprimerie nationale".

jmf

#28249

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-02 09:58 +0100
Message-ID	<mailman.66.1346576268.27098.python-list@python.org>
In reply to	#28245

On 02/09/2012 08:36, wxjmfauth@gmail.com wrote:
> On Thursday, 30 August 2012 at 17:01:50 UTC+2, Antoine Pitrou wrote:
>>
>>
>> I honestly suggest you shut up until you have a clue.
>>
> Sorry, Antoine,
>
> I have not the knowledge to dive in the Python code,
> but I know what is a character.

You're a character, and from my observations on this thread you're very 
humorous. YMMV.

>
> The coding of the characters is a domain per se,
> independent from the os, from the computer languages.
>
> Before spending time to implement a new algorithm,
> maybe it is better to ask, if there is something
> better than the actual schemes.

Please write a new PEP indicating how you would correct your perceived 
deficiencies with PEP 393 and its implementation.

>
> I still remember my thoughts when I read the PEP 393
> discussion: "this is not logical", "they do not understand
> typography", "atomic character ???", ...

When PEP 393 was first drafted how much input did you give during the 
acceptance process, if any?

>
> Real-world examples.
>
>>>> import libfrancais
>>>> li = ['noël', 'noir', 'nœud', 'noduleux', \

Why the unneeded continuation character, fancy wasting storage space?

> ...     'noétique', 'noèse', 'noirâtre']
>>>> r = libfrancais.sortfr(li)
>>>> r
> ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
> 'noirâtre']

What has sorting foreign words got to do with the internal representation 
of the individual characters?

>
> (cf "Le Petit Robert")
>
> or
>
> The *letters* satisfying the requirements of the
> "Imprimerie nationale".
>
> jmf
>

I've just rechecked my calendar and it's definitely not 1st April today. 
Poor old me, I'm baffled as always.

-- 
Cheers.

Mark Lawrence.
