Groups > comp.lang.python > #27843 > unrolled thread

Re: Flexible string representation, unicode, typography, ...

Started by	Antoine Pitrou <solipsis@pitrou.net>
First post	2012-08-25 00:24 +0000
Last post	2012-08-25 07:23 -0400
Articles	20 on this page of 83 — 18 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
      Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
    Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
      Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
      Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
        Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
          Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
              Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
              Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
                Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
                    Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
                        Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
                          Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
                          Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
                            Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
                            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                              Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
                                Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
                                  Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
                                    Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
                                    Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
                                        Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
                                            Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
                                          Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
                                    Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
                                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                            Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
                                            Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
                                            Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
                                              Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
                                                Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
                                        Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
                                            Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
                                            Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
                                            Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
                                          Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
                                        Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
                                          Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
                                          Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                            Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
                                          Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
                                        Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
                                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
                                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
                                    Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
                                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
                                        Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
                                        Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
                                        Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
                                    Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
                                Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
                              Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
                            Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
                      Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
                    Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
                      Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
                        Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
      Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
      Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
      Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
      Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400

Page 4 of 5 — ← Prev page 1 2 3 [4] 5 Next page →

#28272

From	Ramchandra Apte <maniandram01@gmail.com>
Date	2012-09-02 06:48 -0700
Message-ID	<5c453ede-33dd-4b7f-aa53-9424224ec6c7@googlegroups.com>
In reply to	#28268

On Sunday, 2 September 2012 17:53:16 UTC+5:30, Mark Lawrence  wrote:
> On 02/09/2012 13:00, Serhiy Storchaka wrote:
> 
> > On 02.09.12 12:52, Peter Otten wrote:
> 
> >> Ian Kelly wrote:
> 
> >>
> 
> >>> Rewriting the example to use locale.strcoll instead:
> 
> >>
> 
> >>>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
> 
> >>
> 
> >> There is also locale.strxfrm() which you can use directly:
> 
> >>
> 
> >> sorted(li, key=locale.strxfrm)
> 
> >
> 
> > Hmm, and with locale.strxfrm Python 3.3 20% slower than 3.2.
> 
> >
> 
> >
> 
> 
> 
> That's it then I'm giving up with Python.  In future I'll be writing 
> 
> everything in machine code to ensure that I get the fastest possible run 
> 
> times.
> 
> 
> 
> -- 
> 
> Cheers.
> 
> 
> 
> Mark Lawrence.

please make it  *heavily optimized* machine code

[toc] | [prev] | [next] | [standalone]

#28276

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-02 15:46 +0100
Message-ID	<mailman.90.1346597079.27098.python-list@python.org>
In reply to	#28272

On 02/09/2012 14:48, Ramchandra Apte wrote:
>
> please make it  *heavily optimized* machine code
>

Goes without saying.  First thing I'll concentrate on is removing 
superfluous newlines sent by crappy mail clients or similar.

-- 
Cheers.

Mark Lawrence.

[toc] | [prev] | [next] | [standalone]

#28273

From	Ramchandra Apte <maniandram01@gmail.com>
Date	2012-09-02 06:48 -0700
Message-ID	<mailman.87.1346593749.27098.python-list@python.org>
In reply to	#28268

On Sunday, 2 September 2012 17:53:16 UTC+5:30, Mark Lawrence  wrote:
> On 02/09/2012 13:00, Serhiy Storchaka wrote:
> 
> > On 02.09.12 12:52, Peter Otten wrote:
> 
> >> Ian Kelly wrote:
> 
> >>
> 
> >>> Rewriting the example to use locale.strcoll instead:
> 
> >>
> 
> >>>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
> 
> >>
> 
> >> There is also locale.strxfrm() which you can use directly:
> 
> >>
> 
> >> sorted(li, key=locale.strxfrm)
> 
> >
> 
> > Hmm, and with locale.strxfrm Python 3.3 20% slower than 3.2.
> 
> >
> 
> >
> 
> 
> 
> That's it then I'm giving up with Python.  In future I'll be writing 
> 
> everything in machine code to ensure that I get the fastest possible run 
> 
> times.
> 
> 
> 
> -- 
> 
> Cheers.
> 
> 
> 
> Mark Lawrence.

please make it  *heavily optimized* machine code

[toc] | [prev] | [next] | [standalone]

#28364

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-09-03 12:33 -0600
Message-ID	<mailman.153.1346697242.27098.python-list@python.org>
In reply to	#28245

On Sun, Sep 2, 2012 at 6:00 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
> On 02.09.12 12:52, Peter Otten wrote:
>>
>> Ian Kelly wrote:
>>
>>> Rewriting the example to use locale.strcoll instead:
>>
>>
>>>>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
>>
>>
>> There is also locale.strxfrm() which you can use directly:
>>
>> sorted(li, key=locale.strxfrm)
>
>
> Hmm, and with locale.strxfrm Python 3.3 20% slower than 3.2.

Doh!  In Python 3.3, strcoll and strxfrm are the same speed, so I
guess that the actual optimization I'm seeing here is that in Python
3.3, cmp_to_key(strcoll) has been optimized to return strxfrm.

[toc] | [prev] | [next] | [standalone]

#28246

From	wxjmfauth@gmail.com
Date	2012-09-02 00:36 -0700
Message-ID	<mailman.63.1346571419.27098.python-list@python.org>
In reply to	#28126

Le jeudi 30 août 2012 17:01:50 UTC+2, Antoine Pitrou a écrit :
> 
> 
> I honestly suggest you shut up until you have a clue.
> 
Désolé Antoine,

I have not the knowledge to dive in the Python code,
but I know what is a character.

The coding of the characters is a domain per se,
independent from the os, from the computer languages.

Before spending time to implement a new algorithm,
maybe it is better to ask, if there is something
better than the actual schemes.

I still remember my thoughts when I read the PEP 393
discussion: "this is not logical", "they do no understand
typography", "atomic character ???", ...

Real world exemples.

>>> import libfrancais
>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
...     'noétique', 'noèse', 'noirâtre']
>>> r = libfrancais.sortfr(li)
>>> r
['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
'noirâtre']

(cf "Le Petit Robert")

or

The *letters* satisfying the requirements of the
"Imprimerie nationale".

jmf

[toc] | [prev] | [next] | [standalone]

#28134

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-30 10:27 -0600
Message-ID	<mailman.3976.1346344057.4697.python-list@python.org>
In reply to	#28092

On Thu, Aug 30, 2012 at 2:51 AM,  <wxjmfauth@gmail.com> wrote:
> But as soon as you introduce artificially a "latin-1"
> bottleneck, all this machinery just become useless.

How is this a bottleneck?  If you removed the Latin-1 encoding
altogether and limited the flexible representation to just UCS-2 /
UCS-4, I doubt very much that you would see any significant speed
gains. The flexibility is the part that makes string creation slower,
not the Latin-1 option in particular.

> This flexible representation is working absurdly.
> It optimizes the characters you are not using (in one
> sense), it defaults to a non optimized form for the
> characters you wish to use.

I'm sure that if you wanted to you could patch Python to use Latin-9
instead.  Just be prepared for it to be slower than UCS-2, since it
would mean having to encode the code points rather than merely
truncating them.

> Pick up a random text and see the probability this
> text match the most optimized case 1 char / 1 byte,
> practically never.

Pick up a random text and see that this text matches the next most
optimized case, 1 char / 2 bytes: practically always.

> If a user will use exclusively latin-1, she/he is  better
> served by using a dedicated tool for "latin-1"

Speaker as a user who almost exclusively uses Latin-1, I strongly
disagree.  What you're describing is Python 2.x.  The user is always
almost better served by not having to worry about the full extent of
the character set their program might use.  That's why we moved to
Unicode strings in Python 3 in the first place.

> If a user will comfortably work with Unicode, she/he is
> better served by using one of this tools which is using
> properly one of the available Unicode schemes.
>
> In a funny way, this is what Python was doing and it
> performs better!

Seriously, please show us just one *real world* benchmark in which
Python 3.3 performs demonstrably worse than Python 3.2.  All you've
shown so far is this one microbenchmark of string creation that is
utterly irrelevant to actual programs.

[toc] | [prev] | [next] | [standalone]

#28317

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-09-02 23:38 +0300
Message-ID	<mailman.115.1346618346.27098.python-list@python.org>
In reply to	#28092

On 30.08.12 09:55, Steven D'Aprano wrote:
> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.

I see that this misconception widely spread. In fact Python 3.3 uses 
four kinds of ready strings.

* ASCII. All codes <= U+007F.
* UCS1. All codes <= U+00FF, at least one code > U+007F.
* UCS2. All codes <= U+FFFF, at least one code > U+00FF.
* UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Indexing is O(0) for any string.

Also the string can optionally cache UTF-8 and wchar_t* representation.

[toc] | [prev] | [next] | [standalone]

#28333

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-03 01:54 +0000
Message-ID	<50440de2$0$29967$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28317

On Sun, 02 Sep 2012 23:38:49 +0300, Serhiy Storchaka wrote:

> On 30.08.12 09:55, Steven D'Aprano wrote:
>> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.
> 
> I see that this misconception widely spread.

I am not familiar enough with the C implementation to tell what Python 
3.3 actually does, and the PEP assumes a fair amount of familiarity with 
the CPython source. So I welcome corrections.

> In fact Python 3.3 uses four kinds of ready strings.
> 
> * ASCII. All codes <= U+007F.
> * UCS1. All codes <= U+00FF, at least one code > U+007F. 
> * UCS2. All codes <= U+FFFF, at least one code > U+00FF. 
> * UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds uses for all strings, including 
codes > U+FFFF using surrogate pairs.

UCS4 is what Python 3.2 wide builds uses for all strings.

This means that Python 3.3 will no longer have surrogate pairs.

Am I right?

> Indexing is O(0) for any string.

I think you mean O(1) for constant-time lookups.

> Also the string can optionally cache UTF-8 and wchar_t* representation.

Right, that's the bit that wasn't clear -- the UTF-8 data is a cache, not 
the canonical representation.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28334

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-09-02 22:33 -0400
Message-ID	<mailman.124.1346639651.27098.python-list@python.org>
In reply to	#28333

On 9/2/2012 9:54 PM, Steven D'Aprano wrote:
> On Sun, 02 Sep 2012 23:38:49 +0300, Serhiy Storchaka wrote:
>
>> On 30.08.12 09:55, Steven D'Aprano wrote:
>>> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.
>>
>> I see that this misconception widely spread.
>
> I am not familiar enough with the C implementation to tell what Python
> 3.3 actually does, and the PEP assumes a fair amount of familiarity with
> the CPython source. So I welcome corrections.
>
>
>> In fact Python 3.3 uses four kinds of ready strings.
>>
>> * ASCII. All codes <= U+007F.
>> * UCS1. All codes <= U+00FF, at least one code > U+007F.
>> * UCS2. All codes <= U+FFFF, at least one code > U+00FF.
>> * UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.
>
> Where UCS1 is equivalent to Latin-1, correct?
>
> UCS2 is what Python 3.2 narrow builds uses for all strings, including
> codes > U+FFFF using surrogate pairs.
>
> UCS4 is what Python 3.2 wide builds uses for all strings.
>
> This means that Python 3.3 will no longer have surrogate pairs.

Basically, yes. I believe CPython will only use surrogate code points if 
one requests errors=surrogate-escape on decoding or explicitly puts them 
in a literal (\unnnn or \Ummmmmmmm). The consequences fall under the 
'consenting adults' policy.

-- 
Terry Jan Reedy

[toc] | [prev] | [next] | [standalone]

#28358

From	Roy Smith <roy@panix.com>
Date	2012-09-03 11:24 -0400
Message-ID	<roy-4C0CCA.11245603092012@news.panix.com>
In reply to	#28333

In article <50440de2$0$29967$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> > Indexing is O(0) for any string.
> 
> I think you mean O(1) for constant-time lookups.

Why settle for constant-time, when you can have zero-time instead :-)

[toc] | [prev] | [next] | [standalone]

#28360

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-09-03 18:41 +0300
Message-ID	<mailman.149.1346686927.27098.python-list@python.org>
In reply to	#28333

On 03.09.12 04:54, Steven D'Aprano wrote:
> This means that Python 3.3 will no longer have surrogate pairs.
>
> Am I right?

As Terry said, basically, yes. Python 3.3 does not need in surrogate 
pairs, but does not prevent their creation. You can create a surrogate 
code (U+D800..U+DFFF) intentionally (as you can create a single accent 
modifier or other senseless alone charcode), but less likely that you 
will get them unintentionally.

[toc] | [prev] | [next] | [standalone]

#28323

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-09-03 00:45 +0300
Message-ID	<mailman.117.1346622353.27098.python-list@python.org>
In reply to	#28092

On 02.09.12 23:38, Serhiy Storchaka wrote:
> Indexing is O(0) for any string.

Typo. O(1)

[toc] | [prev] | [next] | [standalone]

#28068

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-30 01:54 +1000
Message-ID	<mailman.3939.1346255667.4697.python-list@python.org>
In reply to	#28059

On Thu, Aug 30, 2012 at 1:43 AM,  <wxjmfauth@gmail.com> wrote:
> If "Python" has found a new way to cover the set
> of the Unicode characters, why not proposing it
> to the Unicode consortium?

Python's open source. If some other language wants to borrow the idea,
they can look at the code, or alternatively, just read PEP 393 and
implement something similar. It's a free world.

By the way, can you please trim the quoted text in your replies? It's
rather lengthy.

ChrisA

[toc] | [prev] | [next] | [standalone]

#28060

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-29 22:34 +1000
Message-ID	<mailman.3930.1346243680.4697.python-list@python.org>
In reply to	#28055

On Wed, Aug 29, 2012 at 9:40 PM,  <wxjmfauth@gmail.com> wrote:
> For a given coding scheme, all code points/characters are
> equivalent. Expecting to handle a sub-range in a coding
> scheme without shaking that coding scheme is impossible.

Not all codepoints are equally likely. That's the whole point behind
variable-length encodings like Huffman compression (eg deflation as
used in zip/gzip), UTF-8, quoted-printable, and Morse code. They
handle a sub-range efficiently and the rest of the range less
efficiently.

> If a coding scheme does not give satisfaction, the only
> valid solution is to create a new coding scheme, cp1252,
> mac-roman, EBCDIC, ... or the interesting "TeX" case, where
> the "internal" coding depends on the fonts!

http://xkcd.com/927/

> This "Flexible String Representation" fails. Not only
> it is unable to stick with a coding scheme, it is
> a mixing of coding schemes, the worst of all possible
> implementations.

I propose, then, that we abolish files. Who *knows* how many different
things might be represented in a file! We need a single coding scheme
that can handle everything, without changing representation. This
ridiculous state of affairs must not go on; the same representation
can be used for bitmapped images or raw audio data!

ChrisA

[toc] | [prev] | [next] | [standalone]

#28056

From	wxjmfauth@gmail.com
Date	2012-08-29 04:40 -0700
Message-ID	<mailman.3927.1346240457.4697.python-list@python.org>
In reply to	#28044

Le mercredi 29 août 2012 06:16:05 UTC+2, Ian a écrit :
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody@gmail.com> wrote:
> 
> > In summary:
> 
> > 1. The problem is not on jmf's computer
> 
> > 2. It is not windows-only
> 
> > 3. It is not directly related to latin-1 encodable or not
> 
> >
> 
> > The only question which is not yet clear is this:
> 
> > Given a typical string operation that is complexity O(n), in more
> 
> > detail it is going to be O(a + bn)
> 
> > If only a is worse going 3.2 to 3.3, it may be a small issue.
> 
> > If b is worse by even a tiny amount, it is likely to be a significant
> 
> > regression for some use-cases.
> 
> 
> 
> As has been pointed out repeatedly already, this is a microbenchmark.
> 
> jmf is focusing in one one particular area (string construction) where
> 
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> 
> that real code usually does lots of things other than building
> 
> strings, many of which are slower to begin with.  In the real-world
> 
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> 
> Here's a much more realistic benchmark that nonetheless still focuses
> 
> on strings: word counting.
> 
> 
> 
> Source: http://pastebin.com/RDeDsgPd
> 
> 
> 
> 
> 
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc"
> 
> "wc.wc('unilang8.htm')"
> 
> 1000 loops, best of 3: 310 usec per loop
> 
> 
> 
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc"
> 
> "wc.wc('unilang8.htm')"
> 
> 1000 loops, best of 3: 302 usec per loop
> 
> 
> 
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> 
> of Unicode characters that I pulled off the web.  Even though this
> 
> program is still mostly string processing, Python 3.3 wins.  Of
> 
> course, that's not really a very good test -- since it reads the file
> 
> on every pass, it probably spends more time in I/O than it does in
> 
> actual processing.  Let's try it again with prepared string data:
> 
> 
> 
> 
> 
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t =
> 
> open('unilang8.htm', 'r', encoding
> 
> ='utf-8').read()" "wc.wc_str(t)"
> 
> 10000 loops, best of 3: 87.3 usec per loop
> 
> 
> 
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t =
> 
> open('unilang8.htm', 'r', encoding
> 
> ='utf-8').read()" "wc.wc_str(t)"
> 
> 10000 loops, best of 3: 84.6 usec per loop
> 
> 
> 
> Nope, 3.3 still wins.  And just for the sake of my own curiosity, I
> 
> decided to try it again using str.split() instead of a StringIO.
> 
> Since str.split() creates more strings, I expect Python 3.2 might
> 
> actually win this time.
> 
> 
> 
> 
> 
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t =
> 
> open('unilang8.htm', 'r', encoding
> 
> ='utf-8').read()" "wc.wc_split(t)"
> 
> 10000 loops, best of 3: 88 usec per loop
> 
> 
> 
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t =
> 
> open('unilang8.htm', 'r', encoding
> 
> ='utf-8').read()" "wc.wc_split(t)"
> 
> 10000 loops, best of 3: 76.5 usec per loop
> 
> 
> 
> Interestingly, although Python 3.2 performs the splits in about the
> 
> same time as the StringIO operation, Python 3.3 is significantly
> 
> *faster* using str.split(), at least on this data set.
> 
> 
> 
> 
> 
> > So doing some arm-chair thinking (I dont know the code and difficulty
> 
> > involved):
> 
> >
> 
> > Clearly there are 3 string-engines in the python 3 world:
> 
> > - 3.2 narrow
> 
> > - 3.2 wide
> 
> > - 3.3 (flexible)
> 
> >
> 
> > How difficult would it be to giving the choice of string engine as a
> 
> > command-line flag?
> 
> > This would avoid the nuisance of having two binaries -- narrow and
> 
> > wide.
> 
> 
> 
> Quite difficult.  Even if we avoid having two or three separate
> 
> binaries, we would still have separate binary representations of the
> 
> string structs.  It makes the maintainability of the software go down
> 
> instead of up.
> 
> 
> 
> > And it would give the python programmer a choice of efficiency
> 
> > profiles.
> 
> 
> 
> So instead of having just one test for my Unicode-handling code, I'll
> 
> now have to run that same test *three times* -- once for each possible
> 
> string engine option.  Choice isn't always a good thing.
> 
> 

Forget Python and all these benchmarks. The problem
is on an other level. Coding schemes, typography,
usage of characters, ...

For a given coding scheme, all code points/characters are
equivalent. Expecting to handle a sub-range in a coding
scheme without shaking that coding scheme is impossible.

If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme, cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!

Unicode (utf***), as just one another coding scheme, does
not escape to this rule.

This "Flexible String Representation" fails. Not only
it is unable to stick with a coding scheme, it is
a mixing of coding schemes, the worst of all possible
implementations.

jmf

[toc] | [prev] | [next] | [standalone]

#27995

From	wxjmfauth@gmail.com
Date	2012-08-27 12:16 -0700
Message-ID	<mailman.3882.1346094990.4697.python-list@python.org>
In reply to	#27947

Le dimanche 26 août 2012 22:45:09 UTC+2, Dan Sommers a écrit :
> On 2012-08-26 at 20:13:21 +0000,
> 
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> 
> 
> 
> > I note that not all 32-bit ints are valid code points. I suppose I can
> 
> > see sense in having rune be a 32-bit integer value limited to those
> 
> > valid code points. (But, dammit, why not call it a code point?) But if
> 
> > rune is merely an alias for int32, why not just call it int32?
> 
> 
> 
> Having a "code point" type is a good idea.  If nothing else, human code
> 
> readers can tell that you're doing something with characters rather than
> 
> something with integers.  If your language provides any sort of type
> 
> safety, then you get that, too.
> 
> 
> 
> Calling your code points int32 is a bad idea for the same reason that it
> 
> turned out to be a bad idea to call all my old ASCII characters int8.
> 
> Or all my pointers int<n> (or unsigned int<n>), for n in 16, 20, 24, 32,
> 
> 36, 48, or 64 (or I'm sure other values of n that I never had the pain
> 
> or pleasure of using).
> 

And this is precisely the concept of rune, a real int which
is a name for Unicode code point.

Go "has" the integers int32 and int64. A rune ensure
the usage of int32. "Text libs" use runes. Go has only
bytes and runes.

If you do not like the word "perfection", this mechanism
has at least an ideal simplicity (with probably a lot
of positive consequences).

rune -> int32 -> utf32 -> unicode code points.

- Why int32 and not uint32? No idea, I tried to find an
answer without asking.
- I find the name "rune" elegant. "char" would have been
too confusing.

End. This is supposed to be a Python forum.
jmf

[toc] | [prev] | [next] | [standalone]

#27949

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-26 15:42 -0600
Message-ID	<mailman.3855.1346017353.4697.python-list@python.org>
In reply to	#27946

On Sun, Aug 26, 2012 at 2:13 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sun, 26 Aug 2012 09:40:13 -0600, Ian Kelly wrote:
>
>> I think the documentation for those functions is simply badly worded.
>> The "width in bytes" it returns is not the width of the rune (which as
>> jmf notes is simply an alias for int32 that stores a single code point).
>
> Is this documented somewhere?

http://golang.org/ref/spec#Numeric_types

[toc] | [prev] | [next] | [standalone]

#27955

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-26 23:31 +0000
Message-ID	<503ab1d9$0$1555$c3e8da3$76491128@news.astraweb.com>
In reply to	#27949

On Sun, 26 Aug 2012 15:42:00 -0600, Ian Kelly wrote:

> On Sun, Aug 26, 2012 at 2:13 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Sun, 26 Aug 2012 09:40:13 -0600, Ian Kelly wrote:
>>
>>> I think the documentation for those functions is simply badly worded.
>>> The "width in bytes" it returns is not the width of the rune (which as
>>> jmf notes is simply an alias for int32 that stores a single code
>>> point).
>>
>> Is this documented somewhere?
> 
> http://golang.org/ref/spec#Numeric_types

Thanks.

Well that's just plain nuts.


-- 
Steven

[toc] | [prev] | [next] | [standalone]

#27956

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-26 17:47 -0700
Message-ID	<7x6285kvp5.fsf@ruckus.brouhaha.com>
In reply to	#27955

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>> http://golang.org/ref/spec#Numeric_types
> Thanks.
> Well that's just plain nuts.

I'm not sure how Rust handles Unicode, but overall I think it is more
clueful than Go while having sort of comparable goals.  See:
http://rust-lang.org .

[toc] | [prev] | [next] | [standalone]

#27864

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-25 21:04 +1000
Message-ID	<mailman.3796.1345892653.4697.python-list@python.org>
In reply to	#27854

On Sat, Aug 25, 2012 at 7:46 PM, Frank Millman <frank@chagford.com> wrote:
> Therefore, I think he is saying that he would have preferred that python
> standardise on 4-byte characters, on the grounds that the saving in memory
> does not justify the performance overhead.

If that's indeed the argument, then at least it's something to argue.
What gets difficult is when people complain about the expansion from a
2-byte narrow build to the current 1/2/4-byte representation, which
will indeed use more memory if there are a small number of >0xFFFF
codepoints. But there's a correctness difference there.

ChrisA

[toc] | [prev] | [next] | [standalone]

Page 4 of 5 — ← Prev page 1 2 3 [4] 5 Next page →

csiph-web

Re: Flexible string representation, unicode, typography, ...

Contents

#28272

#28276

#28273

#28364

#28246

#28134

#28317

#28333

#28334

#28358

#28360

#28323

#28068

#28060

#28056

#27995

#27949

#27955

#27956

#27864