Py 3.3, unicode / upper()

Started by	wxjmfauth@gmail.com
First post	2012-12-19 06:23 -0800
Last post	2012-12-20 17:34 -0700
Articles	20 on this page of 47 — 13 participants

  Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 06:23 -0800
    Re: Py 3.3, unicode / upper() Thomas Bach <thbach@students.uni-mainz.de> - 2012-12-19 15:43 +0100
    Re: Py 3.3, unicode / upper() Christian Heimes <christian@python.org> - 2012-12-19 15:52 +0100
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 12:55 -0800
        Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 14:23 -0700
          Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:42 -0800
          Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:42 -0800
        Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 13:01 +1100
        Re: Py 3.3, unicode / upper() Westley Martínez <anikom15@gmail.com> - 2012-12-19 18:53 -0800
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 12:55 -0800
    Re: Py 3.3, unicode / upper() Stefan Krah <stefan-usenet@bytereef.org> - 2012-12-19 16:01 +0100
    Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 02:17 +1100
    Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-19 16:18 +0100
      Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-19 16:22 +0100
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 02:40 +1100
        Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-20 15:57 +0100
      Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 11:27 -0700
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 13:18 -0800
          Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 14:31 -0700
            Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:40 -0800
              Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:48 -0500
              Re: Py 3.3, unicode / upper() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-20 22:51 +0000
            Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:40 -0800
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 13:18 -0800
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-19 19:39 -0500
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 13:03 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-19 21:54 -0500
      Re: Py 3.3, unicode / upper() Westley Martínez <anikom15@gmail.com> - 2012-12-19 19:12 -0800
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 14:22 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 00:32 -0500
        Re: Py 3.3, unicode / upper() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-20 05:51 +0000
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:57 -0800
          Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:30 -0500
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:57 -0800
      Re: Py 3.3, unicode / upper() Serhiy Storchaka <storchaka@gmail.com> - 2012-12-27 21:00 +0200
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-27 11:36 -0800
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-27 11:36 -0800
    Re: Py 3.3, unicode / upper() Christian Heimes <christian@python.org> - 2012-12-19 16:33 +0100
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-29 11:16 -0800
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-29 11:16 -0800
    Re: Py 3.3, unicode / upper() Benjamin Peterson <benjamin@python.org> - 2012-12-19 20:25 +0000
    Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:19 -0800
      Re: Py 3.3, unicode / upper() MRAB <python@mrabarnett.plus.com> - 2012-12-20 20:20 +0000
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-21 08:19 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:12 -0500
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:59 -0500
      Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-20 17:34 -0700

#35237

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-12-20 17:48 -0500
Message-ID	<mailman.1118.1356043711.29569.python-list@python.org>
In reply to	#35214

On 12/20/2012 2:40 PM, wxjmfauth@gmail.com wrote:

> What should a Python user think, if he sees his strings
> are consuming more memory just because he uses non ascii
> characters

What should a Python user think, if he (or she) sees his (or her) 
strings sometimes or often consuming less memory than they did previously?

I think the person should be grateful that people volunteered to make 
the improvement, rather than ungratefully bitch about it.

> or he sees his strings are changing just because
> he "uppercases" them.

Uppercasing strings is supposed to change strings.

> Unicode is here to serve anybody.

This we agree on. Python3.3 unicode serves everybody better than 3.2 does.
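
For readers wondering how uppercasing can change more than the case of each letter: a well-known example is the German sharp s, which has no single-character uppercase form in Unicode's default case mapping. The snippet below is a small sketch, assuming Python 3.3 or later (where str.upper() applies the full Unicode case mappings); the sample word and variable names are only illustrative and are not taken from the thread.

```python
# Assumes Python 3.3+, where str.upper() uses the full Unicode case mapping.
word = "stra\u00dfe"          # "straße", written with an escape for the sharp s
upper = word.upper()

print(upper)                  # the sharp s expands to two letters
print(len(word), len(upper))  # the uppercased string is one character longer
```

The change in length is the behaviour the Unicode standard asks for, which is the sense in which uppercasing "is supposed to change strings".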

-- 
Terry Jan Reedy

#35238

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-12-20 22:51 +0000
Message-ID	<50d3965a$0$29967$c3e8da3$5496439d@news.astraweb.com>
In reply to	#35214

On Thu, 20 Dec 2012 11:40:21 -0800, wxjmfauth wrote:

> I do not care
> about this optimization. I'm not an ascii user. As a non ascii user,
> this optimization is just irrelevant.

WRONG.

Every Python user is an ASCII user. Every Python program has hundreds or 
thousands of ASCII strings.

# === example ===
import random

There's already one ASCII string in your code: the module name "random" 
is ASCII. Let's look inside that module:

py> dir(random)
['BPF', 'LOG4', 'NV_MAGICCONST', 'RECIP_BPF', 'Random', 'SG_MAGICCONST', 
'SystemRandom', 'TWOPI', '_BuiltinMethodType', '_MethodType', 
'_Sequence', '_Set', '__all__', '__builtins__', '__cached__', '__doc__', 
'__file__', '__initializing__', '__loader__', '__name__', '__package__', 
'_acos', '_ceil', '_cos', '_e', '_exp', '_inst', '_log', '_pi', 
'_random', '_sha512', '_sin', '_sqrt', '_test', '_test_generator', 
'_urandom', '_warn', 'betavariate', 'choice', 'expovariate', 
'gammavariate', 'gauss', 'getrandbits', 'getstate', 'lognormvariate', 
'normalvariate', 'paretovariate', 'randint', 'random', 'randrange', 
'sample', 'seed', 'setstate', 'shuffle', 'triangular', 'uniform', 
'vonmisesvariate', 'weibullvariate']

That's another 58 ASCII strings. Let's pick one of those:

py> dir(random.Random)
['VERSION', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', 
'__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', 
'__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', 
'__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', 
'__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', 
'__subclasshook__', '__weakref__', '_randbelow', 'betavariate', 'choice', 
'expovariate', 'gammavariate', 'gauss', 'getrandbits', 'getstate', 
'lognormvariate', 'normalvariate', 'paretovariate', 'randint', 'random', 
'randrange', 'sample', 'seed', 'setstate', 'shuffle', 'triangular', 
'uniform', 'vonmisesvariate', 'weibullvariate']

That's another 51 ASCII strings. Let's pick one of them:

py> dir(random.Random.shuffle)
['__annotations__', '__call__', '__class__', '__closure__', '__code__', 
'__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', 
'__eq__', '__format__', '__ge__', '__get__', '__getattribute__', 
'__globals__', '__gt__', '__hash__', '__init__', '__kwdefaults__', 
'__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', 
'__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', 
'__sizeof__', '__str__', '__subclasshook__']

And another 34 ASCII strings.

So to get access to just *one* method of *one* class of *one* module, we 
have already seen up to 144 ASCII strings. (Some of them will be 
duplicated.)

Even if every one of *your* classes, methods, functions, modules and 
variables are using non-ASCII names, you will still use ASCII strings for 
built-in functions and standard library modules.
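
The counting above can be repeated on any installation with a few lines. This is only a sketch: ascii_names is a hypothetical helper name, and the totals will differ from the 58, 51 and 34 quoted above depending on the Python version.

```python
import random

def ascii_names(obj):
    """Return the attribute names of obj made up only of ASCII characters."""
    return [name for name in dir(obj) if all(ord(ch) < 128 for ch in name)]

for target in (random, random.Random, random.Random.shuffle):
    names = ascii_names(target)
    # Totals vary between Python versions, so none are hard-coded here.
    print(len(names), "of", len(dir(target)), "names are pure ASCII")
```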

> What should a Python user think, if he sees his strings are consuming
> more memory just because he uses non ascii characters

WRONG!

His strings are consuming just as much memory as they need to. You cannot 
fit ten thousand different characters into a single byte. A single byte 
can represent only 2**8 = 256 characters. Two bytes can only represent 
65536 characters at most. Four bytes can represent the entire range of 
every character ever represented in human history, and more, but it is 
terribly wasteful: most strings do not use a billion different 
characters, and so use of a four-byte character encoding uses up to four 
times as much memory as necessary.

You are imagining that non-ASCII users are being discriminated against, 
with their strings being unfairly bloated. But that is not the case. 
Their strings would be equally large in a Python wide-build, give or take 
whatever string object overhead changes from version to 
version. If you are not comparing a wide-build of Python to Python 3.3, 
then your comparison is faulty. You are comparing "buggy Unicode, cannot 
handle the supplementary planes" with "fixed Unicode, can handle the 
supplementary planes". Python 3.2 narrow builds save memory by 
introducing bugs into Unicode strings. Python 3.3 fixes those bugs and 
still saves memory.
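
The per-character cost described above can be measured directly. The sketch below assumes CPython 3.3 or later (PEP 393 strings); bytes_per_char is a hypothetical helper, and it looks only at how much a string grows per added character, so the fixed object overhead cancels out.

```python
import sys

def bytes_per_char(ch):
    """Extra bytes CPython uses when a string of `ch` grows by one character."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",
    "BMP": "\u0100",
    "non-BMP": "\U0001d407",
}
for label, ch in samples.items():
    # Strings are stored with 1, 2 or 4 bytes per character, chosen by
    # the widest character they contain.
    print(label, bytes_per_char(ch))
```

Absolute sys.getsizeof values differ between versions and platforms, which is why the sketch compares differences rather than totals.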

-- 
Steven

#35225

From	wxjmfauth@gmail.com
Date	2012-12-20 11:40 -0800
Message-ID	<mailman.1110.1356037281.29569.python-list@python.org>
In reply to	#35160

Le mercredi 19 décembre 2012 22:31:42 UTC+1, Ian a écrit :
> On Wed, Dec 19, 2012 at 2:18 PM,  <wxjmfauth@gmail.com> wrote:
> > latin-1 (iso-8859-1) ? are you sure ?
>
> Yes.
>
> > >>> sys.getsizeof('a')
> > 26
> > >>> sys.getsizeof('ab')
> > 27
> > >>> sys.getsizeof('aé')
> > 39
>
> Compare to:
>
> >>> sys.getsizeof('a\u0100')
> 42
>
> The reason for the difference you posted is that pure ASCII strings
> have a further optimization, which I glossed over and which is purely
> a savings in overhead:
>
> >>> sys.getsizeof('abcde') - sys.getsizeof('a')
> 4
> >>> sys.getsizeof('ábçdê') - sys.getsizeof('á')
> 4

-----

I know all of this. And this is exactly what I explained.
I do not care about this optimization. I'm not an ascii user.
As a non ascii user, this optimization is just irrelevant.

What should a Python user think, if he sees his strings
are consuming more memory just because he uses non ascii
characters or he sees his strings are changing just because
he "uppercases" them.
Unicode is here to serve anybody.

jmf

#35158

From	wxjmfauth@gmail.com
Date	2012-12-19 13:18 -0800
Message-ID	<mailman.1073.1355951888.29569.python-list@python.org>
In reply to	#35147

Le mercredi 19 décembre 2012 19:27:38 UTC+1, Ian a écrit :
> On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <rosuav@gmail.com> wrote:
> > You may not be familiar with jmf. He's one of our resident trolls, and
> > he has a bee in his bonnet about PEP 393 strings, on the basis that
> > they take up more space in memory than a narrow build of Python 3.2
> > would, for a string with lots of BMP characters and one non-BMP. In
> > 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> > pairs* for non-BMP characters. This means that len() counts them
> > twice, as does string indexing/slicing. That's a major bug, especially
> > as your Python code will do different things on different platforms -
> > most Linux builds of 3.2 are "wide" builds, storing characters in four
> > bytes each.
>
> From what I've been able to discern, his actual complaint about PEP
> 393 stems from misguided moral concerns.  With PEP-393, strings that
> can be fully represented in Latin-1 can be stored in half the space
> (ignoring fixed overhead) compared to strings containing at least one
> non-Latin-1 character.  jmf thinks this optimization is unfair to
> non-English users and immoral; he wants Latin-1 strings to be treated
> exactly like non-Latin-1 strings (I don't think he actually cares
> about non-BMP strings at all; if narrow-build Unicode is good enough
> for him, then it must be good enough for everybody).  Unfortunately
> for him, the Latin-1 optimization is rather trivial in the wider
> context of PEP-393, and simply removing that part alone clearly
> wouldn't be doing anybody any favors.  So for him to get what he
> wants, the entire PEP has to go.
>
> It's rather like trying to solve the problem of wealth disparity by
> forcing everyone to dump their excess wealth into the ocean.

----

latin-1 (iso-8859-1) ? are you sure ?

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ab')
27
>>> sys.getsizeof('aé')
39

Time to go to bed. More complete answer tomorrow.

jmf

#35169

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-12-19 19:39 -0500
Message-ID	<mailman.1081.1355963989.29569.python-list@python.org>
In reply to	#35130

On 12/19/2012 10:40 AM, Chris Angelico wrote:

> Interestingly, IDLE on my Windows box can't handle the bolded
> characters very well...
>
> >>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
> >>> print(s)
> Traceback (most recent call last):
>    File "<pyshell#2>", line 1, in <module>
>      print(s)
> UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407'
> in position 0: Non-BMP character not supported in Tk

On 3.3.0 on Win7, the expressions 's', 'repr(s)', and 'str(s)' (without 
the quotes) echo the input as entered (with \U escapes) while 'print(s)' 
gets the same traceback you did.

-- 
Terry Jan Reedy

#35173

From	Chris Angelico <rosuav@gmail.com>
Date	2012-12-20 13:03 +1100
Message-ID	<mailman.1084.1355969013.29569.python-list@python.org>
In reply to	#35130

On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> From what I've been able to discern, [jmf's] actual complaint about PEP
> 393 stems from misguided moral concerns.  With PEP-393, strings that
> can be fully represented in Latin-1 can be stored in half the space
> (ignoring fixed overhead) compared to strings containing at least one
> non-Latin-1 character.  jmf thinks this optimization is unfair to
> non-English users and immoral; he wants Latin-1 strings to be treated
> exactly like non-Latin-1 strings (I don't think he actually cares
> about non-BMP strings at all; if narrow-build Unicode is good enough
> for him, then it must be good enough for everybody).

Not entirely; most of his complaints are based on performance (speed
and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
edge cases to prove how much worse 3.3 is, while utterly ignoring the
fact that, in those self-same edge cases, 3.2 is buggy.
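
The narrow-build bug referred to here can be illustrated on a modern interpreter. A minimal sketch, assuming Python 3.3 or later: a 3.2 narrow build stored UTF-16 code units, so counting the UTF-16 units of a string approximates what len() reported there. The variable names are only illustrative.

```python
s = "\U0001d407"  # MATHEMATICAL BOLD CAPITAL H, a character outside the BMP

# On Python 3.3+ a string is a sequence of code points.
code_points = len(s)

# A narrow build stored a surrogate pair for this character, so its len()
# would have matched the number of UTF-16 code units instead.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)
```

Indexing and slicing on a narrow build operated on the same code units, which is how a slice could end up cutting a character in half.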

ChrisA

#35177

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-12-19 21:54 -0500
Message-ID	<mailman.1088.1355972082.29569.python-list@python.org>
In reply to	#35130

On 12/19/2012 9:03 PM, Chris Angelico wrote:
> On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>>  From what I've been able to discern, [jmf's] actual complaint about PEP
>> 393 stems from misguided moral concerns.  With PEP-393, strings that
>> can be fully represented in Latin-1 can be stored in half the space
>> (ignoring fixed overhead) compared to strings containing at least one
>> non-Latin-1 character.  jmf thinks this optimization is unfair to
>> non-English users and immoral; he wants Latin-1 strings to be treated
>> exactly like non-Latin-1 strings (I don't think he actually cares
>> about non-BMP strings at all; if narrow-build Unicode is good enough
>> for him, then it must be good enough for everybody).
>
> Not entirely; most of his complaints are based on performance (speed
> and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
> edge cases to prove how much worse 3.3 is, while utterly ignoring the
> fact that, in those self-same edge cases, 3.2 is buggy.

And the fact that stringbench.py is overall about as fast with 3.3 as 
with 3.2 *on the same Windows 7 machine* (which uses narrow build in 
3.2), and that unicode operations are not far from bytes operations when 
the same thing can be done with both.

-- 
Terry Jan Reedy

#35178

From	Westley Martínez <anikom15@gmail.com>
Date	2012-12-19 19:12 -0800
Message-ID	<mailman.1089.1355973157.29569.python-list@python.org>
In reply to	#35130

On Wed, Dec 19, 2012 at 09:54:20PM -0500, Terry Reedy wrote:
> On 12/19/2012 9:03 PM, Chris Angelico wrote:
> >On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> >> From what I've been able to discern, [jmf's] actual complaint about PEP
> >>393 stems from misguided moral concerns.  With PEP-393, strings that
> >>can be fully represented in Latin-1 can be stored in half the space
> >>(ignoring fixed overhead) compared to strings containing at least one
> >>non-Latin-1 character.  jmf thinks this optimization is unfair to
> >>non-English users and immoral; he wants Latin-1 strings to be treated
> >>exactly like non-Latin-1 strings (I don't think he actually cares
> >>about non-BMP strings at all; if narrow-build Unicode is good enough
> >>for him, then it must be good enough for everybody).
> >
> >Not entirely; most of his complaints are based on performance (speed
> >and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
> >edge cases to prove how much worse 3.3 is, while utterly ignoring the
> >fact that, in those self-same edge cases, 3.2 is buggy.
> 
> And the fact that stringbench.py is overall about as fast with 3.3
> as with 3.2 *on the same Windows 7 machine* (which uses narrow build
> in 3.2), and that unicode operations are not far from bytes
> operations when the same thing can be done with both.
> 
> -- 
> Terry Jan Reedy

Really, why should we be so obsessed with speed anyways?  Isn't
improving the language and fixing bugs far more important?

#35179

From	Chris Angelico <rosuav@gmail.com>
Date	2012-12-20 14:22 +1100
Message-ID	<mailman.1090.1355973763.29569.python-list@python.org>
In reply to	#35130

On Thu, Dec 20, 2012 at 2:12 PM, Westley Martínez <anikom15@gmail.com> wrote:
> Really, why should we be so obsessed with speed anyways?  Isn't
> improving the language and fixing bugs far more important?

Because speed is very important in certain areas. Python can be used
in many ways:

* Command-line calculator with awesome precision and variable handling
* Proglets, written once and run once, doing one simple job and then moving on
* Applications that do heaps of work and are run multiple times a day
* Internet services (eg web server), contacted many times a second
* Etcetera
* Etcetera
* And quite a few other ways too

For the first two, performance isn't very important. No matter how
slow the language, it's still going to respond "3" instantly when you
enter "1+2", and unless you're writing something hopelessly
inefficient or brute-force, the time spent writing a proglet usually
dwarfs its execution time.

But performance is very important for something like Mercurial, which
is invoked many times and always with the user waiting for it. You
want to get back to work, not sit there for X seconds while your
source control engine fires up and does something. And with a web
server, language performance translates fairly directly into latency
AND potential requests per second on any given hardware.

To be sure, a lot of Python performance hits the level of "sufficient"
and doesn't need to go further, but it's still worth considering.

ChrisA

#35180

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-12-20 00:32 -0500
Message-ID	<mailman.1091.1355981588.29569.python-list@python.org>
In reply to	#35130

On 12/19/2012 10:12 PM, Westley Martínez wrote:
> On Wed, Dec 19, 2012 at 09:54:20PM -0500, Terry Reedy wrote:
>> On 12/19/2012 9:03 PM, Chris Angelico wrote:
>>> On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
>>>>  From what I've been able to discern, [jmf's] actual complaint about PEP
>>>> 393 stems from misguided moral concerns.  With PEP-393, strings that
>>>> can be fully represented in Latin-1 can be stored in half the space
>>>> (ignoring fixed overhead) compared to strings containing at least one
>>>> non-Latin-1 character.  jmf thinks this optimization is unfair to
>>>> non-English users and immoral; he wants Latin-1 strings to be treated
>>>> exactly like non-Latin-1 strings (I don't think he actually cares
>>>> about non-BMP strings at all; if narrow-build Unicode is good enough
>>>> for him, then it must be good enough for everybody).
>>>
>>> Not entirely; most of his complaints are based on performance (speed
>>> and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
>>> edge cases to prove how much worse 3.3 is, while utterly ignoring the
>>> fact that, in those self-same edge cases, 3.2 is buggy.
>>
>> And the fact that stringbench.py is overall about as fast with 3.3
>> as with 3.2 *on the same Windows 7 machine* (which uses narrow build
>> in 3.2), and that unicode operations are not far from bytes
>> operations when the same thing can be done with both.
>>
>> --
>> Terry Jan Reedy
>
> Really, why should we be so obsessed with speed anyways?  Isn't
> improving the language and fixing bugs far more important?

Being conservative, there are probably at least 10 enhancement patches 
and 30 bug fix patches for every performance patch. Performance patches 
are considered enhancements and only go in new versions with 
enhancements, where they go through the extended alpha, beta, candidate 
test and evaluation process.

In the unicode case, Jim discovered that find was several times slower 
in 3.3 than 3.2 and claimed that that was a reason to not use 3.3. I ran 
the complete stringbench.py and discovered that find (and consequently 
find and replace) are the only operations with such a slowdown. I also 
discovered that another at least as common operation, encoding strings 
that only contain ascii characters to ascii bytes for transmission, is 
several times as fast in 3.3. So I reported that unless one is only 
finding substrings in long strings, there is no reason to not upgrade to 
3.3.

-- 
Terry Jan Reedy

#35181

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-12-20 05:51 +0000
Message-ID	<50d2a773$0$29863$c3e8da3$5496439d@news.astraweb.com>
In reply to	#35180

On Thu, 20 Dec 2012 00:32:42 -0500, Terry Reedy wrote:

> In the unicode case, Jim discovered that find was several times slower
> in 3.3 than 3.2 and claimed that that was a reason to not use 3.3. I ran
> the complete stringbench.py and discovered that find (and consequently
> find and replace) are the only operations with such a slowdown. I also
> discovered that another at least as common operation, encoding strings
> that only contain ascii characters to ascii bytes for transmission, is
> several times as fast in 3.3. So I reported that unless one is only
> finding substrings in long strings, there is no reason to not upgrade to
> 3.3.

Yes, and if you remember, Jim (jmf) based his complaints on very possibly 
the worst edge-case for the new Unicode implementation:

- generate a large string of characters
- replace every character in that string with another character

By memory:

s = "a"*100000
s = s.replace("a", "b")

or equivalent. Hardly representative of normal string processing, and 
likely to be the worst-performing operation on new Unicode strings. And 
yet even so, many people reported either a mild slow down or, in a few 
cases, a small speed up.
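
Anyone who wants to repeat the measurement on their own interpreters can use the standard timeit module. This is a rough sketch of the case recalled above, not the original benchmark; time_replace and its default arguments are assumptions chosen for illustration.

```python
import timeit

def time_replace(n=100000, number=100):
    """Time `number` runs of replacing every character in an n-character string."""
    s = "a" * n
    return timeit.timeit(lambda: s.replace("a", "b"), number=number)

# Run the same script under each Python version being compared.
print("{:.4f} s for 100 runs".format(time_replace()))
```

A single operation timed this way says little about overall string performance, which is the point made above about it being an edge case.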

-- 
Steven

#35216

From	wxjmfauth@gmail.com
Date	2012-12-20 11:57 -0800
Message-ID	<43140393-080f-4dad-98e3-c9f27acd9490@googlegroups.com>
In reply to	#35180

Le jeudi 20 décembre 2012 06:32:42 UTC+1, Terry Reedy a écrit :
> On 12/19/2012 10:12 PM, Westley Martínez wrote:
> > On Wed, Dec 19, 2012 at 09:54:20PM -0500, Terry Reedy wrote:
> >> On 12/19/2012 9:03 PM, Chris Angelico wrote:
> >>> On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> >>>>  From what I've been able to discern, [jmf's] actual complaint about PEP
> >>>> 393 stems from misguided moral concerns.  With PEP-393, strings that
> >>>> can be fully represented in Latin-1 can be stored in half the space
> >>>> (ignoring fixed overhead) compared to strings containing at least one
> >>>> non-Latin-1 character.  jmf thinks this optimization is unfair to
> >>>> non-English users and immoral; he wants Latin-1 strings to be treated
> >>>> exactly like non-Latin-1 strings (I don't think he actually cares
> >>>> about non-BMP strings at all; if narrow-build Unicode is good enough
> >>>> for him, then it must be good enough for everybody).
> >>>
> >>> Not entirely; most of his complaints are based on performance (speed
> >>> and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
> >>> edge cases to prove how much worse 3.3 is, while utterly ignoring the
> >>> fact that, in those self-same edge cases, 3.2 is buggy.
> >>
> >> And the fact that stringbench.py is overall about as fast with 3.3
> >> as with 3.2 *on the same Windows 7 machine* (which uses narrow build
> >> in 3.2), and that unicode operations are not far from bytes
> >> operations when the same thing can be done with both.
> >>
> >> --
> >> Terry Jan Reedy
> >
> > Really, why should we be so obsessed with speed anyways?  Isn't
> > improving the language and fixing bugs far more important?
>
> Being conservative, there are probably at least 10 enhancement patches
> and 30 bug fix patches for every performance patch. Performance patches
> are considered enhancements and only go in new versions with
> enhancements, where they go through the extended alpha, beta, candidate
> test and evaluation process.
>
> In the unicode case, Jim discovered that find was several times slower
> in 3.3 than 3.2 and claimed that that was a reason to not use 3.3. I ran
> the complete stringbench.py and discovered that find (and consequently
> find and replace) are the only operations with such a slowdown. I also
> discovered that another at least as common operation, encoding strings
> that only contain ascii characters to ascii bytes for transmission, is
> several times as fast in 3.3. So I reported that unless one is only
> finding substrings in long strings, there is no reason to not upgrade to
> 3.3.
>
> --
> Terry Jan Reedy

--------

I showed a case, "replace", where Py33 works 10 times slower than Py32.
You, the devs, spend your time to correct that case.

Now, if I put on the table an example working 20 times
slower, will you spend your time to optimize that?

I'm afraid it is the FSR which is problematic, not the
corner cases.

jmf

#35236

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-12-20 17:30 -0500
Message-ID	<mailman.1117.1356042659.29569.python-list@python.org>
In reply to	#35216

On 12/20/2012 2:57 PM, wxjmfauth@gmail.com wrote:

> I showed a case, "replace", where Py33 works 10 times slower than Py32.
> You, the devs, spend your time to correct that case.

I discovered that it is the 'find' part of find and replace that is 
slower. The comparison is worse on Windows than on *nix. There is an 
issue on the tracker so it may be improved someday. Most devs are not 
especially bothered and would rather fix errors as part of their 
volunteer work.

> Now, if I put on the table an example working 20 times
> slower, will you spend your time to optimize that?
>
> I'm afraid it is the FSR which is problematic, not the
> corner cases.

I showed another case where 3.3 is a thousand, a million times faster 
than 3.2. Does that make the old way 'problematic'?

Don't you consider the bugs (wrong answers) in narrow builds to be 
'problematic'? Do you really think that getting wrong answers faster is 
better than getting right answers possibly slower?

The 'find' operation is just 1 of about 30 that are tested by 
stringbench.py. Run that on 3.3 and 3.2, as I did, before talking about 
FSR as 'problematic'.

-- 
Terry Jan Reedy

#35217

From	wxjmfauth@gmail.com
Date	2012-12-20 11:57 -0800
Message-ID	<mailman.1106.1356034079.29569.python-list@python.org>
In reply to	#35180

Le jeudi 20 décembre 2012 06:32:42 UTC+1, Terry Reedy a écrit :
> On 12/19/2012 10:12 PM, Westley Martínez wrote:
> 
> > On Wed, Dec 19, 2012 at 09:54:20PM -0500, Terry Reedy wrote:
> 
> >> On 12/19/2012 9:03 PM, Chris Angelico wrote:
> 
> >>> On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> 
> >>>>  From what I've been able to discern, [jmf's] actual complaint about PEP
> 
> >>>> 393 stems from misguided moral concerns.  With PEP-393, strings that
> 
> >>>> can be fully represented in Latin-1 can be stored in half the space
> 
> >>>> (ignoring fixed overhead) compared to strings containing at least one
> 
> >>>> non-Latin-1 character.  jmf thinks this optimization is unfair to
> 
> >>>> non-English users and immoral; he wants Latin-1 strings to be treated
> 
> >>>> exactly like non-Latin-1 strings (I don't think he actually cares
> 
> >>>> about non-BMP strings at all; if narrow-build Unicode is good enough
> 
> >>>> for him, then it must be good enough for everybody).
> 
> >>>
> 
> >>> Not entirely; most of his complaints are based on performance (speed
> 
> >>> and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
> 
> >>> edge cases to prove how much worse 3.3 is, while utterly ignoring the
> 
> >>> fact that, in those self-same edge cases, 3.2 is buggy.
> 
> >>
> 
> >> And the fact that stringbench.py is overall about as fast with 3.3
> 
> >> as with 3.2 *on the same Windows 7 machine* (which uses narrow build
> 
> >> in 3.2), and that unicode operations are not far from bytes
> 
> >> operations when the same thing can be done with both.
> 
> >>
> 
> >> --
> 
> >> Terry Jan Reedy
> 
> >
> 
> > Really, why should we be so obsessed with speed anyways?  Isn't
> 
> > improving the language and fixing bugs far more important?
> 
> Being conservative, there are probably at least 10 enhancement patches
> and 30 bug fix patches for every performance patch. Performance patches
> are considered enhancements and only go in new versions with
> enhancements, where they go through the extended alpha, beta, candidate
> test and evaluation process.
> 
> In the unicode case, Jim discovered that find was several times slower
> in 3.3 than 3.2 and claimed that that was a reason to not use 3.3. I ran
> the complete stringbench.py and discovered that find (and consequently
> find and replace) are the only operations with such a slowdown. I also
> discovered that another at least as common operation, encoding strings
> that only contain ascii characters to ascii bytes for transmission, is
> several times as fast in 3.3. So I reported that unless one is only
> finding substrings in long strings, there is no reason to not upgrade to
> 3.3.
> 
> -- 
> Terry Jan Reedy

--------

I showed a case where Py33 works 10 times slower than Py32:
"replace". You, the devs, spend your time correcting that case.

Now, if I put on the table an example working 20 times
slower, will you spend your time optimizing that?

I'm afraid it is the FSR which is problematic, not the
corner cases.

jmf
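
For readers unfamiliar with the FSR (flexible string representation) discussed above, here is a minimal sketch of how its per-character storage can be observed. It assumes CPython 3.3 or later; `bytes_per_char` is a hypothetical helper written for this illustration, and `sys.getsizeof` results are implementation details, not guarantees of the language.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character (hypothetical helper).

    Differencing the sizes of two strings of different lengths
    cancels out the fixed per-object overhead.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n

latin1 = bytes_per_char("a")           # fits in Latin-1
bmp = bytes_per_char("\u20ac")         # EURO SIGN, BMP but not Latin-1
astral = bytes_per_char("\U0001d407")  # non-BMP character

print(latin1, bmp, astral)
```

On a CPython build with the FSR the three figures should come out in increasing order (nominally 1, 2 and 4 bytes per character), which is the trade-off the thread is arguing about: narrower storage for narrower strings, at the cost of choosing a width per string.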


#35633

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-12-27 21:00 +0200
Message-ID	<mailman.1354.1356634864.29569.python-list@python.org>
In reply to	#35130

On 19.12.12 17:40, Chris Angelico wrote:
> Interestingly, IDLE on my Windows box can't handle the bolded
> characters very well...
>
>>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
>>>> print(s)
> Traceback (most recent call last):
>    File "<pyshell#2>", line 1, in <module>
>      print(s)
> UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407'
> in position 0: Non-BMP character not supported in Tk
>
> I think this is most likely a case of "yeah, Windows XP just sucks".
> But I have no reason or inclination to get myself a newer Windows to
> find out if it's any different.

No, this is a Tcl/Tk limitation (I don't know if this was fixed in 8.6).
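
A minimal sketch of what makes that string a problem for a UCS-2-only consumer such as the Tk build in the traceback: every bold letter lies outside the Basic Multilingual Plane (code point above U+FFFF). The helper names `non_bmp_chars` and `replace_non_bmp` are hypothetical, introduced only for this illustration; substituting U+FFFD is one possible workaround, not something the posters proposed.

```python
def non_bmp_chars(text):
    """Return (index, character) pairs for code points above U+FFFF."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

def replace_non_bmp(text, placeholder="\ufffd"):
    """Swap non-BMP characters for a placeholder a UCS-2 consumer can show."""
    return "".join(ch if ord(ch) <= 0xFFFF else placeholder for ch in text)

s = ("\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, "
     "\U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!")

offenders = non_bmp_chars(s)
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units needed
safe = replace_non_bmp(s)

print(len(s), len(offenders), utf16_units)
```

Each non-BMP character needs a surrogate pair (two 16-bit units) in UTF-16, which is why the count of code units exceeds the count of characters; UCS-2 has no surrogate pairs, so those characters simply cannot be encoded.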


#35637

From	wxjmfauth@gmail.com
Date	2012-12-27 11:36 -0800
Message-ID	<9e40c8de-d4cf-4a64-800d-97caa399bc0a@googlegroups.com>
In reply to	#35633

On Thursday, 27 December 2012 20:00:37 UTC+1, Serhiy Storchaka wrote:
> On 19.12.12 17:40, Chris Angelico wrote:
> > Interestingly, IDLE on my Windows box can't handle the bolded
> > characters very well...
> >
> >>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
> >>>> print(s)
> > Traceback (most recent call last):
> >    File "<pyshell#2>", line 1, in <module>
> >      print(s)
> > UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407'
> > in position 0: Non-BMP character not supported in Tk
> >
> > I think this is most likely a case of "yeah, Windows XP just sucks".
> > But I have no reason or inclination to get myself a newer Windows to
> > find out if it's any different.
> 
> No, this is a Tcl/Tk limitation (I don't know if this was fixed in 8.6).

-----


This is a strange error message. Remember: a coding scheme
covers a *set of characters*.
The guilty code point corresponds to a character which
is not part of the UCS-2 character set!

jmf


#35133

From	Christian Heimes <christian@python.org>
Date	2012-12-19 16:33 +0100
Message-ID	<mailman.1056.1355931232.29569.python-list@python.org>
In reply to	#35115

On 19.12.2012 16:01, Stefan Krah wrote:
> The uppercase ß isn't really needed, since ß does not occur at the beginning
> of a word. As far as I know, most Germans wouldn't even know that it has
> existed at some point or how to write it.

I think Python 3.3+ is using uppercase mapping (uc) instead of simple
upper case (suc).


Some background:

The old German Fraktur has three variants of the letter S:

 capital s: S
 long s: ſ
 round s: s.

ß is a ligature of ſs. ſ is usually used at the beginning or middle of a
syllable while s is used at the end of a syllable. Compare Wachſtube
(Wach-Stube == guard room) to Wachstube (Wachs-Tube == tube of wax). :)

Christian
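
A short sketch of the full uppercase mapping in practice, assuming Python 3.3 or later: `str.upper()` may change the length of a string, because ß has no single-character uppercase in the default mapping and expands to "SS". The variable names are illustrative only.

```python
word = "stra\u00dfe"          # "straße"
upper = word.upper()          # full mapping: ß -> SS

long_s_upper = "\u017f".upper()       # long s (ſ) uppercases to plain S
capital_sharp_lower = "\u1e9e".lower()  # capital sharp s (ẞ) lowercases to ß

print(upper, len(word), len(upper))
print(upper.lower() == word)  # the round trip does not restore ß
```

Because the mapping is not reversible, `upper().lower()` gives "strasse" rather than "straße"; code that assumes case conversion preserves length or round-trips will misbehave on such input.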


#35767

From	wxjmfauth@gmail.com
Date	2012-12-29 11:16 -0800
Message-ID	<c3556bf7-994a-4050-aa2a-461fe362d53f@googlegroups.com>
In reply to	#35133

On Wednesday, 19 December 2012 16:33:50 UTC+1, Christian Heimes wrote:
> I think Python 3.3+ is using uppercase mapping (uc) instead of simple
> upper case (suc).

I think you are thinking correctly. This is a clever answer.

Note: I do not care about the uc / suc choice. As long as
there is consistency, I'm fine with the choice. Anyway, the
only valid "programming technique" in that field is to create
a dedicated lib for a given script (esp. French!)

jmf



> Some background:
> 
> The old German Fraktur has three variants of the letter S:
> 
>  capital s: S
>  long s: ſ
>  round s: s.
> 
> ß is a ligature of ſs. ſ is usually used at the beginning or middle of a
> syllable while s is used at the end of a syllable. Compare Wachſtube
> (Wach-Stube == guard room) to Wachstube (Wachs-Tube == tube of wax). :)
> 
> Christian

