
Py 3.3, unicode / upper()

Started by	wxjmfauth@gmail.com
First post	2012-12-19 06:23 -0800
Last post	2012-12-20 17:34 -0700
Articles	20 on this page of 47 — 13 participants


  Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 06:23 -0800
    Re: Py 3.3, unicode / upper() Thomas Bach <thbach@students.uni-mainz.de> - 2012-12-19 15:43 +0100
    Re: Py 3.3, unicode / upper() Christian Heimes <christian@python.org> - 2012-12-19 15:52 +0100
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 12:55 -0800
        Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 14:23 -0700
          Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:42 -0800
          Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:42 -0800
        Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 13:01 +1100
        Re: Py 3.3, unicode / upper() Westley Martínez <anikom15@gmail.com> - 2012-12-19 18:53 -0800
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 12:55 -0800
    Re: Py 3.3, unicode / upper() Stefan Krah <stefan-usenet@bytereef.org> - 2012-12-19 16:01 +0100
    Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 02:17 +1100
    Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-19 16:18 +0100
      Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-19 16:22 +0100
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 02:40 +1100
        Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-20 15:57 +0100
      Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 11:27 -0700
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 13:18 -0800
          Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 14:31 -0700
            Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:40 -0800
              Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:48 -0500
              Re: Py 3.3, unicode / upper() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-20 22:51 +0000
            Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:40 -0800
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 13:18 -0800
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-19 19:39 -0500
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 13:03 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-19 21:54 -0500
      Re: Py 3.3, unicode / upper() Westley Martínez <anikom15@gmail.com> - 2012-12-19 19:12 -0800
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 14:22 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 00:32 -0500
        Re: Py 3.3, unicode / upper() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-20 05:51 +0000
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:57 -0800
          Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:30 -0500
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:57 -0800
      Re: Py 3.3, unicode / upper() Serhiy Storchaka <storchaka@gmail.com> - 2012-12-27 21:00 +0200
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-27 11:36 -0800
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-27 11:36 -0800
    Re: Py 3.3, unicode / upper() Christian Heimes <christian@python.org> - 2012-12-19 16:33 +0100
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-29 11:16 -0800
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-29 11:16 -0800
    Re: Py 3.3, unicode / upper() Benjamin Peterson <benjamin@python.org> - 2012-12-19 20:25 +0000
    Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:19 -0800
      Re: Py 3.3, unicode / upper() MRAB <python@mrabarnett.plus.com> - 2012-12-20 20:20 +0000
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-21 08:19 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:12 -0500
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:59 -0500
      Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-20 17:34 -0700


#35115 — Py 3.3, unicode / upper()

From	wxjmfauth@gmail.com
Date	2012-12-19 06:23 -0800
Subject	Py 3.3, unicode / upper()
Message-ID	<2adb4a25-8ea3-441f-b8c0-ee6c87e4b19f@googlegroups.com>

I was using the German word "Straße" (Strasse), the German
translation of "street", to illustrate the catastrophic and
completely wrong-by-design Unicode handling in Py3.3, this
time from a memory point of view (not speed):

>>> sys.getsizeof('Straße')
43
>>> sys.getsizeof('STRAẞE')
50

instead of a sane (Py3.2)

>>> sys.getsizeof('Straße')
42
>>> sys.getsizeof('STRAẞE')
42


But, this is not the problem.
I was surprised to discover this:

>>> 'Straße'.upper()
'STRASSE'

I really, really do not know what I should think about that.
(It is a complex subject.) And the real question is why?

jmf


#35124

From	Thomas Bach <thbach@students.uni-mainz.de>
Date	2012-12-19 15:43 +0100
Message-ID	<mailman.1050.1355928242.29569.python-list@python.org>
In reply to	#35115

On Wed, Dec 19, 2012 at 06:23:00AM -0800, wxjmfauth@gmail.com wrote:
> I was surprised to discover this:
> 
> >>> 'Straße'.upper()
> 'STRASSE'
> 
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

Because there is no definition for upper-case 'ß'. 'SS' is used as the
common replacement in this case. I think it's pretty smart! :)

Regards,
	Thomas.


#35125

From	Christian Heimes <christian@python.org>
Date	2012-12-19 15:52 +0100
Message-ID	<mailman.1051.1355928746.29569.python-list@python.org>
In reply to	#35115

On 19.12.2012 15:23, wxjmfauth@gmail.com wrote:
> But, this is not the problem.
> I was surprised to discover this:
> 
> >>> 'Straße'.upper()
> 'STRASSE'
> 
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

It's correct. LATIN SMALL LETTER SHARP S doesn't have an upper case
form. However, the Unicode database specifies an upper case mapping from
ß to SS. http://codepoints.net/U+00DF

Christian
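
The mapping described in this post can be checked directly. A minimal sketch, assuming Python 3.3 or later, where str.upper() applies the full case mapping; the variable names are only illustrative:

```python
import unicodedata

sharp_s = "\u00df"  # ß

# The character's official name in the Unicode database.
print(unicodedata.name(sharp_s))

# str.upper() may return more than one character for a single input
# character: the uppercase mapping of ß is the two-letter string "SS".
upper_form = sharp_s.upper()
print(upper_form, len(upper_form))

word = "Stra\u00dfe"  # Straße
print(word.upper())
```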


#35155

From	wxjmfauth@gmail.com
Date	2012-12-19 12:55 -0800
Message-ID	<890ee58d-e93e-42ac-b17e-59b05c6ecacb@googlegroups.com>
In reply to	#35125

On Wednesday, 19 December 2012 15:52:23 UTC+1, Christian Heimes wrote:
> On 19.12.2012 15:23, wxjmfauth@gmail.com wrote:
> > But, this is not the problem.
> > I was surprised to discover this:
> >
> > >>> 'Straße'.upper()
> > 'STRASSE'
> >
> > I really, really do not know what I should think about that.
> > (It is a complex subject.) And the real question is why?
>
> It's correct. LATIN SMALL LETTER SHARP S doesn't have an upper case
> form. However the unicode database specifies an upper case mapping from
> ß to SS. http://codepoints.net/U+00DF
>
> Christian

-----

Yes, it is correct (or can be considered as correct).
I do not wish to discuss the typographical problem
of "Das Grosse Eszett". The web is full of pages on the
subject. However, I never succeeded in finding an "official
position" from Unicode. The best information I found seems
to indicate (to converge on) U+1E9E as the now "supported"
uppercase form of U+00DF (see DIN).

What is bothering me is more the implementation. The Unicode
documentation says roughly this: if something cannot be
honoured, there is no harm, but do not implement a workaround.
In that case, I'm not sure Python is doing the best thing.

Even if "wrong", this can be considered as programmatically correct
or logically acceptable (Py3.2)

>>> 'Straße'.upper().lower().capitalize() == 'Straße'
True

while this will *always* be problematic (Py3.3)

>>> 'Straße'.upper().lower().capitalize() == 'Straße'
False

jmf
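
The round trip can be broken into its steps to see where the ß is lost. A small sketch, assuming Python 3.3 or later; the variable names are illustrative only:

```python
word = "Stra\u00dfe"  # Straße

upper = word.upper()              # ß expands to "SS"
lower = upper.lower()             # "SS" lowercases to "ss", not back to ß
capitalized = lower.capitalize()

for step in (word, upper, lower, capitalized):
    print(step, len(step))

# The expansion is one-way, so the round trip does not restore the input.
print(capitalized == word)
```

The length grows from 6 to 7 at the first step, and nothing in the later steps can tell that the "SS" came from a single ß.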


#35159

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-12-19 14:23 -0700
Message-ID	<mailman.1074.1355952227.29569.python-list@python.org>
In reply to	#35155

On Wed, Dec 19, 2012 at 1:55 PM,  <wxjmfauth@gmail.com> wrote:
> Yes, it is correct (or can be considered as correct).
> I do not wish to discuss the typographical problematic
> of "Das Grosse Eszett". The web is full of pages on the
> subject. However, I never succeeded to find an "official
> position" from Unicode. The best information I found seem
> to indicate (to converge), U+1E9E is now the "supported"
> uppercase form of U+00DF. (see DIN).

Is this link not official?

http://unicode.org/cldr/utility/character.jsp?a=00DF

That defines a full uppercase mapping to SS and a simple uppercase
mapping to U+00DF itself, not U+1E9E.  My understanding of the simple
mapping is that it is not allowed to map to multiple characters,
whereas the full mapping is so allowed.

> What is bothering me, is more the implementation. The Unicode
> documentation says roughly this: if something can not be
> honoured, there is no harm, but do not implement a workaround.
> In that case, I'm not sure Python is doing the best.

But this behavior is per the specification, not a workaround.  I think
the worst thing we could do in this regard would be to start diverging
from the specification because we think we know better than the
Unicode Consortium.

> If "wrong", this can be considered as programmatically correct
> or logically acceptable (Py3.2)
>
> >>> 'Straße'.upper().lower().capitalize() == 'Straße'
> True
>
> while this will *always* be problematic (Py3.3)
>
> >>> 'Straße'.upper().lower().capitalize() == 'Straße'
> False

On the other hand (Py3.2):

>>> 'Straße'.upper().isupper()
False

vs. Py3.3:

>>> 'Straße'.upper().isupper()
True

There is probably no one clearly correct way to handle the problem,
but personally this contradiction bothers me more than the example
that you posted.
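
The difference between the two mappings can be sketched in Python. str.upper() in 3.3 applies the full mapping; `simple_upper` below is a hypothetical helper that only approximates the simple mapping by refusing any expansion to multiple characters. It does not read the real simple-mapping table, so it will differ from it for some characters.

```python
def simple_upper(text):
    """Approximate a one-to-one uppercase mapping (hypothetical helper).

    A character is left unchanged whenever its full uppercase mapping
    would expand to more than one character.
    """
    result = []
    for ch in text:
        up = ch.upper()
        result.append(up if len(up) == 1 else ch)
    return "".join(result)


word = "Stra\u00dfe"  # Straße

full = word.upper()          # full mapping: ß -> SS
simple = simple_upper(word)  # one-to-one approximation: ß stays ß

print(full, full.isupper())
print(simple, simple.isupper())
```

With the one-to-one approximation the length is preserved but the result still contains a lowercase letter, which is the isupper() contradiction shown for 3.2.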


#35215

From	wxjmfauth@gmail.com
Date	2012-12-20 11:42 -0800
Message-ID	<e32e429f-82a2-4799-b403-5e3d2f1f35f6@googlegroups.com>
In reply to	#35159

On Wednesday, 19 December 2012 22:23:15 UTC+1, Ian wrote:
> On Wed, Dec 19, 2012 at 1:55 PM,  <wxjmfauth@gmail.com> wrote:
> > Yes, it is correct (or can be considered as correct).
> > I do not wish to discuss the typographical problematic
> > of "Das Grosse Eszett". The web is full of pages on the
> > subject. However, I never succeeded to find an "official
> > position" from Unicode. The best information I found seem
> > to indicate (to converge), U+1E9E is now the "supported"
> > uppercase form of U+00DF. (see DIN).
>
> Is this link not official?
>
> http://unicode.org/cldr/utility/character.jsp?a=00DF
>
> That defines a full uppercase mapping to SS and a simple uppercase
> mapping to U+00DF itself, not U+1E9E.  My understanding of the simple
> mapping is that it is not allowed to map to multiple characters,
> whereas the full mapping is so allowed.
>
> > What is bothering me, is more the implementation. The Unicode
> > documentation says roughly this: if something can not be
> > honoured, there is no harm, but do not implement a workaround.
> > In that case, I'm not sure Python is doing the best.
>
> But this behavior is per the specification, not a workaround.  I think
> the worst thing we could do in this regard would be to start diverging
> from the specification because we think we know better than the
> Unicode Consortium.
>
> > If "wrong", this can be considered as programmatically correct
> > or logically acceptable (Py3.2)
> >
> > >>> 'Straße'.upper().lower().capitalize() == 'Straße'
> > True
> >
> > while this will *always* be problematic (Py3.3)
> >
> > >>> 'Straße'.upper().lower().capitalize() == 'Straße'
> > False
>
> On the other hand (Py3.2):
>
> >>> 'Straße'.upper().isupper()
> False
>
> vs. Py3.3:
>
> >>> 'Straße'.upper().isupper()
> True
>
> There is probably no one clearly correct way to handle the problem,
> but personally this contradiction bothers me more than the example
> that you posted.

----

At least we agree that this very special case is problematic.

jmf


#35172

From	Chris Angelico <rosuav@gmail.com>
Date	2012-12-20 13:01 +1100
Message-ID	<mailman.1083.1355968896.29569.python-list@python.org>
In reply to	#35155

On Thu, Dec 20, 2012 at 8:23 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Wed, Dec 19, 2012 at 1:55 PM,  <wxjmfauth@gmail.com> wrote:
>> Yes, it is correct (or can be considered as correct).
>> I do not wish to discuss the typographical problematic
>> of "Das Grosse Eszett". The web is full of pages on the
>> subject. However, I never succeeded to find an "official
>> position" from Unicode. The best information I found seem
>> to indicate (to converge), U+1E9E is now the "supported"
>> uppercase form of U+00DF. (see DIN).
>
> Is this link not official?
>
> http://unicode.org/cldr/utility/character.jsp?a=00DF
>
> That defines a full uppercase mapping to SS and a simple uppercase
> mapping to U+00DF itself, not U+1E9E.  My understanding of the simple
> mapping is that it is not allowed to map to multiple characters,
> whereas the full mapping is so allowed.

Ahh, thanks, that explains why the other Unicode-aware language I
tried behaved differently.

Pike v7.9 release 5 running Hilfe v3.5 (Incremental Pike Frontend)
> string s="Stra\u00dfe";
> upper_case(s);
(1) Result: "STRA\337E"
> lower_case(upper_case(s));
(2) Result: "stra\337e"
> String.capitalize(lower_case(s));
(3) Result: "Stra\337e"

The output is the equivalent of repr(), and it uses octal escapes
where possible (for brevity), so \337 is its representation of U+00DF
(decimal 223, octal 337). Upper-casing and lower-casing this character
result in the same thing.

> write("Original: %s\nLower: %s\nUpper: %s\n",s,lower_case(s),upper_case(s));
Original: Straße
Lower: straße
Upper: STRAßE

It's worth noting, incidentally, that the unusual upper-case form of
the letter (U+1E9E) does lower-case to U+00DF in both Python 3.3 and
Pike 7.9.5:

> lower_case("Stra\u1E9Ee");
(9) Result: "stra\337e"

>>> ord("\u1e9e".lower())
223

So both of them are behaving in a compliant manner, even though
they're not quite identical.

ChrisA
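
The asymmetry noted in this post can be checked in Python 3.3 or later: the capital sharp s lowercases to ß, but ß does not uppercase to it. A minimal sketch; the constant names are illustrative only:

```python
import unicodedata

CAPITAL_SHARP_S = "\u1e9e"  # ẞ
SMALL_SHARP_S = "\u00df"    # ß

print(unicodedata.name(CAPITAL_SHARP_S))

# Lowercasing the capital form gives the ordinary ß (code point 223).
print(ord(CAPITAL_SHARP_S.lower()))

# Uppercasing ß gives "SS", not the capital form.
print(SMALL_SHARP_S.upper() == CAPITAL_SHARP_S)
```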


#35176

From	Westley Martínez <anikom15@gmail.com>
Date	2012-12-19 18:53 -0800
Message-ID	<mailman.1087.1355972019.29569.python-list@python.org>
In reply to	#35155

On Wed, Dec 19, 2012 at 02:23:15PM -0700, Ian Kelly wrote:
> On Wed, Dec 19, 2012 at 1:55 PM,  <wxjmfauth@gmail.com> wrote: 
> > If "wrong", this can be considered as programmatically correct
> > or logically acceptable (Py3.2)
> >
> > >>> 'Straße'.upper().lower().capitalize() == 'Straße'
> > True
> >
> > while this will *always* be problematic (Py3.3)
> >
> > >>> 'Straße'.upper().lower().capitalize() == 'Straße'
> > False
> 
> On the other hand (Py3.2):
> 
> >>> 'Straße'.upper().isupper()
> False
> 
> vs. Py3.3:
> 
> >>> 'Straße'.upper().isupper()
> True
> 
> There is probably no one clearly correct way to handle the problem,
> but personally this contradiction bothers me more than the example
> that you posted.

Why would it ever be wrong for 'Straße' to not equal 'Strasse'?  Python
is not intended to do any sort of advanced linguistic processing.  It is
comparing strings not words.  It is not problematic.  It makes sense.
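
When the goal is a caseless comparison rather than a display form, a round trip through upper() and lower() is not the right tool. A minimal sketch using str.casefold(), which was added in Python 3.3; `caseless_equal` is a hypothetical helper name, and full caseless matching per the Unicode standard would also involve normalization, which is left out here:

```python
def caseless_equal(a, b):
    """Compare two strings ignoring case (hypothetical helper)."""
    return a.casefold() == b.casefold()


# As plain strings the two spellings differ...
print("Stra\u00dfe" == "Strasse")

# ...but they are equal after case folding, since ß folds to "ss".
print(caseless_equal("Stra\u00dfe", "STRASSE"))
print(caseless_equal("Stra\u00dfe", "Strasse"))
```

Plain == stays a code point comparison, as the post says; folding is an explicit, separate step the caller opts into.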


#35128

From	Stefan Krah <stefan-usenet@bytereef.org>
Date	2012-12-19 16:01 +0100
Message-ID	<mailman.1053.1355929700.29569.python-list@python.org>
In reply to	#35115

wxjmfauth@gmail.com <wxjmfauth@gmail.com> wrote:
> But, this is not the problem.
> I was surprised to discover this:
> 
> >>> 'Straße'.upper()
> 'STRASSE'
> 
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

http://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F#Versalsatz_ohne_gro.C3.9Fes_.C3.9F

"The current official rules[6] of the new German orthography do not
 recognize a capital letter for ß: every letter exists as a
 lowercase and as an uppercase letter (exception: ß). For text set in
 all capitals, the rules recommend replacing ß with SS: when writing in
 capital letters, one writes SS, for example: Straße -- STRASSE."

According to the new official spelling rules the uppercase ß does not exist.
The recommendation is to use "SS" when writing in all-caps.

As to why: It has always been acceptable to replace ß with "ss" when ß
wasn't part of a character set. In the new spelling rules, ß has been
officially replaced with "ss" in some cases:

http://en.wiktionary.org/wiki/da%C3%9F

The uppercase ß isn't really needed, since ß does not occur at the beginning
of a word. As far as I know, most Germans wouldn't even know that it has
existed at some point or how to write it.

Stefan Krah


#35129

From	Chris Angelico <rosuav@gmail.com>
Date	2012-12-20 02:17 +1100
Message-ID	<mailman.1054.1355930264.29569.python-list@python.org>
In reply to	#35115

On Thu, Dec 20, 2012 at 1:23 AM,  <wxjmfauth@gmail.com> wrote:
> But, this is not the problem.
> I was surprised to discover this:
>
> >>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

Not all strings can be uppercased and lowercased cleanly. Please stop
trotting out the old Box Hill-to-Camberwell arguments[1] yet again.

For comparison, try this string:

'𝐇𝐞𝐥𝐥𝐨, 𝐰𝐨𝐫𝐥𝐝!'.upper()

And while you're at it, check out sys.getsizeof() on that sort of
string, compare your beloved 3.2 on that. Oh, and also check out len()
on it.

[1] Melbourne's current ticketing system is based on zones, and
Camberwell is in zone 1, and Box Hill in zone 2. Detractors of public
transport point out that it costs far more to take the train from Box
Hill to Camberwell than it does to drive a car the same distance. It's
the same contrived example that keeps on getting trotted out time and
time again.

ChrisA
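
The suggested comparison can be sketched as follows, assuming Python 3.3 or later. The UTF-16 code unit count is computed explicitly to show what len() would report if each surrogate pair were counted as two; it is not measured on a 3.2 narrow build here.

```python
s = ("\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, "
     "\U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!")

# Mathematical bold letters have no case mappings, so upper() is a no-op.
print(s.upper() == s)

# Number of code points: what len() returns on 3.3.
code_points = len(s)

# Number of UTF-16 code units: each non-BMP character needs a surrogate pair.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)
```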


#35130

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-12-19 16:18 +0100
Message-ID	<kaslsb$iue$1@news.albasani.net>
In reply to	#35115

On 19.12.2012 15:23, wxjmfauth@gmail.com wrote:
> I was using the German word "Straße" (Strasse) — German
> translation from "street" — to illustrate the catastrophic and 
> completely wrong-by-design Unicode handling in Py3.3, this
> time from a memory point of view (not speed):
> 
> >>> sys.getsizeof('Straße')
> 43
> >>> sys.getsizeof('STRAẞE')
> 50
> 
> instead of a sane (Py3.2)
> 
> >>> sys.getsizeof('Straße')
> 42
> >>> sys.getsizeof('STRAẞE')
> 42

How do those arbitrary numbers prove anything at all? Why do you draw
the conclusion that it's broken by design? What do you expect? You're
very vague here. Just to show how ridiculously pointless your numbers
are, your example gives 84 on Python3.2 for any input of yours.

> But, this is not the problem.
> I was surprised to discover this:
> 
> >>> 'Straße'.upper()
> 'STRASSE'
> 
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

Because in the German language the uppercase "ß" is virtually dead.

Regards,
Johannes

-- 
>> Where EXACTLY had you predicted the quake again?
> At least not publicly!
Ah, the latest and to date most brilliant stunt of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>


#35131

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-12-19 16:22 +0100
Message-ID	<kasm3a$iue$2@news.albasani.net>
In reply to	#35130

On 19.12.2012 16:18, Johannes Bauer wrote:

> How do those arbitrary numbers prove anything at all? Why do you draw
> the conclusion that it's broken by design? What do you expect? You're
> very vague here. Just to show how ridiculously pointless your numbers
> are, your example gives 84 on Python3.2 for any input of yours.

...on Python3.2 on MY system is what I meant to say (x86_64 Linux). Sorry.

Also, further reading:

http://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F
http://en.wikipedia.org/wiki/Capital_%E1%BA%9E

Regards,
Johannes



#35134

From	Chris Angelico <rosuav@gmail.com>
Date	2012-12-20 02:40 +1100
Message-ID	<mailman.1057.1355931653.29569.python-list@python.org>
In reply to	#35130

On Thu, Dec 20, 2012 at 2:18 AM, Johannes Bauer <dfnsonfsduifb@gmx.de> wrote:
> On 19.12.2012 15:23, wxjmfauth@gmail.com wrote:
>> I was using the German word "Straße" (Strasse) — German
>> translation from "street" — to illustrate the catastrophic and
>> completely wrong-by-design Unicode handling in Py3.3, this
>> time from a memory point of view (not speed):
>>
>> >>> sys.getsizeof('Straße')
>> 43
>> >>> sys.getsizeof('STRAẞE')
>> 50
>>
>> instead of a sane (Py3.2)
>>
>> >>> sys.getsizeof('Straße')
>> 42
>> >>> sys.getsizeof('STRAẞE')
>> 42
>
> How do those arbitrary numbers prove anything at all? Why do you draw
> the conclusion that it's broken by design? What do you expect? You're
> very vague here. Just to show how ridiculously pointless your numbers
> are, your example gives 84 on Python3.2 for any input of yours.

You may not be familiar with jmf. He's one of our resident trolls, and
he has a bee in his bonnet about PEP 393 strings, on the basis that
they take up more space in memory than a narrow build of Python 3.2
would, for a string with lots of BMP characters and one non-BMP. In
3.2 narrow builds, strings were stored in UTF-16, with *surrogate
pairs* for non-BMP characters. This means that len() counts them
twice, as does string indexing/slicing. That's a major bug, especially
as your Python code will do different things on different platforms -
most Linux builds of 3.2 are "wide" builds, storing characters in four
bytes each.

PEP 393 brings wide build semantics to all Pythons, while achieving
memory savings better than a narrow build can (with PEP 393 strings,
any all-ASCII or all-Latin-1 strings will be stored one byte per
character). Every now and then, though, jmf points out *yet again*
that his beloved and buggy narrow build consumes less memory and runs
faster than the oh so terrible 3.3 on some contrived example. It gets
rather tiresome.

Interestingly, IDLE on my Windows box can't handle the bolded
characters very well...

>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
>>> print(s)
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    print(s)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407'
in position 0: Non-BMP character not supported in Tk

I think this is most likely a case of "yeah, Windows XP just sucks".
But I have no reason or inclination to get myself a newer Windows to
find out if it's any different.

ChrisA
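
The storage rule described in this post, where the widest character in a string decides the width of every character, can be observed with sys.getsizeof(). A rough sketch, assuming CPython 3.3 or later (other implementations may store strings differently); `bytes_per_added_char` is a hypothetical helper:

```python
import sys


def bytes_per_added_char(prefix, n=100):
    """Measure how many bytes each extra ASCII 'a' costs after `prefix`."""
    short = prefix + "a" * n
    longer = prefix + "a" * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n


for label, prefix in [
    ("ASCII only", ""),
    ("one Latin-1 char", "\u00e9"),
    ("one BMP char", "\u0100"),
    ("one non-BMP char", "\U0001d407"),
]:
    print(label, bytes_per_added_char(prefix))
```

Measuring the growth between two long strings cancels out the fixed per-object overhead, so only the per-character width remains.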


#35200

From	Johannes Bauer <dfnsonfsduifb@gmx.de>
Date	2012-12-20 15:57 +0100
Message-ID	<kav91f$fh0$1@news.albasani.net>
In reply to	#35134

On 19.12.2012 16:40, Chris Angelico wrote:

> You may not be familiar with jmf. He's one of our resident trolls, and
> he has a bee in his bonnet about PEP 393 strings, on the basis that
> they take up more space in memory than a narrow build of Python 3.2
> would, for a string with lots of BMP characters and one non-BMP.

I was not :-( Thanks for the heads up and the good summary on what the
issue was about.

Best regards,
Johannes



#35147

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-12-19 11:27 -0700
Message-ID	<mailman.1068.1355941696.29569.python-list@python.org>
In reply to	#35130

On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <rosuav@gmail.com> wrote:
> You may not be familiar with jmf. He's one of our resident trolls, and
> he has a bee in his bonnet about PEP 393 strings, on the basis that
> they take up more space in memory than a narrow build of Python 3.2
> would, for a string with lots of BMP characters and one non-BMP. In
> 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> pairs* for non-BMP characters. This means that len() counts them
> twice, as does string indexing/slicing. That's a major bug, especially
> as your Python code will do different things on different platforms -
> most Linux builds of 3.2 are "wide" builds, storing characters in four
> bytes each.

From what I've been able to discern, his actual complaint about PEP
393 stems from misguided moral concerns.  With PEP-393, strings that
can be fully represented in Latin-1 can be stored in half the space
(ignoring fixed overhead) compared to strings containing at least one
non-Latin-1 character.  jmf thinks this optimization is unfair to
non-English users and immoral; he wants Latin-1 strings to be treated
exactly like non-Latin-1 strings (I don't think he actually cares
about non-BMP strings at all; if narrow-build Unicode is good enough
for him, then it must be good enough for everybody).  Unfortunately
for him, the Latin-1 optimization is rather trivial in the wider
context of PEP-393, and simply removing that part alone clearly
wouldn't be doing anybody any favors.  So for him to get what he
wants, the entire PEP has to go.

It's rather like trying to solve the problem of wealth disparity by
forcing everyone to dump their excess wealth into the ocean.


#35157

From	wxjmfauth@gmail.com
Date	2012-12-19 13:18 -0800
Message-ID	<1fb2010e-73e4-4025-bb93-12ce7992ddab@googlegroups.com>
In reply to	#35147

On Wednesday, 19 December 2012 19:27:38 UTC+1, Ian wrote:
> On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <rosuav@gmail.com> wrote:
> > You may not be familiar with jmf. He's one of our resident trolls, and
> > he has a bee in his bonnet about PEP 393 strings, on the basis that
> > they take up more space in memory than a narrow build of Python 3.2
> > would, for a string with lots of BMP characters and one non-BMP. In
> > 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> > pairs* for non-BMP characters. This means that len() counts them
> > twice, as does string indexing/slicing. That's a major bug, especially
> > as your Python code will do different things on different platforms -
> > most Linux builds of 3.2 are "wide" builds, storing characters in four
> > bytes each.
>
> From what I've been able to discern, his actual complaint about PEP
> 393 stems from misguided moral concerns.  With PEP-393, strings that
> can be fully represented in Latin-1 can be stored in half the space
> (ignoring fixed overhead) compared to strings containing at least one
> non-Latin-1 character.  jmf thinks this optimization is unfair to
> non-English users and immoral; he wants Latin-1 strings to be treated
> exactly like non-Latin-1 strings (I don't think he actually cares
> about non-BMP strings at all; if narrow-build Unicode is good enough
> for him, then it must be good enough for everybody).  Unfortunately
> for him, the Latin-1 optimization is rather trivial in the wider
> context of PEP-393, and simply removing that part alone clearly
> wouldn't be doing anybody any favors.  So for him to get what he
> wants, the entire PEP has to go.
>
> It's rather like trying to solve the problem of wealth disparity by
> forcing everyone to dump their excess wealth into the ocean.

----

Latin-1 (ISO-8859-1)? Are you sure?

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ab')
27
>>> sys.getsizeof('aé')
39

Time to go to bed. More complete answer tomorrow.

jmf


#35160

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-12-19 14:31 -0700
Message-ID	<mailman.1075.1355952735.29569.python-list@python.org>
In reply to	#35157

On Wed, Dec 19, 2012 at 2:18 PM,  <wxjmfauth@gmail.com> wrote:
> latin-1 (iso-8859-1) ? are you sure ?

Yes.

> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('ab')
> 27
> >>> sys.getsizeof('aé')
> 39

Compare to:

>>> sys.getsizeof('a\u0100')
42

The reason for the difference you posted is that pure ASCII strings
have a further optimization, which I glossed over and which is purely
a savings in overhead:

>>> sys.getsizeof('abcde') - sys.getsizeof('a')
4
>>> sys.getsizeof('ábçdê') - sys.getsizeof('á')
4
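
The split between fixed overhead and per-character cost can be separated out numerically. A rough sketch, assuming CPython 3.3 or later; the absolute numbers vary by version and platform, and `size_profile` is a hypothetical helper:

```python
import sys


def size_profile(ch, n=50):
    """Return (fixed_overhead, bytes_per_char) for strings made of `ch`."""
    short = sys.getsizeof(ch * n)
    longer = sys.getsizeof(ch * (2 * n))
    per_char = (longer - short) // n
    overhead = short - per_char * n
    return overhead, per_char


ascii_overhead, ascii_width = size_profile("a")
latin1_overhead, latin1_width = size_profile("\u00e9")  # é
bmp_overhead, bmp_width = size_profile("\u0100")        # Ā

print(ascii_overhead, ascii_width)
print(latin1_overhead, latin1_width)
print(bmp_overhead, bmp_width)
```

ASCII and Latin-1 strings grow by the same amount per character; only the fixed overhead differs, which is what the 'ab' versus 'aé' comparison picked up.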


#35214

From	wxjmfauth@gmail.com
Date	2012-12-20 11:40 -0800
Message-ID	<01f08872-fd5f-40ba-9abd-61b01d05c42a@googlegroups.com>
In reply to	#35160

On Wednesday, 19 December 2012 22:31:42 UTC+1, Ian wrote:
> On Wed, Dec 19, 2012 at 2:18 PM,  <wxjmfauth@gmail.com> wrote:
> > latin-1 (iso-8859-1) ? are you sure ?
>
> Yes.
>
> > >>> sys.getsizeof('a')
> > 26
> > >>> sys.getsizeof('ab')
> > 27
> > >>> sys.getsizeof('aé')
> > 39
>
> Compare to:
>
> >>> sys.getsizeof('a\u0100')
> 42
>
> The reason for the difference you posted is that pure ASCII strings
> have a further optimization, which I glossed over and which is purely
> a savings in overhead:
>
> >>> sys.getsizeof('abcde') - sys.getsizeof('a')
> 4
> >>> sys.getsizeof('ábçdê') - sys.getsizeof('á')
> 4

-----

I know all of this. And this is exactly what I explained.
I do not care about this optimization. I'm not an ASCII user.
As a non-ASCII user, this optimization is just irrelevant.

What should a Python user think if he sees his strings
are consuming more memory just because he uses non-ASCII
characters, or he sees his strings are changing just because
he "uppercases" them?
Unicode is here to serve everybody.

jmf
