
Py 3.3, unicode / upper()

Started by	wxjmfauth@gmail.com
First post	2012-12-19 06:23 -0800
Last post	2012-12-20 17:34 -0700
Articles	7 on this page of 47 — 13 participants


  Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 06:23 -0800
    Re: Py 3.3, unicode / upper() Thomas Bach <thbach@students.uni-mainz.de> - 2012-12-19 15:43 +0100
    Re: Py 3.3, unicode / upper() Christian Heimes <christian@python.org> - 2012-12-19 15:52 +0100
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 12:55 -0800
        Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 14:23 -0700
          Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:42 -0800
          Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:42 -0800
        Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 13:01 +1100
        Re: Py 3.3, unicode / upper() Westley Martínez <anikom15@gmail.com> - 2012-12-19 18:53 -0800
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 12:55 -0800
    Re: Py 3.3, unicode / upper() Stefan Krah <stefan-usenet@bytereef.org> - 2012-12-19 16:01 +0100
    Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 02:17 +1100
    Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-19 16:18 +0100
      Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-19 16:22 +0100
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 02:40 +1100
        Re: Py 3.3, unicode / upper() Johannes Bauer <dfnsonfsduifb@gmx.de> - 2012-12-20 15:57 +0100
      Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 11:27 -0700
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 13:18 -0800
          Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-19 14:31 -0700
            Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:40 -0800
              Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:48 -0500
              Re: Py 3.3, unicode / upper() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-20 22:51 +0000
            Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:40 -0800
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-19 13:18 -0800
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-19 19:39 -0500
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 13:03 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-19 21:54 -0500
      Re: Py 3.3, unicode / upper() Westley Martínez <anikom15@gmail.com> - 2012-12-19 19:12 -0800
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-20 14:22 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 00:32 -0500
        Re: Py 3.3, unicode / upper() Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-20 05:51 +0000
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:57 -0800
          Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:30 -0500
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:57 -0800
      Re: Py 3.3, unicode / upper() Serhiy Storchaka <storchaka@gmail.com> - 2012-12-27 21:00 +0200
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-27 11:36 -0800
        Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-27 11:36 -0800
    Re: Py 3.3, unicode / upper() Christian Heimes <christian@python.org> - 2012-12-19 16:33 +0100
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-29 11:16 -0800
      Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-29 11:16 -0800
    Re: Py 3.3, unicode / upper() Benjamin Peterson <benjamin@python.org> - 2012-12-19 20:25 +0000
    Re: Py 3.3, unicode / upper() wxjmfauth@gmail.com - 2012-12-20 11:19 -0800
      Re: Py 3.3, unicode / upper() MRAB <python@mrabarnett.plus.com> - 2012-12-20 20:20 +0000
      Re: Py 3.3, unicode / upper() Chris Angelico <rosuav@gmail.com> - 2012-12-21 08:19 +1100
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:12 -0500
      Re: Py 3.3, unicode / upper() Terry Reedy <tjreedy@udel.edu> - 2012-12-20 17:59 -0500
      Re: Py 3.3, unicode / upper() Ian Kelly <ian.g.kelly@gmail.com> - 2012-12-20 17:34 -0700

Page 3 of 3 — ← Prev page 1 2 [3]

#35152

From	Benjamin Peterson <benjamin@python.org>
Date	2012-12-19 20:25 +0000
Message-ID	<mailman.1070.1355948766.29569.python-list@python.org>
In reply to	#35115

 <wxjmfauth <at> gmail.com> writes:
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

Because that's what the Unicode spec says to do.


#35212

From	wxjmfauth@gmail.com
Date	2012-12-20 11:19 -0800
Message-ID	<a633063d-0887-48aa-8fd5-03d2bfe7d0f1@googlegroups.com>
In reply to	#35115

Fact.
In order to work comfortably and with efficiency with a "scheme for
the coding of the characters", can be unicode or any coding scheme,
one has to take into account two things: 1) work with a unique set
of characters and 2) work with a contiguous block of code points.

At this point, it should be noted that I have not even written about
the real coding, only about characters and code points.

Now, let's take a look at what happens when one breaks the rules
above and, precisely, if one attempts to work with multiple
character sets or if one divides - artificially - the whole range
of the unicode code points into chunks.

The first (and it should be quite obvious) consequence is that
you create bloated, unnecessary and useless code. I simplify
the flexible string representation (FSR) and will use an "ascii" / 
"non-ascii" model/terminology.

If you are an "ascii" user, a FSR model makes no sense. An
"ascii" user will use, by definition, only "ascii characters".

If you are a "non-ascii" user, the FSR model is also
nonsense, because you are by definition a "non-ascii" user of
"non-ascii" characters. Any optimisation for "ascii" users just
becomes irrelevant.

In one sense, to escape from this, you have to be at the same time
a non "ascii" user and a non "non-ascii" user. Impossible.
In both cases, a FSR model is useless and in both cases you are
forced to use bloated and unnecessary code.

The rule is to treat every character of a unique set of characters
of a coding scheme in, how to say, an "equal way". The problem
can be seen the other way: every coding scheme has been built
to work with a unique set of characters, otherwise it does not
work properly!

The second negative aspect of this splitting is just the
splitting itself. One can optimize every subset of characters,
but one will always be impacted by the "switch" between the subsets.
One more reason to work with a unique set of characters, and this is
the reason why every coding scheme handles a unique set of
characters.

Up to now, I spoke only about the characters and the sets of
characters, not about the coding of the characters.
There is a point which is quite hard to understand and also hard
to explain. It becomes obvious with some experience.

When one works with a coding scheme, one always has to think in
characters / code points. If one takes the perspective of encoded
code points, it simply does not work or may not work very well
(memory/speed). The whole problem is that it is impossible to
work with characters; one is forced to manipulate encoded code
points as characters. Unicode is built and thought out to work with
code points, not with encoded code points. The serialization,
the transformation code point -> encoded code point, is "only" a
technical and secondary process. Surprise, all the unicode
coding schemes (utf-8, 16, 32) work with the same
set of characters. They differ in the serialization, but
they all work with a unique set of characters.
The utf-16 / ucs-2 pair is an interesting case. Their encoding mechanisms
are almost the same; the difference lies in the sets of characters.
There is another way to understand the problem empirically:
the historical evolution of the coding of the characters. Practically
all the coding schemes have been created to handle different sets of
characters; coding schemes have been created because it is the
only way to work properly. If it had been possible to work
with multiple coding schemes, I'm pretty sure a solution would
have emerged. It never happened; otherwise it would not have been
necessary to create iso10646 or unicode. Nor would it have been necessary
to create all these codings iso-8859-***, cp***, mac** which are
all *based on a set of characters*.

Plan 9 attempted to work with multiple character sets; it did not
work very well. Main issue: the switch between the codings.

A solution à la FSR cannot work, or cannot work in an optimized way.
It is not a coding scheme, it is a composite of coding schemes
handling several character sets. Hard to imagine something worse.

Contrary to what has been said, the bad cases I presented here are
not corner cases. There is practically and systematically a regression
in Py33 compared to Py32.
That's very easy to test. I did all my tests in the light of what
I explained above. It was not a surprise for me to see this expectedly
bad behaviour.

Python is not my tool. If I may give some advice: take a
scientific approach.
I suggest the core devs first spend their time proving that
a FSR model can beat the existing models (purely at the C level).
Then, if they succeed, implement it.

My feeling is that most people are defending this FSR simply
because it exists, not because of its intrinsic quality.

Hint: I suggest the experts take a comprehensive look at the
cmap table of the OpenType fonts (pure unicode technology).
Those people know how to work.

I would be very happy to be wrong. Unfortunately, I'm afraid
it's not the case.

jmf


#35218

From	MRAB <python@mrabarnett.plus.com>
Date	2012-12-20 20:20 +0000
Message-ID	<mailman.1107.1356034858.29569.python-list@python.org>
In reply to	#35212

On 2012-12-20 19:19, wxjmfauth@gmail.com wrote:
> Fact.
> In order to work comfortably and with efficiency with a "scheme for
> the coding of the characters", can be unicode or any coding scheme,
> one has to take into account two things: 1) work with a unique set
> of characters and 2) work with a contiguous block of code points.
>
> At this point, it should be noted that I have not even written about
> the real coding, only about characters and code points.
>
> Now, let's take a look at what happens when one breaks the rules
> above and, precisely, if one attempts to work with multiple
> character sets or if one divides - artificially - the whole range
> of the unicode code points into chunks.
>
> The first (and it should be quite obvious) consequence is that
> you create bloated, unnecessary and useless code. I simplify
> the flexible string representation (FSR) and will use an "ascii" /
> "non-ascii" model/terminology.
>
> If you are an "ascii" user, a FSR model makes no sense. An
> "ascii" user will use, by definition, only "ascii characters".
>
> If you are a "non-ascii" user, the FSR model is also
> nonsense, because you are by definition a "non-ascii" user of
> "non-ascii" characters. Any optimisation for "ascii" users just
> becomes irrelevant.
>
> In one sense, to escape from this, you have to be at the same time
> a non "ascii" user and a non "non-ascii" user. Impossible.
> In both cases, a FSR model is useless and in both cases you are
> forced to use bloated and unnecessary code.
>
> The rule is to treat every character of a unique set of characters
> of a coding scheme in, how to say, an "equal way". The problem
> can be seen the other way: every coding scheme has been built
> to work with a unique set of characters, otherwise it does not
> work properly!
>
[snip]
It's true that in an ideal world you would treat all codepoints the
same. However, this is a case where "practicality beats purity".

In order to accommodate every codepoint you need 3 bytes per codepoint
(although for pragmatic reasons it's 4 bytes per codepoint).

But not all codepoints are used equally. Those in the "astral plane",
for example, are used rarely, so the vast majority of the time you
would be using twice as much memory as strictly necessary. There are
also, in reality, many times in which strings contain only ASCII-range
codepoints, although they may not be visible to the average user, being
the names of functions and attributes in program code, or tags and
attributes in HTML and XML.

FSR is a pragmatic solution to dealing with limited resources.

Would you prefer there to be a switch that makes strings always use 4
bytes per codepoint for those users and systems where memory is no
object?
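
The storage widths discussed above can be observed from Python itself. This is only a minimal sketch, assuming CPython 3.3 or later (other implementations may store strings differently); the helper name bytes_per_char is made up for the illustration and is not part of any library.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate CPython's per-character storage for a string made only of `ch`.

    Subtracting the sizes of two strings of the same kind cancels the
    fixed object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

# The largest code point (U+10FFFF) needs 21 bits, hence "3 bytes per codepoint".
max_bits = sys.maxunicode.bit_length()

ascii_width = bytes_per_char("a")             # ASCII range
latin1_width = bytes_per_char("\u00e9")       # Latin-1 range (e-acute)
bmp_width = bytes_per_char("\u20ac")          # rest of the BMP (euro sign)
astral_width = bytes_per_char("\U0001F600")   # astral plane

print("bits needed for the largest code point:", max_bits)
print("bytes per character:", ascii_width, latin1_width, bmp_width, astral_width)
```

Under those assumptions, a string is stored at the width of its widest code point, so only strings that actually contain astral characters pay for 4 bytes per character.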


#35230

From	Chris Angelico <rosuav@gmail.com>
Date	2012-12-21 08:19 +1100
Message-ID	<mailman.1113.1356038360.29569.python-list@python.org>
In reply to	#35212

On Fri, Dec 21, 2012 at 7:20 AM, MRAB <python@mrabarnett.plus.com> wrote:
> On 2012-12-20 19:19, wxjmfauth@gmail.com wrote:
>> The rule is to treat every character of a unique set of characters
>> of a coding scheme in, how to say, an "equal way". The problem
>> can be seen the other way: every coding scheme has been built
>> to work with a unique set of characters, otherwise it does not
>> work properly!
>>
> It's true that in an ideal world you would treat all codepoints the
> same. However, this is a case where "practicality beats purity".

Actually no. Not all codepoints are the same. Ever heard of Huffman
coding? It's a broad technique used in everything from PK-ZIP/gzip
file compression to the Morse code ("here come dots!"). It exploits
and depends on a dramatically unequal usage distribution pattern, as
all text (he will ask "All?" You will respond "All!" He will
understand -- referring to Caesar) exhibits.
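
As a rough illustration of how Huffman coding exploits an unequal distribution, here is a minimal sketch that only computes code lengths (not the bit patterns). The function name huffman_code_lengths and the sample text are invented for this example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from `text`."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: 1 for sym in counts}
    # Heap entries are (weight, tie_breaker, {symbol: depth}); the unique
    # tie_breaker guarantees the dicts themselves are never compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

sample = "aaaaaaaabbbc"   # heavily skewed: 8 x 'a', 3 x 'b', 1 x 'c'
lengths = huffman_code_lengths(sample)
huffman_bits = sum(lengths[ch] for ch in sample)
fixed_bits = 2 * len(sample)   # 3 distinct symbols need 2 bits each at fixed width

print(lengths, huffman_bits, fixed_bits)
```

The frequent symbol gets the short code, so the skewed sample needs fewer bits than a fixed-width code would.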

In the case of strings in a Python program, it's fairly obvious that
there will be *many* that are ASCII-only; and what's more, most of the
long strings will either be ASCII-only or have a large number of
non-ASCII characters. However, your microbenchmarks usually look at
two highly unusual cases: either a string with a huge number of ASCII
chars and one non-ASCII, or all the same non-ASCII (usually for your
replace() tests). I haven't seen strings like either of those come up.

Can you show us a performance regression in an *actual* *production*
*program*? And make sure you're comparing against a wide build, here.

ChrisA


#35233

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-12-20 17:12 -0500
Message-ID	<mailman.1115.1356041561.29569.python-list@python.org>
In reply to	#35212

On 12/20/2012 2:19 PM, wxjmfauth@gmail.com wrote:

>
> If you are an "ascii" user, a FSR model makes no sense. An
> "ascii" user will use, by definition, only "ascii characters".
>
> If you are a "non-ascii" user, the FSR model is also
> nonsense, because you are by definition a "non-ascii" user of
> "non-ascii" characters. Any optimisation for "ascii" users just
> becomes irrelevant.

This is a false dichotomy. Conclusions based on falsity are false.

> In one sense, to escape from this, you have to be at the same time
> a non "ascii" user and a non "non-ascii" user. Impossible.

This is wrong. Every Python user is an ascii user. All names in the 
stdlib are ascii-only. These names all become strings in code objects. 
All docstrings (with a couple of rare exceptions) are ascii-only. They 
also become strings. *Every Python user* benefits from the new system in 
3.3.
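
The claim about names is easy to spot-check. The following is a small sketch, not a survey of the whole stdlib: it looks only at the keywords, the builtins and the os module, and the helper non_ascii_names is a name chosen for this example.

```python
import builtins
import keyword
import os

def non_ascii_names(names):
    """Return the names that contain at least one character outside ASCII."""
    return [name for name in names if any(ord(c) > 127 for c in name)]

# A small sample of names every Python program carries around as str objects.
names = set(keyword.kwlist) | set(dir(builtins)) | set(dir(os))
offenders = non_ascii_names(names)

print(len(names), "names checked,", len(offenders), "contain non-ASCII characters")
```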

Some Python users are also non-ascii users. This includes many English 
speakers, as many English texts include non-ascii characters. (Just for 
starters, the copyright and trademark symbols are not in the ascii set.)

> Contrary to what has been said, the bad cases I presented here are
> not corner cases. There is practically and systematically a regression
> in Py33 compared to Py32.

I posted evidence otherwise. Jim never responded to those posts. Instead 
he repeats the falsehood refuted by evidence.

> That's very easy to test.

Yes. Run stringbench.py on the same OS/machine under 3.2 and 3.3, as I did.

-- 
Terry Jan Reedy


#35240

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-12-20 17:59 -0500
Message-ID	<mailman.1119.1356044394.29569.python-list@python.org>
In reply to	#35212

On 12/20/2012 2:19 PM, wxjmfauth@gmail.com wrote:

> My feeling is that most people are defending this FSR simply
> because it exists, not because of its intrinsic quality.

The fact, contrary to your feeling, is that I was initially dubious that 
it could be made to work as well as it does. I was only really convinced 
when I ran stringbench in response to your over-generalized assertions.

It is also a fact that I proposed on the tracker and pydev list a 
different method of fixing the length and index bugs in narrow builds. 
It only saved space relative to wide builds but did not have the 
additional space-saving of the new scheme for ascii and latin-1 text.

-- 
Terry Jan Reedy


#35243

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-12-20 17:34 -0700
Message-ID	<mailman.1121.1356050077.29569.python-list@python.org>
In reply to	#35212

On Thu, Dec 20, 2012 at 12:19 PM,  <wxjmfauth@gmail.com> wrote:
> The first (and it should be quite obvious) consequence is that
> you create bloated, unnecessary and useless code. I simplify
> the flexible string representation (FSR) and will use an "ascii" /
> "non-ascii" model/terminology.
>
> If you are an "ascii" user, a FSR model makes no sense. An
> "ascii" user will use, by definition, only "ascii characters".
>
> If you are a "non-ascii" user, the FSR model is also
> nonsense, because you are by definition a "non-ascii" user of
> "non-ascii" characters. Any optimisation for "ascii" users just
> becomes irrelevant.
>
> In one sense, to escape from this, you have to be at the same time
> a non "ascii" user and a non "non-ascii" user. Impossible.
> In both cases, a FSR model is useless and in both cases you are
> forced to use bloated and unnecessary code.

As Terry and Steven have already pointed out, there is no such thing
as a "non-ascii" user.  Here I will take the complementary approach
and point out that there is also no such thing as an "ascii" user.
There are only users whose strings are 99.99% (or more) ASCII.  A user
may think that his program will never be given any non-ASCII input to
deal with, but experience tells us that this thought is probably
wrong.

Suppose you were to split the Unicode representation into separate
"ASCII-only" and "wide" data types.  Then which data type is the
correct one to choose for an "ascii" user?  The correct answer is
*always* the wide data type, for the reason stated above.  If the user
chooses the ASCII-only data type, then as soon as his program encounters
non-ASCII data, it breaks.  The only users of the ASCII-only data type
then would be the authors of buggy programs.  The same issue applies
to narrow (UTF-16) data types.  So there really are only two viable,
non-buggy options for Unicode representations: FSR, or always wide
(UTF-32).  The latter is wildly inefficient in many cases, so Python
went with FSR.
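
Both failure modes can be sketched in a few lines. Python has no separate ASCII-only str type, so the hypothetical to_ascii_only helper below stands in for one by going through the ascii codec; the UTF-16 part simply counts 16-bit code units and assumes Python 3.3 or later for the len() result.

```python
def to_ascii_only(text):
    """Stand-in for a hypothetical ASCII-only string type: reject anything else."""
    return text.encode("ascii")

# Works for as long as the input happens to be pure ASCII...
to_ascii_only("plain ASCII input")

# ...and breaks as soon as real-world text shows up.
try:
    to_ascii_only("na\u00efve caf\u00e9")
    broke = False
except UnicodeEncodeError:
    broke = True

# The narrow (UTF-16) analogue: one astral code point needs two 16-bit
# code units, which is what made len() and indexing wrong on narrow builds.
astral = "\U0001F600"
utf16_units = len(astral.encode("utf-16-le")) // 2

print("ASCII-only type broke:", broke)
print("code points:", len(astral), "UTF-16 code units:", utf16_units)
```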

A third option might be proposed, which would be to have a build
switch between FSR or always wide, with the promise that the two will
be indistinguishable at the Python level (apart from the amount of
memory used).  This is probably not on the table, however, as it would
have a non-negligible maintenance cost, and it's not clear that
anybody other than you would actually want it.

> A solution à la FSR cannot work, or cannot work in an optimized way.
> It is not a coding scheme, it is a composite of coding schemes
> handling several character sets. Hard to imagine something worse.

It is not a composite of coding schemes.  The str type deals with
exactly *one* character set -- the UCS.  The different representations
are not different coding schemes.  They are *all* UTF-32.  The only
significant difference between the representations is that the leading
zero bytes of each character are made implicit (i.e. truncated) if the
nature of the string allows it.
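
The "truncated UTF-32" view can be modelled in pure Python. This is only an illustration of the idea, not CPython's actual C implementation (which uses native byte order and has more bookkeeping); narrowest_width and truncate_utf32 are names invented here.

```python
def narrowest_width(text):
    """Bytes per character needed for the widest code point in `text`."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4

def truncate_utf32(text):
    """Encode as big-endian UTF-32, then drop the leading zero bytes of each unit."""
    width = narrowest_width(text)
    utf32 = text.encode("utf-32-be")
    # Each character is one 4-byte unit; keep only its last `width` bytes.
    return b"".join(utf32[i + 4 - width:i + 4] for i in range(0, len(utf32), 4))

latin1_text = "caf\u00e9"
bmp_text = "a\u20ac"
astral_text = "a\U0001F600"

print(truncate_utf32(latin1_text))   # same bytes as the latin-1 encoding
print(truncate_utf32(bmp_text))      # same bytes as big-endian UCS-2
print(truncate_utf32(astral_text))   # full UTF-32, nothing to drop
```

In this model the three representations differ only in how many zero bytes are dropped, which is why they all describe the same single character set.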

> Contrary to what has been said, the bad cases I presented here are
> not corner cases.

The only significantly regressive case that you've presented here has
been str.replace on inputs engineered for bad performance.  That's why
people characterize them as corner cases -- because that's exactly
what they are.

> There is practically and systematically a regression
> in Py33 compared to Py32.
> That's very easy to test. I did all my tests in the light of what
> I explained above. It was not a surprise for me to see this expectedly
> bad behaviour.

Have you run stringbench.py yet?  When I ran it on my system, the full
set of Unicode benchmarks ran in 268.15 seconds for Python 3.2 versus
198.77 seconds for Python 3.3.  That's a 26% reduction in overall run
time for the covered benchmarks, which seem reasonably thorough.  That does not
demonstrate a "systematic regression".  If anything, that shows a
systematic improvement.
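
For reference, the arithmetic behind the 26% figure, using only the two timings quoted above (the variable names are chosen for this sketch):

```python
py32_seconds = 268.15   # stringbench Unicode benchmarks, Python 3.2 (quoted above)
py33_seconds = 198.77   # same benchmarks, Python 3.3 (quoted above)

# Fraction of run time saved by 3.3 relative to 3.2.
time_saved = 1 - py33_seconds / py32_seconds

# The same result expressed as a throughput ratio.
throughput_ratio = py32_seconds / py33_seconds

print("{:.1%} less run time".format(time_saved))
print("{:.2f}x the throughput".format(throughput_ratio))
```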

Your cherry-picking of benchmarks is like a driver who has two routes
to their destination; one takes ten minutes on average but has one
annoyingly long traffic light, while the second takes fifteen minutes
on average but has no traffic lights (and a correspondingly higher
accident rate).  Yet for some reason you insist that the second route
is better because the traffic light makes the first route
"systematically" slower.
