Re: Is Unicode support so hard...

References	<d9798b4e-2825-4a36-93a3-f8a03d37a4bc@b3g2000vbo.googlegroups.com> <5172CEE4.9070403@nedbatchelder.com>
Date	2013-04-21 04:14 +1000
Subject	Re: Is Unicode support so hard...
From	Chris Angelico <rosuav@gmail.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.859.1366481699.3114.python-list@python.org> (permalink)

Show all headers | View raw

On Sun, Apr 21, 2013 at 3:22 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
> I'm totally confused about what you are saying.  What does "make a better
> Unicode than Unicode" mean?  Are you saying that Python is guilty of this?
> In what way?  Can you provide specifics?  Or are you saying that you like
> how Python has implemented it?  "FSR is failing ... a delight"?  I don't
> know what you mean.

You're not familiar with jmf? He's one of our resident trolls. Allow
me to summarize Python 3's Unicode support...

>From 3.0 up to and including 3.2.x, Python could be built as either
"narrow" or "wide". A wide build consumes four bytes per character in
every string, which is rather wasteful (given that very few strings
actually NEED that); a narrow build gets some things wrong. (I'm using
a 2.7 here as I don't have a narrow-build 3.x handy; the same
considerations apply, though.)

Python 2.7.4 (default, Apr  6 2013, 19:54:46) [MSC v.1500 32 bit
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> len(u"asdf\U00012345qwer")
10
>>> u"asdf\U00012345qwer"[8]
u'e'

In a narrow build, strings are stored in UTF-16, so astral characters
count as two. This means that a program will behave unexpectedly
differently on different platforms (other languages, such as
ECMAScript, actually *mandate* UTF-16; at least this means you can
depend on this otherwise-bizarre behaviour regardless of what platform
you're on), and I have to say this is counter-intuitive.

Enter Python 3.3 and PEP 393 strings. Now *EVERY* Python build is,
conceptually, wide. (I'm not sure how PEP 393 applies to other Pythons
- Jython, PyPy, etc - so assume that whenever I refer to Python, I'm
restricting this to CPython.) The underlying representation might be
more efficient, but to the script, it's exactly the same as a wide
build. If a string has no characters that demand more width, it'll be
stored nice and narrow. (It's the same technique that Pike has been
using for a while, so it's a proven system; in any case, we know that
this is going to work, it's just a question of performance - it adds a
fixed overhead.) Great! We save memory in Python programs. Wonderful!
Right?

Enter jmf. No, it's not wonderful, because OBVIOUSLY Python is now
America-centric, because now the full Unicode range is divided into
"these ones get stored in 1 byte per char, these in 2, these in 4".
Clearly that's making life way worse for everyone else. Also, compared
to the narrow build that jmf was previously using, this uses heaps
MORE space in the stupid micro-benchmarks that he keeps on trotting
out, because he has just one astral character in a sea of ASCII. And
that's totally what programs are doing all the time, too. Never mind
that basic operations like length, slicing, etc are no longer buggy,
no, Python has taken a terrible step backwards here.

Oh, and check this out:

>>> def munge(s):
	"""Move characters around in a string."""
	l=len(s)//4
	return s[:l]+s[l*2:l*3]+s[l:l*2]+s[l*3:]

>>> munge("asdfqwerzxcv1234")
'asdfzxcvqwer1234'

Looks fine.

>>> munge(u"asd\U00012345we\U00034567xc\U00023456bla")
u'asd\U00012167xc\U00023745we\U00034456bla'

Where'd those characters come from? I was just moving stuff around,
right? I can't get new characters out of it... can I?

Flash forward to current date, and jmf has hijacked so many threads to
moan about PEP 393 that I'm actually happy about this one, simply
because he gave it a new subject line and one appropriate to a
discussion about Unicode.

ChrisA

Thread

Is Unicode support so hard... jmfauth <wxjmfauth@gmail.com> - 2013-04-20 10:12 -0700
  Re: Is Unicode support so hard... Ned Batchelder <ned@nedbatchelder.com> - 2013-04-20 13:22 -0400
  Re: Is Unicode support so hard... Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-04-20 11:02 -0700
  Re: Is Unicode support so hard... Chris Angelico <rosuav@gmail.com> - 2013-04-21 04:14 +1000
  Re: Is Unicode support so hard... Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-04-20 20:15 +0200
  Re: Is Unicode support so hard... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-04-20 19:18 +0100
  Re: Is Unicode support so hard... Neil Hodgson <nhodgson@iinet.net.au> - 2013-04-21 09:03 +1000
    Re: Is Unicode support so hard... rusi <rustompmody@gmail.com> - 2013-04-20 18:37 -0700
      Re: Is Unicode support so hard... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-21 03:36 +0000
        Re: Is Unicode support so hard... Chris Angelico <rosuav@gmail.com> - 2013-04-21 13:42 +1000
      Re: Is Unicode support so hard... Terry Jan Reedy <tjreedy@udel.edu> - 2013-04-21 05:02 -0400
      Re: Is Unicode support so hard... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-04-21 13:03 +0100
  Re: Is Unicode support so hard... Ethan Furman <ethan@stoneleaf.us> - 2013-04-20 18:06 -0700
  Re: Is Unicode support so hard... 88888 Dihedral <dihedral88888@googlemail.com> - 2013-04-20 23:09 -0700

csiph-web