Is Unicode support so hard...

Started by	jmfauth <wxjmfauth@gmail.com>
First post	2013-04-20 10:12 -0700
Last post	2013-04-20 23:09 -0700
Articles	14 — 12 participants

  Is Unicode support so hard... jmfauth <wxjmfauth@gmail.com> - 2013-04-20 10:12 -0700
    Re: Is Unicode support so hard... Ned Batchelder <ned@nedbatchelder.com> - 2013-04-20 13:22 -0400
    Re: Is Unicode support so hard... Benjamin Kaplan <benjamin.kaplan@case.edu> - 2013-04-20 11:02 -0700
    Re: Is Unicode support so hard... Chris Angelico <rosuav@gmail.com> - 2013-04-21 04:14 +1000
    Re: Is Unicode support so hard... Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-04-20 20:15 +0200
    Re: Is Unicode support so hard... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-04-20 19:18 +0100
    Re: Is Unicode support so hard... Neil Hodgson <nhodgson@iinet.net.au> - 2013-04-21 09:03 +1000
      Re: Is Unicode support so hard... rusi <rustompmody@gmail.com> - 2013-04-20 18:37 -0700
        Re: Is Unicode support so hard... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-04-21 03:36 +0000
          Re: Is Unicode support so hard... Chris Angelico <rosuav@gmail.com> - 2013-04-21 13:42 +1000
        Re: Is Unicode support so hard... Terry Jan Reedy <tjreedy@udel.edu> - 2013-04-21 05:02 -0400
        Re: Is Unicode support so hard... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-04-21 13:03 +0100
    Re: Is Unicode support so hard... Ethan Furman <ethan@stoneleaf.us> - 2013-04-20 18:06 -0700
    Re: Is Unicode support so hard... 88888 Dihedral <dihedral88888@googlemail.com> - 2013-04-20 23:09 -0700

#43958 — Is Unicode support so hard...

From	jmfauth <wxjmfauth@gmail.com>
Date	2013-04-20 10:12 -0700
Subject	Is Unicode support so hard...
Message-ID	<d9798b4e-2825-4a36-93a3-f8a03d37a4bc@b3g2000vbo.googlegroups.com>

In a previous post,

http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
,

Chris “Kwpolska” Warrick wrote:

“Is Unicode support so hard, especially in the 21st century?”

--

Unicode is not really complicated and it works very well (more
than two decades of development if you take into account
iso-14****).

But, - I can say, "as usual" - people prefer to spend their
time making a "better Unicode than Unicode" and it usually
fails. Python does not escape this rule.

-----

I'm "busy" with TeX (unicode engine variant), fonts and typography.
This gives me plenty of ideas to test the "flexible string
representation" (FSR). I should recognize this FSR is failing
particularly well...

I can almost say, a delight.

jmf
Unicode lover

#43959

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-04-20 13:22 -0400
Message-ID	<mailman.856.1366478585.3114.python-list@python.org>
In reply to	#43958

On 4/20/2013 1:12 PM, jmfauth wrote:
> In a previous post,
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
> ,
>
> Chris “Kwpolska” Warrick wrote:
>
> “Is Unicode support so hard, especially in the 21st century?”
>
> --
>
> Unicode is not really complicate and it works very well (more
> than two decades of development if you take into account
> iso-14****).
>
> But, - I can say, "as usual" - people prefer to spend their
> time to make a "better Unicode than Unicode" and it usually
> fails. Python does not escape to this rule.
>
> -----
>
> I'm "busy" with TeX (unicode engine variant), fonts and typography.
> This gives me plenty of ideas to test the "flexible string
> representation" (FSR). I should recognize this FSR is failing
> particulary very well...
>
> I can almost say, a delight.
>
> jmf
> Unicode lover
I'm totally confused about what you are saying.  What does "make a 
better Unicode than Unicode" mean?  Are you saying that Python is guilty 
of this?  In what way?  Can you provide specifics?  Or are you saying 
that you like how Python has implemented it?  "FSR is failing ... a 
delight"?  I don't know what you mean.

--Ned.

#43961

From	Benjamin Kaplan <benjamin.kaplan@case.edu>
Date	2013-04-20 11:02 -0700
Message-ID	<mailman.858.1366481215.3114.python-list@python.org>
In reply to	#43958

On Sat, Apr 20, 2013 at 10:22 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
> On 4/20/2013 1:12 PM, jmfauth wrote:
>>
>> In a previous post,
>>
>>
>> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
>> ,
>>
>> Chris “Kwpolska” Warrick wrote:
>>
>> “Is Unicode support so hard, especially in the 21st century?”
>>
>> --
>>
>> Unicode is not really complicate and it works very well (more
>> than two decades of development if you take into account
>> iso-14****).
>>
>> But, - I can say, "as usual" - people prefer to spend their
>> time to make a "better Unicode than Unicode" and it usually
>> fails. Python does not escape to this rule.
>>
>> -----
>>
>> I'm "busy" with TeX (unicode engine variant), fonts and typography.
>> This gives me plenty of ideas to test the "flexible string
>> representation" (FSR). I should recognize this FSR is failing
>> particulary very well...
>>
>> I can almost say, a delight.
>>
>> jmf
>> Unicode lover
>
> I'm totally confused about what you are saying.  What does "make a better
> Unicode than Unicode" mean?  Are you saying that Python is guilty of this?
> In what way?  Can you provide specifics?  Or are you saying that you like
> how Python has implemented it?  "FSR is failing ... a delight"?  I don't
> know what you mean.
>
> --Ned.

Don't bother trying to figure this out. jmfauth has been hijacking
every thread that mentions Unicode to complain about the flexible
string representation introduced in Python 3.3. Apparently, having
proper Unicode semantics (indexing is based on code points, not
UTF-16 code units) at the expense of performance when calling .replace
on the only non-ASCII or non-BMP character in the string is a horrible
bug.

#43962

From	Chris Angelico <rosuav@gmail.com>
Date	2013-04-21 04:14 +1000
Message-ID	<mailman.859.1366481699.3114.python-list@python.org>
In reply to	#43958

On Sun, Apr 21, 2013 at 3:22 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
> I'm totally confused about what you are saying.  What does "make a better
> Unicode than Unicode" mean?  Are you saying that Python is guilty of this?
> In what way?  Can you provide specifics?  Or are you saying that you like
> how Python has implemented it?  "FSR is failing ... a delight"?  I don't
> know what you mean.

You're not familiar with jmf? He's one of our resident trolls. Allow
me to summarize Python 3's Unicode support...

From 3.0 up to and including 3.2.x, Python could be built as either
"narrow" or "wide". A wide build consumes four bytes per character in
every string, which is rather wasteful (given that very few strings
actually NEED that); a narrow build gets some things wrong. (I'm using
a 2.7 here as I don't have a narrow-build 3.x handy; the same
considerations apply, though.)

Python 2.7.4 (default, Apr  6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> len(u"asdf\U00012345qwer")
10
>>> u"asdf\U00012345qwer"[8]
u'e'

In a narrow build, strings are stored in UTF-16, so astral characters
count as two. This means that a program will behave unexpectedly
differently on different platforms (other languages, such as
ECMAScript, actually *mandate* UTF-16; at least this means you can
depend on this otherwise-bizarre behaviour regardless of what platform
you're on), and I have to say this is counter-intuitive.
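
The narrow-build numbers above can be reproduced on a current Python 3 by counting UTF-16 code units by hand. This is only a sketch: the helper name utf16_units is made up for illustration, and it simulates a narrow build rather than running one.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    This imitates how a narrow build stored the string; it is not
    an actual narrow build.
    """
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


s = "asdf\U00012345qwer"
units = utf16_units(s)

print(len(s))         # number of code points (what 3.3+ reports)
print(len(units))     # number of code units (what a narrow build reported)
print(s[8])           # index 8 counted in code points
print(chr(units[8]))  # index 8 counted in code units
```

The astral character occupies two code units (a surrogate pair), so every index after it is shifted by one compared with counting code points.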

Enter Python 3.3 and PEP 393 strings. Now *EVERY* Python build is,
conceptually, wide. (I'm not sure how PEP 393 applies to other Pythons
- Jython, PyPy, etc - so assume that whenever I refer to Python, I'm
restricting this to CPython.) The underlying representation might be
more efficient, but to the script, it's exactly the same as a wide
build. If a string has no characters that demand more width, it'll be
stored nice and narrow. (It's the same technique that Pike has been
using for a while, so it's a proven system; in any case, we know that
this is going to work, it's just a question of performance - it adds a
fixed overhead.) Great! We save memory in Python programs. Wonderful!
Right?
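
One rough way to see the per-string width on CPython 3.3 or later is to compare sys.getsizeof for two lengths of the same repeated character. The helper bytes_per_char is a hypothetical name for this sketch, and the result depends on CPython's internal layout, so other implementations may report something else.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate storage per character for a string made only of ch.

    Subtracting the sizes of two strings of different length cancels
    out the fixed object header (CPython-specific measurement).
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for label, ch in [("ASCII", "a"),
                  ("Latin-1", "\xe9"),
                  ("BMP", "\u20ac"),
                  ("astral", "\U00012345")]:
    print(label, bytes_per_char(ch))
```

On CPython this should print 1, 1, 2 and 4 bytes per character respectively: the width is chosen per string from the widest character it contains.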

Enter jmf. No, it's not wonderful, because OBVIOUSLY Python is now
America-centric, because now the full Unicode range is divided into
"these ones get stored in 1 byte per char, these in 2, these in 4".
Clearly that's making life way worse for everyone else. Also, compared
to the narrow build that jmf was previously using, this uses heaps
MORE space in the stupid micro-benchmarks that he keeps on trotting
out, because he has just one astral character in a sea of ASCII. And
that's totally what programs are doing all the time, too. Never mind
that basic operations like length, slicing, etc are no longer buggy,
no, Python has taken a terrible step backwards here.

Oh, and check this out:

>>> def munge(s):
...     """Move characters around in a string."""
...     l = len(s) // 4
...     return s[:l] + s[l*2:l*3] + s[l:l*2] + s[l*3:]
...

>>> munge("asdfqwerzxcv1234")
'asdfzxcvqwer1234'

Looks fine.

>>> munge(u"asd\U00012345we\U00034567xc\U00023456bla")
u'asd\U00012167xc\U00023745we\U00034456bla'

Where'd those characters come from? I was just moving stuff around,
right? I can't get new characters out of it... can I?
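
To see where the new characters come from, the same munge can be run on a tuple of UTF-16 code units under Python 3. This is a simulation, assuming the narrow build slices at code-unit boundaries; to_units and from_units are helper names invented for this sketch.

```python
import struct


def munge(s):
    """Move characters around in a string (or any sliceable sequence)."""
    l = len(s) // 4
    return s[:l] + s[l*2:l*3] + s[l:l*2] + s[l*3:]


def to_units(s):
    """Encode s as a tuple of 16-bit UTF-16 code units."""
    data = s.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)


def from_units(units):
    """Decode a sequence of 16-bit code units back into a str."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le", "surrogatepass")


s = "asd\U00012345we\U00034567xc\U00023456bla"

wide = munge(s)                            # slices between code points
narrow = from_units(munge(to_units(s)))    # slices between code units

print(ascii(wide))
print(ascii(narrow))
```

With 16 code units, the cut points fall in the middle of each surrogate pair. The high half of one pair ends up next to the low half of another, and the two decode as a character that was never in the input. Slicing by code point keeps the three astral characters intact.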

Flash forward to current date, and jmf has hijacked so many threads to
moan about PEP 393 that I'm actually happy about this one, simply
because he gave it a new subject line and one appropriate to a
discussion about Unicode.

ChrisA

#43963

From	Chris “Kwpolska” Warrick <kwpolska@gmail.com>
Date	2013-04-20 20:15 +0200
Message-ID	<mailman.860.1366481736.3114.python-list@python.org>
In reply to	#43958

On Sat, Apr 20, 2013 at 8:02 PM, Benjamin Kaplan
<benjamin.kaplan@case.edu> wrote:
> On Sat, Apr 20, 2013 at 10:22 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
>> On 4/20/2013 1:12 PM, jmfauth wrote:
>>>
>>> In a previous post,
>>>
>>>
>>> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
>>> ,
>>>
>>> Chris “Kwpolska” Warrick wrote:
>>>
>>> “Is Unicode support so hard, especially in the 21st century?”
>>>
>>> --
>>>
>>> Unicode is not really complicate and it works very well (more
>>> than two decades of development if you take into account
>>> iso-14****).
>>>
>>> But, - I can say, "as usual" - people prefer to spend their
>>> time to make a "better Unicode than Unicode" and it usually
>>> fails. Python does not escape to this rule.
>>>
>>> -----
>>>
>>> I'm "busy" with TeX (unicode engine variant), fonts and typography.
>>> This gives me plenty of ideas to test the "flexible string
>>> representation" (FSR). I should recognize this FSR is failing
>>> particulary very well...
>>>
>>> I can almost say, a delight.
>>>
>>> jmf
>>> Unicode lover
>>
>> I'm totally confused about what you are saying.  What does "make a better
>> Unicode than Unicode" mean?  Are you saying that Python is guilty of this?
>> In what way?  Can you provide specifics?  Or are you saying that you like
>> how Python has implemented it?  "FSR is failing ... a delight"?  I don't
>> know what you mean.
>>
>> --Ned.
>
> Don't bother trying to figure this out. jmfauth has been hijacking
> every thread that mentions Unicode to complain about the flexible
> string representation introduced in Python 3.3. Apparently, having
> proper Unicode semantics (indexing is based on characters, not code
> points) at the expense of performance when calling .replace on the
> only non-ASCII or BMP character in the string is a horrible bug.
> --
> http://mail.python.org/mailman/listinfo/python-list

Don’t forget the original context: this was a short remark to a guy I
was responding to.  His newsgroups software (slrn according to the
headers) mangled the encoding of U+201C and U+201D in my From field,
turning them into three question marks each.  And jmf started a rant,
as usual…

PS. There are two fancy Unicode characters around.  Can you find both
of them, jmf?

--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail                | always bottom-post
http://asciiribbon.org        | http://caliburn.nl/topposting.html

#43964

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-04-20 19:18 +0100
Message-ID	<mailman.861.1366481904.3114.python-list@python.org>
In reply to	#43958

On 20/04/2013 19:02, Benjamin Kaplan wrote:
> On Sat, Apr 20, 2013 at 10:22 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
>> On 4/20/2013 1:12 PM, jmfauth wrote:
>>>
>>> In a previous post,
>>>
>>>
>>> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
>>> ,
>>>
>>> Chris “Kwpolska” Warrick wrote:
>>>
>>> “Is Unicode support so hard, especially in the 21st century?”
>>>
>>> --
>>>
>>> Unicode is not really complicate and it works very well (more
>>> than two decades of development if you take into account
>>> iso-14****).
>>>
>>> But, - I can say, "as usual" - people prefer to spend their
>>> time to make a "better Unicode than Unicode" and it usually
>>> fails. Python does not escape to this rule.
>>>
>>> -----
>>>
>>> I'm "busy" with TeX (unicode engine variant), fonts and typography.
>>> This gives me plenty of ideas to test the "flexible string
>>> representation" (FSR). I should recognize this FSR is failing
>>> particulary very well...
>>>
>>> I can almost say, a delight.
>>>
>>> jmf
>>> Unicode lover
>>
>> I'm totally confused about what you are saying.  What does "make a better
>> Unicode than Unicode" mean?  Are you saying that Python is guilty of this?
>> In what way?  Can you provide specifics?  Or are you saying that you like
>> how Python has implemented it?  "FSR is failing ... a delight"?  I don't
>> know what you mean.
>>
>> --Ned.
>
> Don't bother trying to figure this out. jmfauth has been hijacking
> every thread that mentions Unicode to complain about the flexible
> string representation introduced in Python 3.3. Apparently, having
> proper Unicode semantics (indexing is based on characters, not code
> points) at the expense of performance when calling .replace on the
> only non-ASCII or BMP character in the string is a horrible bug.
>

He can't complain about performance for the .replace issue any more as 
it's been fixed http://bugs.python.org/issue16061

Sadly he'll almost certainly have more edge cases up his sleeve while 
continuing to ignore minor issues like memory saving and correctness.

-- 
If you're using GoogleCrap™ please read this 
http://wiki.python.org/moin/GoogleGroupsPython.

Mark Lawrence

#43971

From	Neil Hodgson <nhodgson@iinet.net.au>
Date	2013-04-21 09:03 +1000
Message-ID	<0vKdnXSde-E8gu7MnZ2dnUVZ_sidnZ2d@westnet.com.au>
In reply to	#43958

    Hi jmf,

> This gives me plenty of ideas to test the "flexible string
> representation" (FSR). I should recognize this FSR is failing
> particulary very well...

    This is too vague for me.

    Which string representation should Python use?
1) UTF-32
2) UTF-8
3) Python 3.3 -- 1, 2, or 4 bytes per character decided at runtime
4) Python 3.2 -- 2 or 4 bytes per character decided at Python build time
5) Something else
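
For a feel of the space trade-off between the fixed encodings in this list, here is a small sketch that counts bytes for a few sample strings. The samples and the encoded_sizes helper are made up for illustration. Options 3 and 4 are in-memory layouts rather than codecs, so UTF-16 stands in for the narrow-build case.

```python
samples = {
    "ASCII": "hello world",
    "French": "d\xe9j\xe0 vu",
    "Chinese": "\u4e2d\u6587\u5b57\u7b26",
    "astral": "\U00012345",
}


def encoded_sizes(s):
    """Bytes needed to store s under each encoding (no byte order mark)."""
    return {enc: len(s.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for label, text in samples.items():
    print(label, len(text), encoded_sizes(text))
```

UTF-32 always costs four bytes per code point, UTF-8 is smallest for ASCII-heavy text, and UTF-16 is smallest for the Chinese sample.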

    Neil

#43981

From	rusi <rustompmody@gmail.com>
Date	2013-04-20 18:37 -0700
Message-ID	<7be0ad7b-bc35-4ea2-aa16-b9af535b455e@ys5g2000pbc.googlegroups.com>
In reply to	#43971

On Apr 21, 4:03 am, Neil Hodgson <nhodg...@iinet.net.au> wrote:
>     Hi jmf,
>
> > This gives me plenty of ideas to test the "flexible string
> > representation" (FSR). I should recognize this FSR is failing
> > particulary very well...
>
>     This is too vague for me.
>
>     Which string representation should Python use?
> 1) UTF-32
> 2) UTF-8
> 3) Python 3.3 -- 1, 2, or 4 bytes per character decided at runtime
> 4) Python 3.2 -- 2 or 4 bytes per character decided at Python build time
> 5) Something else

jmf recommends UTF-8.

Apart from the fact that UTF-8 would be less (time) performant in all
cases, and far more so in cases like indexing, the fact that jmf
says so makes it more ridiculous.
According to jmf python sucks up to ASCII (those big bad Americans… of
whom Steven is the first…) whereas unicode is the true international/
universal standard.

I guess the irony is clear to all (except jmf) given that:
- it's unicode that sucks up to ASCII by carefully conforming in the
first 128 positions including the completely useless control chars;
python just implements the standard
- UTF-8 is an ASCII-biased unicode-compression method, viz. UTF-8 is
most space-efficient on ASCII at the cost of being generally
time-inefficient
- All jmf's beefs (as far as I remember) are variations on the theme:
"time-inefficiency is equivalent to non-unicode-compliant"

In short he manifests a dog-in-the-manger mindset:
"Since the whole world will never speak french (grief, mope, grumble,
thrash…) everyone should pay for the Chinese character set's size even
if they are monolingually English"
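
On the indexing point: with UTF-8 as the internal storage, finding the n-th code point means walking the bytes from the start, because characters have different widths. A minimal sketch, assuming well-formed UTF-8 input (utf8_char_at is a hypothetical helper, not anything CPython provides):

```python
def utf8_char_at(data, index):
    """Return the code point at position index in UTF-8 bytes.

    Scans from the beginning, so the cost grows with index.
    Assumes data is well-formed UTF-8.
    """
    pos = 0
    seen = 0
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:
            width = 1          # 0xxxxxxx: ASCII
        elif lead >> 5 == 0b110:
            width = 2          # 110xxxxx
        elif lead >> 4 == 0b1110:
            width = 3          # 1110xxxx
        elif lead >> 3 == 0b11110:
            width = 4          # 11110xxx
        else:
            raise ValueError("unexpected continuation byte at %d" % pos)
        if seen == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError("index out of range")


text = "a\xe9\u20ac\U00012345z"
data = text.encode("utf-8")
print([utf8_char_at(data, i) for i in range(len(text))])
```

A fixed-width layout can instead jump straight to offset index * width, which is why the 1, 2 or 4 byte scheme keeps indexing at constant time.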

All that said…

I believe that the recent correction in unicode performance followed
jmf's grumbles
(Mark please correct me if I am wrong)
So python community can be thankful to jmf even if he insists on
laboring under bizarre political hallucinations.

[Written from India where a monolingual person is as rare as a
palmtree on a polecap]

#43984

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-04-21 03:36 +0000
Message-ID	<51735ed9$0$29977$c3e8da3$5496439d@news.astraweb.com>
In reply to	#43981

On Sat, 20 Apr 2013 18:37:00 -0700, rusi wrote:

> According to jmf python sucks up to ASCII (those big bad Americans… of
> whom Steven is the first…) 

Watch who you're calling an American, mate.


-- 
Steven

#43985

From	Chris Angelico <rosuav@gmail.com>
Date	2013-04-21 13:42 +1000
Message-ID	<mailman.871.1366515768.3114.python-list@python.org>
In reply to	#43984

On Sun, Apr 21, 2013 at 1:36 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sat, 20 Apr 2013 18:37:00 -0700, rusi wrote:
>
>> According to jmf python sucks up to ASCII (those big bad Americans… of
>> whom Steven is the first…)
>
> Watch who you're calling an American, mate.

I think he knows, and that's why he said it. You and I are foremost
among Americans who are destroying Python.

ChrisA

#43994

From	Terry Jan Reedy <tjreedy@udel.edu>
Date	2013-04-21 05:02 -0400
Message-ID	<mailman.873.1366534952.3114.python-list@python.org>
In reply to	#43981

On 4/20/2013 9:37 PM, rusi wrote:

> I believe that the recent correction in unicode performance followed
> jmf's grumbles

No, the correction followed upon his accurate report of a regression, 
last August, which was unfortunately mixed in with grumbles and 
inaccurate claims. Others separated out and verified the accurate 
report. I reported it to pydev and enquired as to its necessity, I 
believe Mark opened the tracker issue, and the two people who worked on 
optimizing 3.3 a year ago fairly quickly came up with two different 
patches. The several month delay after was a matter of testing and 
picking the best approach.

#43997

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-04-21 13:03 +0100
Message-ID	<mailman.875.1366545845.3114.python-list@python.org>
In reply to	#43981

On 21/04/2013 10:02, Terry Jan Reedy wrote:
> On 4/20/2013 9:37 PM, rusi wrote:
>
>> I believe that the recent correction in unicode performance followed
>> jmf's grumbles
>
> No, the correction followed upon his accurate report of a regression,
> last August, which was unfortunately mixed in with grumbles and
> inaccurate claims. Others separated out and verified the accurate
> report. I reported it to pydev and enquired as to its necessity, I
> believe Mark opened the tracker issue, and the two people who worked on
> optimizing 3.3 a year ago fairly quickly came up with two different
> patches. The several month delay after was a matter of testing and
> picking the best approach.
>
>

I'd again like to point out that all I did was raise the issue.  It was 
based on data provided by Steven D'Aprano and confirmed by Serhiy Storchaka.

-- 
If you're using GoogleCrap™ please read this 
http://wiki.python.org/moin/GoogleGroupsPython.

Mark Lawrence

#43980

From	Ethan Furman <ethan@stoneleaf.us>
Date	2013-04-20 18:06 -0700
Message-ID	<mailman.870.1366507668.3114.python-list@python.org>
In reply to	#43958

On 04/20/2013 11:14 AM, Chris Angelico wrote:
> Flash forward to current date, and jmf has hijacked so many threads to
> moan about PEP 393 that I'm actually happy about this one, simply
> because he gave it a new subject line and one appropriate to a
> discussion about Unicode.

+1000

#43986

From	88888 Dihedral <dihedral88888@googlemail.com>
Date	2013-04-20 23:09 -0700
Message-ID	<2524ede0-5c8b-4f82-9fba-f3d6c31a7320@googlegroups.com>
In reply to	#43958

On Sunday, April 21, 2013 at 1:12:43 AM UTC+8, jmfauth wrote:
> In a previous post,
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
> ,
>
> Chris “Kwpolska” Warrick wrote:
>
> “Is Unicode support so hard, especially in the 21st century?”
>
> --
>
> Unicode is not really complicate and it works very well (more
> than two decades of development if you take into account
> iso-14****).
>
> But, - I can say, "as usual" - people prefer to spend their
> time to make a "better Unicode than Unicode" and it usually
> fails. Python does not escape to this rule.
>
> -----
>
> I'm "busy" with TeX (unicode engine variant), fonts and typography.
> This gives me plenty of ideas to test the "flexible string
> representation" (FSR). I should recognize this FSR is failing
> particulary very well...
>
> I can almost say, a delight.
>
> jmf
> Unicode lover

Supporting Unicode is easy at the language level.
But supporting Unicode on a platform involves
the OS and the display and input hardware devices,
which are not free most of the time.
