
Python Unicode handling wins again -- mostly

Started by	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post	2013-11-30 00:44 +0000
Last post	2013-12-04 14:38 +0000
Articles	20 on this page of 76 — 22 participants


  Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 00:44 +0000
    Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-30 01:07 +0000
    Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 21:08 -0500
      Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-11-30 13:12 +1100
        Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 21:28 -0500
          Re: Python Unicode handling wins again -- mostly Dave Angel <davea@davea.name> - 2013-11-29 22:06 -0500
      Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 04:21 +0000
        Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 23:30 -0500
        Re: Python Unicode handling wins again -- mostly Zero Piraeus <z@etiol.net> - 2013-11-30 02:05 -0300
          Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 06:25 +0000
        Re: Python Unicode handling wins again -- mostly Gene Heskett <gheskett@wdtv.com> - 2013-11-30 00:25 -0500
        Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-30 00:37 -0500
          Re: Python Unicode handling wins again -- mostly Ian Kelly <ian.g.kelly@gmail.com> - 2013-11-29 23:00 -0700
            Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 07:11 +0000
          Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 07:41 +0000
            Re: Python Unicode handling wins again -- mostly Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-01 11:41 +1300
      Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-30 08:07 +0000
      Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-11-30 11:11 -0800
        Re: Python Unicode handling wins again -- mostly Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-01 11:37 +1300
          Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-11-30 18:07 -0500
            Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-01 08:57 -0800
          Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-01 00:22 +0000
            Re: Python Unicode handling wins again -- mostly Tim Chase <python.list@tim.thechases.com> - 2013-11-30 18:52 -0600
              Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-01 00:54 +0000
                Re: Python Unicode handling wins again -- mostly Tim Chase <python.list@tim.thechases.com> - 2013-11-30 19:05 -0600
                Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-01 12:13 +1100
                  Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-30 20:27 -0500
                    Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-01 12:31 +1100
    Re: Python Unicode handling wins again -- mostly Serhiy Storchaka <storchaka@gmail.com> - 2013-12-01 20:00 +0200
      Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-01 12:15 -0800
        Re: Python Unicode handling wins again -- mostly Tim Delaney <timothy.c.delaney@gmail.com> - 2013-12-02 07:54 +1100
          Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-02 04:39 -0800
            Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 14:46 +0000
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 10:22 -0500
            Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 15:45 +0000
            Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-03 02:49 +1100
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 10:58 -0500
            Re: Python Unicode handling wins again -- mostly Terry Reedy <tjreedy@udel.edu> - 2013-12-02 15:26 -0500
            Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 20:45 +0000
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 16:44 -0500
            Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 13:25 -0800
            Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 22:04 +0000
              Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Roy Smith <roy@panix.com> - 2013-12-02 20:38 -0500
                Pythonista Goals  [was Re: Code of Conduct, Trolls, and Thankless Jobs] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 17:56 -0800
                Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Grant Edwards <invalid@invalid.invalid> - 2013-12-03 04:32 +0000
                  Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Steven D'Aprano <steve@pearwood.info> - 2013-12-03 05:41 +0000
                  Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:14 +0000
                Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:11 +0000
            Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:23 -0500
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:24 -0500
            Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 22:32 +0000
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:53 -0500
            Re: Code of Conduct, Trolls, and Thankless Jobs Ben Finney <ben+python@benfinney.id.au> - 2013-12-03 10:11 +1100
            Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 14:41 -0800
            Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Terry Reedy <tjreedy@udel.edu> - 2013-12-02 22:22 -0500
            Re: Code of Conduct, Trolls, and Thankless Jobs Terry Reedy <tjreedy@udel.edu> - 2013-12-02 22:39 -0500
            Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 20:11 -0800
        Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-01 22:06 +0000
        Re: Python Unicode handling wins again -- mostly Tim Delaney <timothy.c.delaney@gmail.com> - 2013-12-02 09:29 +1100
        Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-01 23:10 +0000
        Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-01 14:50 -0800
        Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 00:43 +0000
    Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 12:38 -0800
    Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 16:14 -0500
      Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve@pearwood.info> - 2013-12-03 05:06 +0000
        Re: Python Unicode handling wins again -- mostly joe <joeedh@gmail.com> - 2013-12-02 23:35 -0800
        Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-03 10:34 -0800
    Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-03 08:23 +1100
    Re: Python Unicode handling wins again -- mostly MRAB <python@mrabarnett.plus.com> - 2013-12-02 21:27 +0000
    Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 13:27 -0800
    Re: Python Unicode handling wins again -- mostly Ben Finney <ben+python@benfinney.id.au> - 2013-12-03 09:56 +1100
    Re: Python Unicode handling wins again -- mostly Neil Cerutti <neilc@norwich.edu> - 2013-12-03 13:47 +0000
    Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-03 06:26 -0800
      Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-04 05:52 -0800
        Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-04 14:07 +0000
        Re: Python Unicode handling wins again -- mostly Neil Cerutti <neilc@norwich.edu> - 2013-12-04 14:38 +0000


#60781 — Python Unicode handling wins again -- mostly

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-30 00:44 +0000
Subject	Python Unicode handling wins again -- mostly
Message-ID	<529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com>

There's a recent blog post complaining about the lousy support for 
Unicode text in most programming languages:

http://mortoray.com/2013/11/27/the-string-type-is-broken/

The author, Mortoray, gives nine basic tests to understand how well the 
string type in a language works. The first four involve "user-perceived 
characters", also known as grapheme clusters.


(1) Does the decomposed string "noe\u0308l" print correctly? Notice that 
the accented letter ë has been decomposed into a pair of code points, 
U+0065 (LATIN SMALL LETTER E) and U+0308 (COMBINING DIAERESIS).

Python 3.3 passes this test:

py> print("noe\u0308l")
noël

although I expect that depends on the terminal you are running in.


(2) If you reverse that string, does it give "lëon"? The implication of 
this question is that strings should operate on grapheme clusters rather 
than code points. Python fails this test:

py> print("noe\u0308l"[::-1])
l̈eon

Some terminals may display the umlaut over the l, or following the l.

I'm not completely sure it is fair to expect a string type to operate on 
grapheme clusters (collections of decomposed characters) as the author 
expects. I think that is going above and beyond what a basic string type 
should be expected to do. That said, I would expect a solid Unicode 
implementation to include support for grapheme clusters, and in that 
regard Python is lacking functionality.


(3) What are the first three characters? The author suggests that the 
answer should be "noë", in which case Python fails again:

py> print("noe\u0308l"[:3])
noe

but again I'm not convinced that slicing should operate across decomposed 
strings in this way. Surely the point of decomposing the string like that 
is in order to count the base character e and the accent "\u0308" 
separately?


(4) Likewise, what is the length of the decomposed string? The author 
expects 4, but Python gives 5:

py> len("noe\u0308l")
5

So far, Python passes only one of the four tests, but I'm not convinced 
that the three failed tests are fair for a string type. If strings 
operated on grapheme clusters, these would be good tests, but it is not a 
given that strings should.
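
For anyone who wants to see what grapheme-aware behaviour for tests 2 to 4 might look like, here is a rough sketch. The helper name `graphemes` is made up for illustration, and it assumes a cluster is simply a base code point followed by any combining marks, which is far less than the full UAX #29 segmentation rules (no Hangul jamo, ZWJ sequences or regional indicators).

```python
import unicodedata

def graphemes(text):
    """Split text into approximate grapheme clusters.

    Assumption: a cluster is one base code point plus any following
    combining marks (unicodedata.combining(ch) != 0). This is only an
    approximation of the UAX #29 rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

s = "noe\u0308l"
g = graphemes(s)
print(len(g))                 # 4 clusters, although len(s) is 5
print("".join(g[:3]))         # first three user-perceived characters
print("".join(reversed(g)))   # reversed, with the diaeresis still on the e
```

The cost of this approach is that indexing is no longer O(1) on the original string; you pay for the split up front.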

The next few tests have to do with characters in the Supplementary 
Multilingual Planes, and this is where Python 3.3 shines. (In older 
versions, wide builds would also pass, but narrow builds would fail.)

(5) What is the length of "😸😾"?

Both characters U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E 
(POUTING CAT FACE) are outside the Basic Multilingual Plane, which means 
they require more than two bytes each. Most programming languages using 
UTF-16 encodings internally (including Javascript and Java) fail this 
test. Python 3.3 passes:

py> s = '😸😾'
py> len(s)
2

(Older versions of Python distinguished between *narrow builds*, which 
used UTF-16 internally and *wide builds*, which used UTF-32. Narrow 
builds would also fail this test.)

This makes Python one of a very few programming languages which can 
easily handle so-called "astral characters" from the Supplementary 
Multilingual Planes while still having O(1) indexing operations.
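
To make the difference concrete, the following sketch compares the code point count with the number of UTF-16 code units, which is what a UTF-16 based language would typically report as the length. It assumes Python 3.3 or later; `utf16_length` is a name invented for this example.

```python
def utf16_length(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

s = "\U0001F638\U0001F63E"      # the two cat faces from test 5
print(len(s))                   # 2 code points
print(utf16_length(s))          # 4 code units (two surrogate pairs)
print(len(s.encode("utf-8")))   # 8 bytes in UTF-8
```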


(6) What is the substring after the first character? The right answer is 
a single character POUTING CAT FACE, and Python gets that correct:

py> import unicodedata
py> unicodedata.name(s[1:])
'POUTING CAT FACE'

UTF-16 languages invariably end up with broken, invalid strings 
containing half of a surrogate pair.


(7) What is the reverse of the string? 

Python passes this test too:

py> print(s[::-1])
😾😸
py> for c in s[::-1]:
...     unicodedata.name(c)
...
'POUTING CAT FACE'
'GRINNING CAT FACE WITH SMILING EYES'

UTF-16 based languages typically break, again getting invalid strings 
containing surrogate pairs in the wrong order.
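
The failure can be simulated from Python by reversing the UTF-16 code units by hand, which is roughly what a naive reverse does in a language whose strings are arrays of 16-bit units. This is only an illustration (the function name is hypothetical), not a claim about how any particular language implements reversal.

```python
def naive_utf16_reverse(text):
    """Reverse text one 16-bit code unit at a time; returns raw UTF-16-LE bytes."""
    data = text.encode("utf-16-le")
    units = [data[i:i + 2] for i in range(0, len(data), 2)]
    return b"".join(reversed(units))

s = "\U0001F638\U0001F63E"
broken = naive_utf16_reverse(s)

try:
    broken.decode("utf-16-le")
    valid = True
except UnicodeDecodeError:
    # Each surrogate pair is now low-then-high, which is not valid UTF-16.
    valid = False

print(valid)                                  # False
print(s[::-1] == "\U0001F63E\U0001F638")      # True: code point reversal is fine
```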


The next test involves ligatures. Ligatures are pairs, or triples, of 
characters which have been moved closer together in order to look better. 
Normally you would expect the type-setter to handle ligatures by 
adjusting the spacing between characters, but there are a few pairs (such 
as "fi" <=> "ﬁ") where type designers provided them as custom-designed 
single characters, and Unicode includes them as legacy characters.

(8) What's the uppercase of "baffle" spelled with an ffl ligature?

Like most other languages, Python 3.2 fails:

py> 'baﬄe'.upper()
'BAﬄE'

but Python 3.3 passes:

py> 'baﬄe'.upper()
'BAFFLE'


Lastly, Mortoray returns to noël, and compares the composed and 
decomposed versions of the string:

(9) Does "noël" equal "noe\u0308l"?

Python (correctly, in my opinion) reports that they do not:

py> "noël" == "noe\u0308l"
False

Again, one might argue whether a string type should report these as equal 
or not, but I believe Python is doing the right thing here. As the author 
points out, any decent Unicode-aware language should at least offer the 
ability to convert between normalisation forms, and Python passes this 
test:

py> unicodedata.normalize("NFD", "noël") == "noe\u0308l"
True
py> "noël" == unicodedata.normalize("NFC", "noe\u0308l")
True
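
If you want equality that ignores the composed/decomposed distinction, one option is to normalise both sides before comparing, or before using strings as dictionary keys. A minimal sketch, where `canonically_equal` is a name invented here and the choice of NFC over NFD is arbitrary:

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "no\u00ebl"       # precomposed ë
decomposed = "noe\u0308l"    # e followed by COMBINING DIAERESIS

print(composed == decomposed)                    # False
print(canonically_equal(composed, decomposed))   # True

# Normalising keys stops the two spellings from becoming separate entries.
counts = {}
for word in (composed, decomposed):
    key = unicodedata.normalize("NFC", word)
    counts[key] = counts.get(key, 0) + 1
print(len(counts))                               # 1
```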


Out of the nine tests, Python 3.3 passes six, with three tests being 
failures or dubious. If you believe that the native string type should 
operate on code-points, then you'll think that Python does the right 
thing. If you think it should operate on grapheme clusters, as the author 
of the blog post does, then you'll think Python fails those three tests.


A call to arms
==============

As the Unicode Consortium itself acknowledges, sometimes you want to 
operate on an array of code points, and sometimes on an array of 
graphemes ("user-perceived characters"). Python 3.3 is now halfway there, 
having excellent support for code-points across the entire Unicode 
character set, not just the BMP.

The next step is to provide either a data type, or a library, for working 
on grapheme clusters. The Unicode Consortium provides a detailed 
discussion of this issue here:

http://www.unicode.org/reports/tr29/

If anyone is looking for a meaty project to work on, providing support 
for grapheme clusters could be it. And if not, hopefully you've learned 
something about Unicode and the limitations of Python's Unicode support.


-- 
Steven


#60787

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-11-30 01:07 +0000
Message-ID	<mailman.3414.1385773668.18130.python-list@python.org>
In reply to	#60781

On 30/11/2013 00:44, Steven D'Aprano wrote:
>
> (5) What is the length of "😸😾"?
>
> Both characters U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E
> (POUTING CAT FACE) are outside the Basic Multilingual Plane, which means
> they require more than two bytes each. Most programming languages using
> UTF-16 encodings internally (including Javascript and Java) fail this
> test. Python 3.3 passes:
>
> py> s = '😸😾'
> py> len(s)
> 2
>

I couldn't care less if it passes, it's too slow and uses too much 
memory[1], so please get the completely bug ridden Python 2 unicode 
implementation restored at the earliest possible opportunity :)

[1]because I say so although I don't actually have any evidence to 
support my case. :) :)

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence


#60790

From	Roy Smith <roy@panix.com>
Date	2013-11-29 21:08 -0500
Message-ID	<roy-89763C.21084929112013@news.panix.com>
In reply to	#60781

In article <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
> 
> Like most other languages, Python 3.2 fails:
> 
> py> 'baffle'.upper()
> 'BAfflE'
> 
> but Python 3.3 passes:
> 
> py> 'baffle'.upper()
> 'BAFFLE'

I disagree.

The whole idea of ligatures like fi is purely typographic.  The crossbar 
on the "f" (at least in some fonts) runs into the dot on the "i".  
Likewise, the top curl on an "f" runs into the serif on top of the "l" 
(and similarly for ffl).

There is no such thing as a "FFL" ligature, because the upper case 
letterforms don't run into each other like the lower case ones do.  
Thus, I would argue that it's wrong to say that calling upper() on an 
ffl ligature should yield FFL.

I would certainly expect, x.lower() == x.upper().lower(), to be True for 
all values of x over the set of valid unicode codepoints.  Having 
u"\uFB04".upper() ==> "FFL" breaks that.  I would also expect len(x) == 
len(x.upper()) to be True.


#60791

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-30 13:12 +1100
Message-ID	<mailman.3417.1385777557.18130.python-list@python.org>
In reply to	#60790

On Sat, Nov 30, 2013 at 1:08 PM, Roy Smith <roy@panix.com> wrote:
> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints.  Having
> u"\uFB04".upper() ==> "FFL" breaks that.  I would also expect len(x) ==
> len(x.upper()) to be True.

That's a nice theory, but the Unicode consortium disagrees with you on
both points.

ChrisA


#60792

From	Roy Smith <roy@panix.com>
Date	2013-11-29 21:28 -0500
Message-ID	<roy-F25011.21284729112013@news.panix.com>
In reply to	#60791

In article <mailman.3417.1385777557.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> On Sat, Nov 30, 2013 at 1:08 PM, Roy Smith <roy@panix.com> wrote:
> > I would certainly expect, x.lower() == x.upper().lower(), to be True for
> > all values of x over the set of valid unicode codepoints.  Having
> > u"\uFB04".upper() ==> "FFL" breaks that.  I would also expect len(x) ==
> > len(x.upper()) to be True.
> 
> That's a nice theory, but the Unicode consortium disagrees with you on
> both points.
> 
> ChrisA

Harumph.


#60793

From	Dave Angel <davea@davea.name>
Date	2013-11-29 22:06 -0500
Message-ID	<mailman.3418.1385780739.18130.python-list@python.org>
In reply to	#60792

On Fri, 29 Nov 2013 21:28:47 -0500, Roy Smith <roy@panix.com> wrote:
> In article <mailman.3417.1385777557.18130.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
> > On Sat, Nov 30, 2013 at 1:08 PM, Roy Smith <roy@panix.com> wrote:
> > > I would certainly expect, x.lower() == x.upper().lower(), to be True for
> > > all values of x over the set of valid unicode codepoints.  Having
> > > u"\uFB04".upper() ==> "FFL" breaks that.  I would also expect len(x) ==
> > > len(x.upper()) to be True.

> > That's a nice theory, but the Unicode consortium disagrees with you on
> > both points.

And they were already false long before Unicode.  I don’t know 
specifics but there are many cases where there are no uppercase 
equivalents for a particular lowercase character.  And others where 
the uppercase equivalent takes multiple characters.

-- 
DaveA


#60794

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-30 04:21 +0000
Message-ID	<529967dc$0$29993$c3e8da3$5496439d@news.astraweb.com>
In reply to	#60790

On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:

> In article <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> 
>> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>> 
>> Like most other languages, Python 3.2 fails:
>> 
>> py> 'baffle'.upper()
>> 'BAfflE'

You edited my text to remove the ligature? That's... unfortunate.

>> but Python 3.3 passes:
>> 
>> py> 'baffle'.upper()
>> 'BAFFLE'
> 
> I disagree.
> 
> The whole idea of ligatures like fi is purely typographic.

In English, that's correct. I'm not sure if we can generalise that to all 
languages that have ligatures. It also partly depends on how you define 
ligatures. For example, would you consider that ampersand & to be a 
ligature? These days, I would consider & to be a distinct character, but 
originally it began as a ligature for "et" (Latin for "and").

But let's skip such corner cases, as they provide much heat but no 
illumination, and I'll agree that when it comes to ligatures like fl, fi 
and ffl, they are purely typographic.

> The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" runs into the serif on top of the "l"
> (and similarly for ffl).
> 
> There is no such thing as a "FFL" ligature, because the upper case
> letterforms don't run into each other like the lower case ones do. Thus,
> I would argue that it's wrong to say that calling upper() on an ffl
> ligature should yield FFL.

Your conclusion doesn't follow from the argument you are making. Since 
the ffl ligature ﬄ is purely a typographical feature, then it should 
uppercase to FFL (there being no typographic feature for uppercase FFL 
ligature).

Consider the examples shown above, where you or your software 
unfortunately edited out the ligature and replaced it with ASCII "ffl". 
Or perhaps I should say *fortunately*, since it demonstrates the problem.

Since we agree that the ﬄ ligature is merely a typographic artifact of 
some type-designer's whimsy, we can expect that the word "baﬄe" is 
semantically exactly the same as the word "baffle". How foolish Python 
would look if it did this:

py> 'baffle'.upper()
'BAfflE'

Replace the 'ffl' with the ligature, and the conclusion remains:

py> 'baﬄe'.upper()
'BAﬄE'

would be equally wrong.

Now, I accept that this picture isn't entirely black and white. For 
example, we might argue that if ﬄ is purely typographical in nature, 
surely we would also want 'baffle' == 'baﬄe' too? Or maybe not. This 
indicates that capturing *all* the rules for text across the many 
languages, writing systems and conventions is impossible.

There are some circumstances where we would want 'baffle' and 'baﬄe' to 
compare equal, and others where we would want them to compare unequal. 
Python gives us both:

py> "baffle" == "baﬄe"
False
py> "baffle" == unicodedata.normalize("NFKC", "baﬄe")
True

but frankly I'm baffled *wink* that you think there are any circumstances 
where you would want the uppercase of ﬄ to be anything but FFL.
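
For the "compare equal" case, a common trick is to fold away both compatibility differences and case before comparing. The sketch below is a simplification of Unicode's full caseless-matching recipe, `loosely_equal` is a name made up for this example, and `str.casefold` needs Python 3.3 or later.

```python
import unicodedata

def loosely_equal(a, b):
    """Compare strings ignoring compatibility forms (NFKC) and case (casefold)."""
    def fold(s):
        return unicodedata.normalize("NFKC", s).casefold()
    return fold(a) == fold(b)

ligature = "ba\ufb04e"       # "baffle" spelled with LATIN SMALL LIGATURE FFL

print("baffle" == ligature)                 # False: different code points
print(loosely_equal("baffle", ligature))    # True
print(loosely_equal("BAFFLE", ligature))    # True
```

Plain `==` stays strict, so you choose per comparison which behaviour you want.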

> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints.

You would expect wrongly. You are over-generalising from English, and if 
you include ligatures and other special cases, not even all of English.

See, for example:

http://www.unicode.org/faq/casemap_charprop.html#7a

Apart from ligatures, some examples of troublesome characters with regard 
to case are:

* German Eszett (sharp-S) ß can be uppercased to SS, SZ or ẞ depending 
  on context, particularly when dealing with placenames and family names.

  (That last character, LATIN CAPITAL LETTER SHARP S, goes back to at
  least the 1930s, although the official rules of German orthography
  still insist on uppercasing ß to SS.)

* The English long-s ſ is uppercased to regular S.

* Turkish dotted and dotless I (İ and i, I and ı) use the same Latin
  letters I and i but the case conversion rules are different.

* Both the Greek sigma σ and final sigma ς uppercase to Σ.

That last one is especially interesting: Python 3.3 gets it right, while 
older Pythons do not. In Python 3.2:

py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύσ (Odysseus)'

while in 3.3 it roundtrips correctly:

py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύς (Odysseus)'

So... case conversions are not as simple as they appear at first glance. 
They aren't always reversible, nor do they always roundtrip. Titlecase is 
not necessarily the same as "uppercase the first letter and lowercase the 
rest". Case conversions can be context or locale sensitive.
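
These claims are easy to check from the interpreter. The sketch below assumes Python 3.3 or later (which applies the full Unicode case mappings); it tests a few of the characters mentioned above and then scans every code point for uppercase forms longer than one character. The exact result of the scan depends on the Unicode version your Python ships with, so no count is given.

```python
# ß, ﬄ, long s, final sigma
examples = ["\u00df", "\ufb04", "\u017f", "\u03c2"]

for ch in examples:
    up = ch.upper()
    # Neither the length nor the round trip is preserved for these characters.
    print(ascii(ch), "->", ascii(up), "| roundtrips:", up.lower() == ch)

# Every code point whose uppercase form is more than one code point long.
expanding = []
for cp in range(0x110000):
    ch = chr(cp)
    if len(ch.upper()) > 1:
        expanding.append(ch)

print(len(expanding), "code points grow when uppercased")
```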

Anyway... even if you disagree with everything I have said, it is a fact 
that Python has committed to following the Unicode standard, and the 
Unicode standard requires that certain ligatures, including FFL, FL and 
FI, are decomposed when converted to uppercase.

-- 
Steven


#60795

From	Roy Smith <roy@panix.com>
Date	2013-11-29 23:30 -0500
Message-ID	<roy-BA1876.23302229112013@news.panix.com>
In reply to	#60794

In article <529967dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> You edited my text to remove the ligature? That's... unfortunate.

It was un-ligated by the time it reached me.


#60796

From	Zero Piraeus <z@etiol.net>
Date	2013-11-30 02:05 -0300
Message-ID	<mailman.3419.1385787973.18130.python-list@python.org>
In reply to	#60794

:

On Sat, Nov 30, 2013 at 04:21:49AM +0000, Steven D'Aprano wrote:
> On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:
> > The whole idea of ligatures like fi is purely typographic.
> 
> In English, that's correct. I'm not sure if we can generalise that to
> all languages that have ligatures. It also partly depends on how you
> define ligatures. For example, would you consider that ampersand & to
> be a ligature? These days, I would consider & to be a distinct
> character, but originally it began as a ligature for "et" (Latin for
> "and").
> 
> But let's skip such corner cases, as they provide much heat but no 
> illumination, [...]

In the interest of warmth (I know it's winter in some parts of the
world) ...

As I understand it, "&" has always been used to replace the word "et"
specifically, rather than the letter-pair e,t (no-one has ever written
"k&tle" other than ironically), which makes it a logogram rather than a
ligature (like "@").

(I happen to think the presence of ligatures in Unicode is insane, but
my dictator-of-the-world certificate appears to have gotten lost in the
post, so fixing that will have to wait).

 -[]z.

-- 
Zero Piraeus: inter caetera
http://etiol.net/pubkey.asc


#60801

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-30 06:25 +0000
Message-ID	<529984ed$0$29993$c3e8da3$5496439d@news.astraweb.com>
In reply to	#60796

On Sat, 30 Nov 2013 02:05:59 -0300, Zero Piraeus wrote:

> (I happen to think the presence of ligatures in Unicode is insane, but
> my dictator-of-the-world certificate appears to have gotten lost in the
> post, so fixing that will have to wait).

You're probably right, but we live in an insane world of dozens of insane 
legacy encodings, and part of the mission of Unicode is to include every 
single character that those legacy encodings did. Since some of them 
included ligatures, so must Unicode. Sad but true.

(Unicode is intended as a replacement for the insanity of dozens of 
multiply incompatible character sets. It cannot hope to replace them if 
it cannot represent every distinct character they represent.)

-- 
Steven


#60797

From	Gene Heskett <gheskett@wdtv.com>
Date	2013-11-30 00:25 -0500
Message-ID	<mailman.3420.1385789161.18130.python-list@python.org>
In reply to	#60794

On Saturday 30 November 2013 00:23:22 Zero Piraeus did opine:

> On Sat, Nov 30, 2013 at 04:21:49AM +0000, Steven D'Aprano wrote:
> > On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:
> > > The whole idea of ligatures like fi is purely typographic.
> > 
> > In English, that's correct. I'm not sure if we can generalise that to
> > all languages that have ligatures. It also partly depends on how you
> > define ligatures. For example, would you consider that ampersand & to
> > be a ligature? These days, I would consider & to be a distinct
> > character, but originally it began as a ligature for "et" (Latin for
> > "and").
> > 
> > But let's skip such corner cases, as they provide much heat but no
> > illumination, [...]
> 
> In the interest of warmth (I know it's winter in some parts of the
> world) ...
> 
> As I understand it, "&" has always been used to replace the word "et"
> specifically, rather than the letter-pair e,t (no-one has ever written
> "k&tle" other than ironically), which makes it a logogram rather than a
> ligature (like "@").

Whereas in these here parts, the "&" has always been read as a single 
character shortcut for the word "and".
> 
> (I happen to think the presence of ligatures in Unicode is insane, but
> my dictator-of-the-world certificate appears to have gotten lost in the
> post, so fixing that will have to wait).
> 
>  -[]z.


Cheers, Gene
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

"I remember when I was a kid I used to come home from Sunday School and
 my mother would get drunk and try to make pancakes."
		-- George Carlin
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
         law-abiding citizens.


#60798

From	Roy Smith <roy@panix.com>
Date	2013-11-30 00:37 -0500
Message-ID	<roy-0E67F5.00371730112013@news.panix.com>
In reply to	#60794

In article <529967dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> > The whole idea of ligatures like fi is purely typographic.
> 
> In English, that's correct. I'm not sure if we can generalise that to all 
> languages that have ligatures. It also partly depends on how you define 
> ligatures.

I was speaking specifically of "ligatures like fi" (or, if you prefer, 
"ligatures like ό".  By which I mean those things printers invented 
because some letter combinations look funny when typeset as two distinct 
letters.

There are other kinds of ligatures.  For example, œ is a diphthong.  It 
makes sense (well, to me, anyway) that upper case œ is Έ.

Well, anyway, that's the truth according to me.  Apparently the Unicode 
Consortium disagrees.  So, who am I to argue with the people who decided 
that I needed to be able to type a "PILE OF POO" character.  Which, by 
the way, I can find in my "Character Viewer" input helper, but which MT 
Newswatcher doesn't appear to be willing to insert into text.  I guess 
Basic Multilingual Poo would have been OK but Astral Poo is too much for 
it.
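
The "Basic Multilingual" versus "Astral" distinction can be checked from Python. This is only a minimal sketch, assuming Python 3.3 or later; the variable names are illustrative, and it says nothing about why any particular newsreader refuses the character.

```python
import unicodedata

poo = "\N{PILE OF POO}"

# The code point lies above U+FFFF, i.e. outside the Basic Multilingual Plane.
print(hex(ord(poo)), unicodedata.name(poo))
is_astral = ord(poo) > 0xFFFF

# One code point, but four bytes in UTF-8 and two 16-bit units
# (a surrogate pair) in UTF-16.
utf8_len = len(poo.encode("utf-8"))
utf16_units = len(poo.encode("utf-16-le")) // 2
print(is_astral, len(poo), utf8_len, utf16_units)
```

Software that stores text as single 16-bit units has to handle the surrogate pair explicitly, which is one common place where astral characters go wrong.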

[toc] | [prev] | [next] | [standalone]

#60800

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2013-11-29 23:00 -0700
Message-ID	<mailman.3421.1385791271.18130.python-list@python.org>
In reply to	#60798

On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <roy@panix.com> wrote:
> I was speaking specifically of "ligatures like fi" (or, if you prefer,
> "ligatures like ό".  By which I mean those things printers invented
> because some letter combinations look funny when typeset as two distinct
> letters.

I think the encoding of your email is incorrect, because GREEK SMALL
LETTER OMICRON WITH TONOS is not a ligature.

> There are other kinds of ligatures.  For example, oe is a diphthong.  It
> makes sense (well, to me, anyway) that upper case oe is Έ.

As above. I can't fathom why it would make sense for the upper case of
LATIN SMALL LIGATURE OE to be GREEK CAPITAL LETTER EPSILON WITH TONOS.
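
For reference, the upper case that Python itself reports for the oe ligature can be inspected as below. A small sketch, assuming any Python 3; the variable names are made up for the example.

```python
import unicodedata

small_oe = "\N{LATIN SMALL LIGATURE OE}"
upper_oe = small_oe.upper()

# Print the official name of the uppercase form.
print(unicodedata.name(upper_oe))

# Compare it with the character that showed up in the post.
epsilon = "\N{GREEK CAPITAL LETTER EPSILON WITH TONOS}"
print(upper_oe == epsilon)
```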

[toc] | [prev] | [next] | [standalone]

#60802

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-30 07:11 +0000
Message-ID	<52998fbf$0$29993$c3e8da3$5496439d@news.astraweb.com>
In reply to	#60800

On Fri, 29 Nov 2013 23:00:27 -0700, Ian Kelly wrote:

> On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <roy@panix.com> wrote:
>> I was speaking specifically of "ligatures like fi" (or, if you prefer,
>> "ligatures like ό".  By which I mean those things printers invented
>> because some letter combinations look funny when typeset as two
>> distinct letters.
> 
> I think the encoding of your email is incorrect, because GREEK SMALL
> LETTER OMICRON WITH TONOS is not a ligature.

Roy's post, which is sent via Usenet not email, doesn't have an encoding 
set. Since he's sending from a Mac, his software may believe that the 
entire universe understands the Mac Roman encoding, which makes a certain 
amount of sense since if I recall correctly the fi and fl ligatures 
originally appeared in early Mac fonts. 

I'm going to give Roy the benefit of the doubt and assume he actually 
entered the fi ligature at his end. If his software was using Mac Roman, 
it would insert a single byte DE into the message:

py> '\N{LATIN SMALL LIGATURE FI}'.encode('macroman')
b'\xde'

But that's not what his post includes. The message actually includes two 
bytes CF8C, in other words:

'\N{LATIN SMALL LIGATURE FI}'.encode('who the hell knows')
=> b'\xCF\x8C'

Since nearly all of his post is in single bytes, it's some variable-width 
encoding, but not UTF-8.

With no encoding set, our newsreader software starts off assuming that 
the post uses UTF-8 ('cos that's the only sensible default), and those 
two bytes happen to decode to ό GREEK SMALL LETTER OMICRON WITH TONOS.
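
The byte values in this explanation can be reproduced as follows. A minimal sketch, assuming Python 3; "received" simply holds the two bytes quoted above, and the other names are illustrative.

```python
import unicodedata

fi = "\N{LATIN SMALL LIGATURE FI}"

mac_bytes = fi.encode("mac_roman")   # a single byte in Mac Roman
utf8_bytes = fi.encode("utf-8")      # three bytes in UTF-8

received = b"\xcf\x8c"               # the two bytes found in the post
as_utf8 = received.decode("utf-8")   # what a UTF-8 reader makes of them

print(mac_bytes, utf8_bytes)
print(unicodedata.name(as_utf8))
```

Neither encoding of the fi ligature produces the two bytes that arrived, which is why the sender's actual encoding stays a mystery.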

I'm not surprised that Roy has a somewhat jaundiced view of Unicode, when 
the tools he uses are apparently so broken. But it isn't Unicode's fault, 
it's the tools.

The really bizarre thing is that apparently Roy's software, MT-
NewsWatcher, knows enough Unicode to normalise ﬄ LATIN SMALL LIGATURE FFL 
(sent in UTF-8 and therefore appearing as bytes b'\xef\xac\x84') to the 
ASCII letters "ffl". That's astonishingly weird.

That is really a bizarre error. I suppose it is not entirely impossible 
that the software is actually being clever rather than dumb. Having 
correctly decoded the UTF-8 bytes, perhaps it realised that there was no 
glyph for the ligature, and rather than display a MISSING CHAR glyph 
(usually one of those empty boxes you sometimes see), it normalized it to 
ASCII. But if it's that clever, why the hell doesn't it set an encoding 
line in posts?????
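
One standard mechanism that turns the ligature into plain letters is Unicode compatibility normalisation. Whether MT-NewsWatcher does anything of the sort is pure assumption; the sketch below (Python 3, illustrative names) only shows what that mechanism produces.

```python
import unicodedata

ffl = "\N{LATIN SMALL LIGATURE FFL}"
print(ffl.encode("utf-8"))

# Canonical normalisation (NFC) leaves the ligature alone;
# compatibility normalisation (NFKC) replaces it with the letters f, f, l.
nfc = unicodedata.normalize("NFC", ffl)
nfkc = unicodedata.normalize("NFKC", ffl)
print(nfc == ffl, nfkc)
```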

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#60803

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-30 07:41 +0000
Message-ID	<5299969b$0$29993$c3e8da3$5496439d@news.astraweb.com>
In reply to	#60798

On Sat, 30 Nov 2013 00:37:17 -0500, Roy Smith wrote:

> So, who am I to argue with the people who decided that I needed to be
> able to type a "PILE OF POO" character.

Blame the Japanese for that. Apparently some of the biggest users of 
Unicode are the various Japanese mobile phone manufacturers, TV stations, 
map makers and similar. So there's a large number of symbols and emoji 
(emoticons) specifically added for them, presumably because they pay big 
dollars to the Unicode Consortium and therefore get a lot of influence in 
what gets added.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#60810

From	Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date	2013-12-01 11:41 +1300
Message-ID	<bfv7sqFpr2mU1@mid.individual.net>
In reply to	#60803

Steven D'Aprano wrote:
> On Sat, 30 Nov 2013 00:37:17 -0500, Roy Smith wrote:
> 
>>So, who am I to argue with the people who decided that I needed to be
>>able to type a "PILE OF POO" character.
> 
> Blame the Japanese for that.  Apparently some of the biggest users of
> Unicode are the various Japanese mobile phone manufacturers, TV stations, 
> map makers and similar.

Also there's apparently a pun in Japanese involving the
words for 'poo' and 'luck'. So putting a poo symbol in
your text message means 'good luck'. Given that, it's
not *quite* as silly as it seems.

-- 
Best of poo,
Greg

[toc] | [prev] | [next] | [standalone]

#60804

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-11-30 08:07 +0000
Message-ID	<mailman.3422.1385798876.18130.python-list@python.org>
In reply to	#60790

On 30/11/2013 02:08, Roy Smith wrote:
> In article <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
>   Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
>> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>>
>> Like most other languages, Python 3.2 fails:
>>
>> py> 'baffle'.upper()
>> 'BAfflE'
>>
>> but Python 3.3 passes:
>>
>> py> 'baffle'.upper()
>> 'BAFFLE'
>
> I disagree.
>
> The whole idea of ligatures like fi is purely typographic.  The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" run into the serif on top of the "l"
> (and similarly for ffl).
>
> There is no such thing as a "FFL" ligature, because the upper case
> letterforms don't run into each other like the lower case ones do.
> Thus, I would argue that it's wrong to say that calling upper() on an
> ffl ligature should yield FFL.
>
> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints.  Having
> u"\uFB04".upper() ==> "FFL" breaks that.  I would also expect len(x) ==
> len(x.upper()) to be True.
>

http://bugs.python.org/issue19819 talks about these beasties.  Please 
don't come back to me as I haven't got a clue!!!
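
For anyone who wants to see the two invariants from the quoted post fail, here is a minimal sketch, assuming Python 3.3 or later (where upper() uses the full case mappings). The variable names are illustrative.

```python
ffl = "\N{LATIN SMALL LIGATURE FFL}"   # the same character as u"\uFB04"

upper = ffl.upper()
round_trip = upper.lower()

# Uppercasing expands one code point into three.
print(repr(upper), len(ffl), len(upper))

# lower() leaves the ligature untouched, so the round trip differs from it.
print(ffl.lower() == round_trip)

# casefold() is the tool intended for caseless comparison.
print(ffl.casefold() == upper.casefold())
```

So x.lower() == x.upper().lower() does not hold for this character, while comparing the casefold() results does.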

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

[toc] | [prev] | [next] | [standalone]

#60808

From	wxjmfauth@gmail.com
Date	2013-11-30 11:11 -0800
Message-ID	<b26b96f2-8415-4e58-8c47-243e4605b184@googlegroups.com>
In reply to	#60790

On Saturday, 30 November 2013 03:08:49 UTC+1, Roy Smith wrote:
> The whole idea of ligatures like fi is purely typographic.  The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" run into the serif on top of the "l"
> (and similarly for ffl).

And do you know the origin of this typographical feature?
Because, mechanically, the dot of the "i" broke too often.

I can't prove that's true, but I have read this many times in
the literature on typography and on Unicode.

In my opinion, a very plausible explanation.

jmf

[toc] | [prev] | [next] | [standalone]

#60809

From	Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date	2013-12-01 11:37 +1300
Message-ID	<bfv7lbFppgnU1@mid.individual.net>
In reply to	#60808

wxjmfauth@gmail.com wrote:
> And do you know the origin of this typographical feature?
> Because, mechanically, the dot of the "i" broke too often.
> 
> In my opinion, a very plausible explanation.

It doesn't sound very plausible to me, because there
are a lot more stand-alone 'i's in English text than
there are ones following an f. What is there to stop
them from breaking?

It's more likely to be simply a kerning issue. You
want to get the stems of the f and the i close together,
and the only practical way to do that with mechanical
type is to merge them into one piece of metal.

Which makes it even sillier to have an 'ffi' character
in this day and age, when you can simply space the
characters so that they overlap.

-- 
Greg

[toc] | [prev] | [next] | [standalone]

#60812

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-11-30 18:07 -0500
Message-ID	<mailman.3427.1385852868.18130.python-list@python.org>
In reply to	#60809

On 11/30/13 5:37 PM, Gregory Ewing wrote:
> wxjmfauth@gmail.com wrote:
>> And do you know the origin of this typographical feature?
>> Because, mechanically, the dot of the "i" broke too often.
>>
>> In my opinion, a very plausible explanation.
>
> It doesn't sound very plausible to me, because there
> are a lot more stand-alone 'i's in English text than
> there are ones following an f. What is there to stop
> them from breaking?
>
> It's more likely to be simply a kerning issue. You
> want to get the stems of the f and the i close together,
> and the only practical way to do that with mechanical
> type is to merge them into one piece of metal.
>
> Which makes it even sillier to have an 'ffi' character
> in this day and age, when you can simply space the
> characters so that they overlap.
>

The fi ligature was created because visually, an f and i wouldn't work 
well together: the crossbar of the f was near, but not connected to the 
serif of the i, and the terminal bulb of the f was close to, but not 
coincident with, the dot of the i.

This article goes into great detail, and has a good illustration of how 
an f and i can clash, and how an fi ligature can fix the problem: 
http://opentype.info/blog/2012/11/20/whats-a-ligature/ . Note the second 
fi illustration, which demonstrates using a ligature to make the letters 
appear *less* connected than they would individually!

This is also why "simply spacing the characters" isn't a solution: a 
specially designed ligature looks better than a separate f and i, no 
matter how minutely kerned.

It's unfortunate that Unicode includes presentation alternatives like 
the fi (and ff, fl, ffi, and ffl) ligatures.  It was done to be a 
superset of existing encodings.

Many typefaces have other non-encoded ligatures as well, especially 
display faces, which also have alternate glyphs.  Unicode is a funny mix 
in that it includes some forms of alternates, but can't include all of 
them, so we have to put up with both an ad-hoc Unicode that includes 
presentational variants, and also some other way to specify variants 
because Unicode can't include all of them.
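
The "presentation alternative" status of these ligatures is visible in the Unicode character database: each one carries a compatibility decomposition to its plain letters. A small sketch, assuming Python 3.6 or later for the f-string; the names are illustrative.

```python
import unicodedata

# U+FB00..U+FB06 are the Latin ligatures in the
# Alphabetic Presentation Forms block.
ligatures = [chr(cp) for cp in range(0xFB00, 0xFB07)]

for ch in ligatures:
    # Prints the code point, its name, and its decomposition mapping.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->",
          unicodedata.decomposition(ch))

fi_decomp = unicodedata.decomposition("\N{LATIN SMALL LIGATURE FI}")
```

The "<compat>" tag in each mapping marks the character as a compatibility form rather than a distinct letter.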

--Ned.

[toc] | [prev] | [next] | [standalone]
