| Started by | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| First post | 2013-11-30 00:44 +0000 |
| Last post | 2013-12-04 14:38 +0000 |
| Articles | 20 on this page of 76 — 22 participants |
Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 00:44 +0000
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-30 01:07 +0000
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 21:08 -0500
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-11-30 13:12 +1100
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 21:28 -0500
Re: Python Unicode handling wins again -- mostly Dave Angel <davea@davea.name> - 2013-11-29 22:06 -0500
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 04:21 +0000
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 23:30 -0500
Re: Python Unicode handling wins again -- mostly Zero Piraeus <z@etiol.net> - 2013-11-30 02:05 -0300
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 06:25 +0000
Re: Python Unicode handling wins again -- mostly Gene Heskett <gheskett@wdtv.com> - 2013-11-30 00:25 -0500
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-30 00:37 -0500
Re: Python Unicode handling wins again -- mostly Ian Kelly <ian.g.kelly@gmail.com> - 2013-11-29 23:00 -0700
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 07:11 +0000
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 07:41 +0000
Re: Python Unicode handling wins again -- mostly Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-01 11:41 +1300
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-30 08:07 +0000
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-11-30 11:11 -0800
Re: Python Unicode handling wins again -- mostly Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-01 11:37 +1300
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-11-30 18:07 -0500
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-01 08:57 -0800
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-01 00:22 +0000
Re: Python Unicode handling wins again -- mostly Tim Chase <python.list@tim.thechases.com> - 2013-11-30 18:52 -0600
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-01 00:54 +0000
Re: Python Unicode handling wins again -- mostly Tim Chase <python.list@tim.thechases.com> - 2013-11-30 19:05 -0600
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-01 12:13 +1100
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-30 20:27 -0500
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-01 12:31 +1100
Re: Python Unicode handling wins again -- mostly Serhiy Storchaka <storchaka@gmail.com> - 2013-12-01 20:00 +0200
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-01 12:15 -0800
Re: Python Unicode handling wins again -- mostly Tim Delaney <timothy.c.delaney@gmail.com> - 2013-12-02 07:54 +1100
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-02 04:39 -0800
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 14:46 +0000
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 10:22 -0500
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 15:45 +0000
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-03 02:49 +1100
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 10:58 -0500
Re: Python Unicode handling wins again -- mostly Terry Reedy <tjreedy@udel.edu> - 2013-12-02 15:26 -0500
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 20:45 +0000
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 16:44 -0500
Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 13:25 -0800
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 22:04 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Roy Smith <roy@panix.com> - 2013-12-02 20:38 -0500
Pythonista Goals [was Re: Code of Conduct, Trolls, and Thankless Jobs] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 17:56 -0800
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Grant Edwards <invalid@invalid.invalid> - 2013-12-03 04:32 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Steven D'Aprano <steve@pearwood.info> - 2013-12-03 05:41 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:14 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:11 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:23 -0500
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:24 -0500
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 22:32 +0000
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:53 -0500
Re: Code of Conduct, Trolls, and Thankless Jobs Ben Finney <ben+python@benfinney.id.au> - 2013-12-03 10:11 +1100
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 14:41 -0800
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Terry Reedy <tjreedy@udel.edu> - 2013-12-02 22:22 -0500
Re: Code of Conduct, Trolls, and Thankless Jobs Terry Reedy <tjreedy@udel.edu> - 2013-12-02 22:39 -0500
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 20:11 -0800
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-01 22:06 +0000
Re: Python Unicode handling wins again -- mostly Tim Delaney <timothy.c.delaney@gmail.com> - 2013-12-02 09:29 +1100
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-01 23:10 +0000
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-01 14:50 -0800
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 00:43 +0000
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 12:38 -0800
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 16:14 -0500
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve@pearwood.info> - 2013-12-03 05:06 +0000
Re: Python Unicode handling wins again -- mostly joe <joeedh@gmail.com> - 2013-12-02 23:35 -0800
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-03 10:34 -0800
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-03 08:23 +1100
Re: Python Unicode handling wins again -- mostly MRAB <python@mrabarnett.plus.com> - 2013-12-02 21:27 +0000
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 13:27 -0800
Re: Python Unicode handling wins again -- mostly Ben Finney <ben+python@benfinney.id.au> - 2013-12-03 09:56 +1100
Re: Python Unicode handling wins again -- mostly Neil Cerutti <neilc@norwich.edu> - 2013-12-03 13:47 +0000
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-03 06:26 -0800
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-04 05:52 -0800
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-04 14:07 +0000
Re: Python Unicode handling wins again -- mostly Neil Cerutti <neilc@norwich.edu> - 2013-12-04 14:38 +0000
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-30 00:44 +0000 |
| Subject | Python Unicode handling wins again -- mostly |
| Message-ID | <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com> |
There's a recent blog post complaining about the lousy support for
Unicode text in most programming languages:
http://mortoray.com/2013/11/27/the-string-type-is-broken/
The author, Mortoray, gives nine basic tests to understand how well the
string type in a language works. The first four involve "user-perceived
characters", also known as grapheme clusters.
(1) Does the decomposed string "noe\u0308l" print correctly? Notice that
the accented letter ë has been decomposed into a pair of code points,
U+0065 (LATIN SMALL LETTER E) and U+0308 (COMBINING DIAERESIS).
Python 3.3 passes this test:
py> print("noe\u0308l")
noël
although I expect that depends on the terminal you are running in.
(2) If you reverse that string, does it give "lëon"? The implication of
this question is that strings should operate on grapheme clusters rather
than code points. Python fails this test:
py> print("noe\u0308l"[::-1])
leon
Some terminals may display the umlaut over the l, or following the l.
I'm not completely sure it is fair to expect a string type to operate on
grapheme clusters (collections of decomposed characters) as the author
expects. I think that is going above and beyond what a basic string type
should be expected to do. I would expect a solid Unicode implementation
to include support for grapheme clusters, and in that regard Python is
lacking functionality.
(3) What are the first three characters? The author suggests that the
answer should be "noë", in which case Python fails again:
py> print("noe\u0308l"[:3])
noe
but again I'm not convinced that slicing should operate across decomposed
strings in this way. Surely the point of decomposing the string like that
is in order to count the base character e and the accent "\u0308"
separately?
(4) Likewise, what is the length of the decomposed string? The author
expects 4, but Python gives 5:
py> len("noe\u0308l")
5
So far, Python passes only one of the four tests, but I'm not convinced
that the three failed tests are fair for a string type. If strings
operated on grapheme clusters, these would be good tests, but it is not a
given that strings should.
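For anyone who wants the behaviour the blog author expects in tests (2) to (4), here is a minimal sketch using only unicodedata. It is not an implementation of the grapheme cluster rules in TR29: the helper name approx_graphemes is made up for illustration, and all it does is attach combining marks (characters with a non-zero canonical combining class) to the character before them. Hangul jamo, emoji sequences and similar cases are not handled.

```python
import unicodedata

def approx_graphemes(text):
    """Split text into base characters plus their trailing combining marks.

    Hypothetical helper: a rough approximation only, not the full
    grapheme cluster boundary rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # glue the mark onto the preceding cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

s = "noe\u0308l"
clusters = approx_graphemes(s)

print(len(s))                        # length in code points
print(len(clusters))                 # length in approximate clusters
print("".join(clusters[:3]))         # first three clusters
print("".join(reversed(clusters)))   # reversed cluster by cluster
```

Reversing the list of clusters keeps the diaeresis attached to the e, which is the behaviour test (2) asks for.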
The next few tests have to do with characters in the Supplementary
Multilingual Plane, and this is where Python 3.3 shines. (In older
versions, wide builds would also pass, but narrow builds would fail.)
(5) What is the length of "😸😾"?
Both characters U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E
(POUTING CAT FACE) are outside the Basic Multilingual Plane, which means
they require more than two bytes each. Most programming languages using
UTF-16 encodings internally (including Javascript and Java) fail this
test. Python 3.3 passes:
py> s = '😸😾'
py> len(s)
2
(Older versions of Python distinguished between *narrow builds*, which
used UTF-16 internally and *wide builds*, which used UTF-32. Narrow
builds would also fail this test.)
This makes Python one of a very few programming languages which can
easily handle so-called "astral characters" from the supplementary
planes while still having O(1) indexing operations.
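To see why UTF-16 based languages report a different length, it may help to count 16-bit code units directly. The following is only a sketch, assuming Python 3.3 or later (or an older wide build); it encodes the string to UTF-16 and unpacks the result into code units with struct.

```python
import struct

# GRINNING CAT FACE WITH SMILING EYES followed by POUTING CAT FACE
s = "\U0001F638\U0001F63E"

code_points = len(s)             # what Python 3.3+ reports
raw = s.encode("utf-16-le")      # little-endian, no BOM, two bytes per code unit
code_units = struct.unpack("<%dH" % (len(raw) // 2), raw)

print(code_points)
print(len(code_units))
print([hex(u) for u in code_units])
```

Each cat face becomes a surrogate pair, so a language that counts code units sees four "characters" where Python sees two.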
(6) What is the substring after the first character? The right answer is
a single character POUTING CAT FACE, and Python gets that correct:
py> unicodedata.name(s[1:])
'POUTING CAT FACE'
UTF-16 languages invariably end up with broken, invalid strings
containing half of a surrogate pair.
(7) What is the reverse of the string?
Python passes this test too:
py> print(s[::-1])
😾😸
py> for c in s[::-1]:
... unicodedata.name(c)
...
'POUTING CAT FACE'
'GRINNING CAT FACE WITH SMILING EYES'
UTF-16 based languages typically break, again getting invalid strings
containing surrogate pairs in the wrong order.
The next test involves ligatures. Ligatures are pairs, or triples, of
characters which have been moved closer together in order to look better.
Normally you would expect the type-setter to handle ligatures by
adjusting the spacing between characters, but there are a few pairs (such
as "fi" <=> "ﬁ") where type designers provided them as custom-designed
single characters, and Unicode includes them as legacy characters.
(8) What's the uppercase of "baffle" spelled with an ffl ligature?
Like most other languages, Python 3.2 fails:
py> 'baﬄe'.upper()
'BAﬄE'
but Python 3.3 passes:
py> 'baﬄe'.upper()
'BAFFLE'
Lastly, Mortoray returns to noël, and compares the composed and
decomposed versions of the string:
(9) Does "noël" equal "noe\u0308l"?
Python (correctly, in my opinion) reports that they do not:
py> "noël" == "noe\u0308l"
False
Again, one might argue whether a string type should report these as equal
or not, but I believe Python is doing the right thing here. As the author
points out, any decent Unicode-aware language should at least offer the
ability to convert between normalisation forms, and Python passes this
test:
py> unicodedata.normalize("NFD", "noël") == "noe\u0308l"
True
py> "noël" == unicodedata.normalize("NFC", "noe\u0308l")
True
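If you do want composed and decomposed strings to compare equal, the normalisation step can be wrapped in a small helper. This is a sketch only: nfc_equal is a made-up name, and NFC is just one reasonable choice of normalisation form.

```python
import unicodedata

def nfc_equal(a, b):
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "no\u00ebl"       # precomposed LATIN SMALL LETTER E WITH DIAERESIS
decomposed = "noe\u0308l"    # e followed by COMBINING DIAERESIS

print(composed == decomposed)             # plain code point comparison
print(nfc_equal(composed, decomposed))    # comparison after normalisation
```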
Out of the nine tests, Python 3.3 passes six, with three tests being
failures or dubious. If you believe that the native string type should
operate on code-points, then you'll think that Python does the right
thing. If you think it should operate on grapheme clusters, as the author
of the blog post does, then you'll think Python fails those three tests.
A call to arms
==============
As the Unicode Consortium itself acknowledges, sometimes you want to
operate on an array of code points, and sometimes on an array of
graphemes ("user-perceived characters"). Python 3.3 is now halfway there,
having excellent support for code-points across the entire Unicode
character set, not just the BMP.
The next step is to provide either a data type, or a library, for working
on grapheme clusters. The Unicode Consortium provides a detailed
discussion of this issue here:
http://www.unicode.org/reports/tr29/
If anyone is looking for a meaty project to work on, providing support
for grapheme clusters could be it. And if not, hopefully you've learned
something about Unicode and the limitations of Python's Unicode support.
--
Steven
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-11-30 01:07 +0000 |
| Message-ID | <mailman.3414.1385773668.18130.python-list@python.org> |
| In reply to | #60781 |
On 30/11/2013 00:44, Steven D'Aprano wrote:
>
> (5) What is the length of "😸😾"?
>
> Both characters U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E
> (POUTING CAT FACE) are outside the Basic Multilingual Plane, which means
> they require more than two bytes each. Most programming languages using
> UTF-16 encodings internally (including Javascript and Java) fail this
> test. Python 3.3 passes:
>
> py> s = '😸😾'
> py> len(s)
> 2
>
I couldn't care less if it passes, it's too slow and uses too much
memory[1], so please get the completely bug ridden Python 2 unicode
implementation restored at the earliest possible opportunity :)
[1]because I say so although I don't actually have any evidence to
support my case. :) :)
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-29 21:08 -0500 |
| Message-ID | <roy-89763C.21084929112013@news.panix.com> |
| In reply to | #60781 |
In article <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>
> Like most other languages, Python 3.2 fails:
>
> py> 'baffle'.upper()
> 'BAfflE'
>
> but Python 3.3 passes:
>
> py> 'baffle'.upper()
> 'BAFFLE'
I disagree.
The whole idea of ligatures like fi is purely typographic. The crossbar
on the "f" (at least in some fonts) runs into the dot on the "i".
Likewise, the top curl on an "f" runs into the serif on top of the "l"
(and similarly for ffl).
There is no such thing as a "FFL" ligature, because the upper case
letterforms don't run into each other like the lower case ones do. Thus,
I would argue that it's wrong to say that calling upper() on an ffl
ligature should yield FFL.
I would certainly expect, x.lower() == x.upper().lower(), to be True for
all values of x over the set of valid unicode codepoints. Having
u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
len(x.upper()) to be True.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-11-30 13:12 +1100 |
| Message-ID | <mailman.3417.1385777557.18130.python-list@python.org> |
| In reply to | #60790 |
On Sat, Nov 30, 2013 at 1:08 PM, Roy Smith <roy@panix.com> wrote:
> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints. Having
> u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
> len(x.upper()) to be True.
That's a nice theory, but the Unicode consortium disagrees with you on
both points.
ChrisA
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-29 21:28 -0500 |
| Message-ID | <roy-F25011.21284729112013@news.panix.com> |
| In reply to | #60791 |
In article <mailman.3417.1385777557.18130.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> On Sat, Nov 30, 2013 at 1:08 PM, Roy Smith <roy@panix.com> wrote:
> > I would certainly expect, x.lower() == x.upper().lower(), to be True for
> > all values of x over the set of valid unicode codepoints. Having
> > u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
> > len(x.upper()) to be True.
>
> That's a nice theory, but the Unicode consortium disagrees with you on
> both points.
>
> ChrisA
Harumph.
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-11-29 22:06 -0500 |
| Message-ID | <mailman.3418.1385780739.18130.python-list@python.org> |
| In reply to | #60792 |
On Fri, 29 Nov 2013 21:28:47 -0500, Roy Smith <roy@panix.com> wrote:
> In article <mailman.3417.1385777557.18130.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
> > On Sat, Nov 30, 2013 at 1:08 PM, Roy Smith <roy@panix.com> wrote:
> > > I would certainly expect, x.lower() == x.upper().lower(), to be True for
> > > all values of x over the set of valid unicode codepoints. Having
> > > u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
> > > len(x.upper()) to be True.
> > That's a nice theory, but the Unicode consortium disagrees with you on
> > both points.
And they were already false long before Unicode. I don’t know specifics
but there are many cases where there are no uppercase equivalents for a
particular lowercase character. And others where the uppercase
equivalent takes multiple characters.
--
DaveA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-30 04:21 +0000 |
| Message-ID | <529967dc$0$29993$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #60790 |
On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:
> In article <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
>> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>>
>> Like most other languages, Python 3.2 fails:
>>
>> py> 'baffle'.upper()
>> 'BAfflE'
You edited my text to remove the ligature? That's... unfortunate.
>> but Python 3.3 passes:
>>
>> py> 'baffle'.upper()
>> 'BAFFLE'
>
> I disagree.
>
> The whole idea of ligatures like fi is purely typographic.
In English, that's correct. I'm not sure if we can generalise that to all
languages that have ligatures. It also partly depends on how you define
ligatures. For example, would you consider that ampersand & to be a
ligature? These days, I would consider & to be a distinct character, but
originally it began as a ligature for "et" (Latin for "and").
But let's skip such corner cases, as they provide much heat but no
illumination, and I'll agree that when it comes to ligatures like fl, fi
and ffl, they are purely typographic.
> The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" run into the serif on top of the "l"
> (and similarly for ffl).
>
> There is no such thing as a "FFL" ligature, because the upper case
> letterforms don't run into each other like the lower case ones do. Thus,
> I would argue that it's wrong to say that calling upper() on an ffl
> ligature should yield FFL.
Your conclusion doesn't follow from the argument you are making. Since
the ffl ligature ﬄ is purely a typographical feature, then it should
uppercase to FFL (there being no typographic feature for uppercase FFL
ligature).
Consider the examples shown above, where you or your software
unfortunately edited out the ligature and replaced it with ASCII "ffl".
Or perhaps I should say *fortunately*, since it demonstrates the problem.
Since we agree that the ffl ligature is merely a typographic artifact of
some type-designer's whimsy, we can expect that the word "baffle" is
semantically exactly the same as the word "baﬄe". How foolish Python
would look if it did this:
py> 'baffle'.upper()
'BAfflE'
Replace the 'ffl' with the ligature, and the conclusion remains:
py> 'baﬄe'.upper()
'BAﬄE'
would be equally wrong.
Now, I accept that this picture isn't entirely black and white. For
example, we might argue that if ﬄ is purely typographical in nature,
surely we would also want 'baffle' == 'baﬄe' too? Or maybe not. This
indicates that capturing *all* the rules for text across the many
languages, writing systems and conventions is impossible.
There are some circumstances where we would want 'baffle' and 'baﬄe' to
compare equal, and others where we would want them to compare unequal.
Python gives us both:
py> "baffle" == "baﬄe"
False
py> "baffle" == unicodedata.normalize("NFKC", "baﬄe")
True
but frankly I'm baffled *wink* that you think there are any circumstances
where you would want the uppercase of ﬄ to be anything but FFL.
> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints.
You would expect wrongly. You are over-generalising from English, and if
you include ligatures and other special cases, not even all of English.
See, for example:
http://www.unicode.org/faq/casemap_charprop.html#7a
Apart from ligatures, some examples of troublesome characters with regard
to case are:
* German Eszett (sharp-S) ß can be uppercased to SS, SZ or ẞ depending
on context, particularly when dealing with placenames and family names.
(That last character, LATIN CAPITAL LETTER SHARP S, goes back to at
least the 1930s, although the official rules of German orthography
still insist on uppercasing ß to SS.)
* The English long-s ſ is uppercased to regular S.
* Turkish dotted and dotless I (İ and i, I and ı) use the same Latin
letters I and i but the case conversion rules are different.
* Both the Greek sigma σ and final sigma ς uppercase to Σ.
That last one is especially interesting: Python 3.3 gets it right, while
older Pythons do not. In Python 3.2:
py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύσ (Odysseus)'
while in 3.3 it roundtrips correctly:
py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύς (Odysseus)'
So... case conversions are not as simple as they appear at first glance.
They aren't always reversible, nor do they always roundtrip. Titlecase is
not necessarily the same as "uppercase the first letter and lowercase the
rest". Case conversions can be context or locale sensitive.
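The failed round trips are easy to check for yourself. This is a small sketch, assuming Python 3.3 or later (earlier versions used simple one-to-one case mappings); the helper name roundtrips is made up for illustration.

```python
import unicodedata

def roundtrips(ch):
    """True if lowercasing the uppercased character gives the original back."""
    return ch.upper().lower() == ch

# sharp s, long s, ffl ligature, final sigma
samples = ["\u00df", "\u017f", "\ufb04", "\u03c2"]

for ch in samples:
    up = ch.upper()
    # name, uppercase form, length of the uppercase form, round-trip result
    print(unicodedata.name(ch), repr(up), len(up), roundtrips(ch))
```

None of the four characters survives upper().lower(), and two of them change length when uppercased, so neither of the invariants proposed above holds.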
Anyway... even if you disagree with everything I have said, it is a fact
that Python has committed to following the Unicode standard, and the
Unicode standard requires that certain ligatures, including FFL, FL and
FI, are decomposed when converted to uppercase.
--
Steven
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-29 23:30 -0500 |
| Message-ID | <roy-BA1876.23302229112013@news.panix.com> |
| In reply to | #60794 |
In article <529967dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> You edited my text to remove the ligature? That's... unfortunate.
It was un-ligated by the time it reached me.
| From | Zero Piraeus <z@etiol.net> |
|---|---|
| Date | 2013-11-30 02:05 -0300 |
| Message-ID | <mailman.3419.1385787973.18130.python-list@python.org> |
| In reply to | #60794 |
:
On Sat, Nov 30, 2013 at 04:21:49AM +0000, Steven D'Aprano wrote:
> On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:
> > The whole idea of ligatures like fi is purely typographic.
>
> In English, that's correct. I'm not sure if we can generalise that to
> all languages that have ligatures. It also partly depends on how you
> define ligatures. For example, would you consider that ampersand & to
> be a ligature? These days, I would consider & to be a distinct
> character, but originally it began as a ligature for "et" (Latin for
> "and").
>
> But let's skip such corner cases, as they provide much heat but no
> illumination, [...]
In the interest of warmth (I know it's winter in some parts of the
world) ...
As I understand it, "&" has always been used to replace the word "et"
specifically, rather than the letter-pair e,t (no-one has ever written
"k&tle" other than ironically), which makes it a logogram rather than a
ligature (like "@").
(I happen to think the presence of ligatures in Unicode is insane, but
my dictator-of-the-world certificate appears to have gotten lost in the
post, so fixing that will have to wait).
-[]z.
--
Zero Piraeus: inter caetera
http://etiol.net/pubkey.asc
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-30 06:25 +0000 |
| Message-ID | <529984ed$0$29993$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #60796 |
On Sat, 30 Nov 2013 02:05:59 -0300, Zero Piraeus wrote:
> (I happen to think the presence of ligatures in Unicode is insane, but
> my dictator-of-the-world certificate appears to have gotten lost in the
> post, so fixing that will have to wait).
You're probably right, but we live in an insane world of dozens of insane
legacy encodings, and part of the mission of Unicode is to include every
single character that those legacy encodings did. Since some of them
included ligatures, so must Unicode. Sad but true.
(Unicode is intended as a replacement for the insanity of dozens of
multiply incompatible character sets. It cannot hope to replace them if
it cannot represent every distinct character they represent.)
--
Steven
| From | Gene Heskett <gheskett@wdtv.com> |
|---|---|
| Date | 2013-11-30 00:25 -0500 |
| Message-ID | <mailman.3420.1385789161.18130.python-list@python.org> |
| In reply to | #60794 |
On Saturday 30 November 2013 00:23:22 Zero Piraeus did opine:
> On Sat, Nov 30, 2013 at 04:21:49AM +0000, Steven D'Aprano wrote:
> > On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:
> > > The whole idea of ligatures like fi is purely typographic.
> >
> > In English, that's correct. I'm not sure if we can generalise that to
> > all languages that have ligatures. It also partly depends on how you
> > define ligatures. For example, would you consider that ampersand & to
> > be a ligature? These days, I would consider & to be a distinct
> > character, but originally it began as a ligature for "et" (Latin for
> > "and").
> >
> > But let's skip such corner cases, as they provide much heat but no
> > illumination, [...]
>
> In the interest of warmth (I know it's winter in some parts of the
> world) ...
>
> As I understand it, "&" has always been used to replace the word "et"
> specifically, rather than the letter-pair e,t (no-one has ever written
> "k&tle" other than ironically), which makes it a logogram rather than a
> ligature (like "@").
Whereas in these here parts, the "&" has always been read as a single
character shortcut for the word "and".
>
> (I happen to think the presence of ligatures in Unicode is insane, but
> my dictator-of-the-world certificate appears to have gotten lost in the
> post, so fixing that will have to wait).
>
> -[]z.
Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>
"I remember when I was a kid I used to come home from Sunday School and
my mother would get drunk and try to make pancakes."
-- George Carlin
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
law-abiding citizens.
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-11-30 00:37 -0500 |
| Message-ID | <roy-0E67F5.00371730112013@news.panix.com> |
| In reply to | #60794 |
In article <529967dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> > The whole idea of ligatures like fi is purely typographic.
>
> In English, that's correct. I'm not sure if we can generalise that to all
> languages that have ligatures. It also partly depends on how you define
> ligatures.
I was speaking specifically of "ligatures like fi" (or, if you prefer,
"ligatures like ό"). By which I mean those things printers invented
because some letter combinations look funny when typeset as two distinct
letters.
There are other kinds of ligatures. For example, œ is a diphthong. It
makes sense (well, to me, anyway) that upper case œ is Έ.
Well, anyway, that's the truth according to me. Apparently the Unicode
Consortium disagrees. So, who am I to argue with the people who decided
that I needed to be able to type a "PILE OF POO" character.
Which, by the way, I can find in my "Character Viewer" input helper, but
which MT Newswatcher doesn't appear to be willing to insert into text. I
guess Basic Multilingual Poo would have been OK but Astral Poo is too
much for it.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2013-11-29 23:00 -0700 |
| Message-ID | <mailman.3421.1385791271.18130.python-list@python.org> |
| In reply to | #60798 |
On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <roy@panix.com> wrote:
> I was speaking specifically of "ligatures like fi" (or, if you prefer,
> "ligatures like ό". By which I mean those things printers invented
> because some letter combinations look funny when typeset as two distinct
> letters.
I think the encoding of your email is incorrect, because GREEK SMALL
LETTER OMICRON WITH TONOS is not a ligature.
> There are other kinds of ligatures. For example, oe is a diphthong. It
> makes sense (well, to me, anyway) that upper case oe is Έ.
As above. I can't fathom why it would make sense for the upper case of
LATIN SMALL LIGATURE OE to be GREEK CAPITAL LETTER EPSILON WITH TONOS.
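The character names quoted above are easy to confirm with the standard unicodedata module. A minimal sketch, assuming Python 3; the variable names are illustrative only. It shows what Python itself gives as the upper case of LATIN SMALL LIGATURE OE, and names the two Greek characters that appeared in the post.

```python
import unicodedata

# The upper case of the oe ligature, according to Python's Unicode tables.
oe = "\N{LATIN SMALL LIGATURE OE}"
oe_upper = oe.upper()
print(unicodedata.name(oe_upper))   # LATIN CAPITAL LIGATURE OE

# The two characters that actually showed up in the post.
print(unicodedata.name("\u03cc"))   # GREEK SMALL LETTER OMICRON WITH TONOS
print(unicodedata.name("\u0388"))   # GREEK CAPITAL LETTER EPSILON WITH TONOS
```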
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-30 07:11 +0000 |
| Message-ID | <52998fbf$0$29993$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #60800 |
On Fri, 29 Nov 2013 23:00:27 -0700, Ian Kelly wrote:
> On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <roy@panix.com> wrote:
>> I was speaking specifically of "ligatures like fi" (or, if you prefer,
>> "ligatures like ό". By which I mean those things printers invented
>> because some letter combinations look funny when typeset as two
>> distinct letters.
>
> I think the encoding of your email is incorrect, because GREEK SMALL
> LETTER OMICRON WITH TONOS is not a ligature.
Roy's post, which is sent via Usenet not email, doesn't have an encoding
set. Since he's sending from a Mac, his software may believe that the
entire universe understands the Mac Roman encoding, which makes a certain
amount of sense since if I recall correctly the fi and fl ligatures
originally appeared in early Mac fonts.
I'm going to give Roy the benefit of the doubt and assume he actually
entered the fi ligature at his end. If his software was using Mac Roman,
it would insert a single byte DE into the message:
py> '\N{LATIN SMALL LIGATURE FI}'.encode('macroman')
b'\xde'
But that's not what his post includes. The message actually includes two
bytes CF8C, in other words:
'\N{LATIN SMALL LIGATURE FI}'.encode('who the hell knows')
=> b'\xCF\x8C'
Since nearly all of his post is in single bytes, it's some variable-width
encoding, but not UTF-8.
With no encoding set, our newsreader software starts off assuming that
the post uses UTF-8 ('cos that's the only sensible default), and those
two bytes happen to decode to ό GREEK SMALL LETTER OMICRON WITH TONOS.
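The decoding step described above can be reproduced directly. A minimal sketch, assuming Python 3.3 or later; the variable names are illustrative. It compares the bytes the fi ligature would have under Mac Roman and UTF-8 with the two bytes actually found in the message.

```python
import unicodedata

fi = "\N{LATIN SMALL LIGATURE FI}"

# What the ligature looks like on the wire in two candidate encodings.
fi_macroman = fi.encode("mac_roman")
fi_utf8 = fi.encode("utf-8")
print(fi_macroman)   # b'\xde'
print(fi_utf8)       # b'\xef\xac\x81'

# The bytes actually present in the post match neither of them.
mystery = b"\xcf\x8c"
print(mystery in (fi_macroman, fi_utf8))   # False

# Decoded as UTF-8, the newsreader's default, they give a Greek letter.
decoded = mystery.decode("utf-8")
print(unicodedata.name(decoded))   # GREEK SMALL LETTER OMICRON WITH TONOS
```

This says nothing about which encoding the sending software really used; it only shows why a UTF-8 reader displays ό.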
I'm not surprised that Roy has a somewhat jaundiced view of Unicode, when
the tools he uses are apparently so broken. But it isn't Unicode's fault,
it's the tools.
The really bizarre thing is that apparently Roy's software, MT-
NewsWatcher, knows enough Unicode to normalise ffl LATIN SMALL LIGATURE FFL
(sent in UTF-8 and therefore appearing as bytes b'\xef\xac\x84') to the
ASCII letters "ffl". That's astonishingly weird.
That is really a bizarre error. I suppose it is not entirely impossible
that the software is actually being clever rather than dumb. Having
correctly decoded the UTF-8 bytes, perhaps it realised that there was no
glyph for the ligature, and rather than display a MISSING CHAR glyph
(usually one of those empty boxes you sometimes see), it normalized it to
ASCII. But if it's that clever, why the hell doesn't it set an encoding
line in posts?????
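For comparison, Unicode's own compatibility normalisation produces the same ASCII result. This is a minimal sketch, assuming Python 3; what MT-NewsWatcher does internally is unknown, so treat the NFKC call only as one way such a mapping could arise.

```python
import unicodedata

ffl = "\N{LATIN SMALL LIGATURE FFL}"
print(ffl.encode("utf-8"))   # b'\xef\xac\x84'

# Canonical normalisation (NFC) leaves the ligature alone...
nfc = unicodedata.normalize("NFC", ffl)
print(nfc == ffl)            # True

# ...while compatibility normalisation (NFKC) expands it to three letters.
nfkc = unicodedata.normalize("NFKC", ffl)
print(nfkc)                  # ffl
print(len(nfkc))             # 3
```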
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-30 07:41 +0000 |
| Message-ID | <5299969b$0$29993$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #60798 |
On Sat, 30 Nov 2013 00:37:17 -0500, Roy Smith wrote:
> So, who am I to argue with the people who decided that I needed to be
> able to type a "PILE OF POO" character.
Blame the Japanese for that. Apparently some of the biggest users of
Unicode are the various Japanese mobile phone manufacturers, TV stations,
map makers and similar. So there's a large number of symbols and emoji
(emoticons) specifically added for them, presumably because they pay big
dollars to the Unicode Consortium and therefore get a lot of influence in
what gets added.
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
|---|---|
| Date | 2013-12-01 11:41 +1300 |
| Message-ID | <bfv7sqFpr2mU1@mid.individual.net> |
| In reply to | #60803 |
Steven D'Aprano wrote:
> On Sat, 30 Nov 2013 00:37:17 -0500, Roy Smith wrote:
>
>>So, who am I to argue with the people who decided that I needed to be
>>able to type a "PILE OF POO" character.
>
> Blame the Japanese for that. Apparently some of the biggest users of
> Unicode are the various Japanese mobile phone manufacturers, TV stations,
> map makers and similar.
Also there's apparently a pun in Japanese involving the words
for 'poo' and 'luck'. So putting a poo symbol in your text
message means 'good luck'. Given that, it's not *quite* as
silly as it seems.
--
Best of poo,
Greg
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-11-30 08:07 +0000 |
| Message-ID | <mailman.3422.1385798876.18130.python-list@python.org> |
| In reply to | #60790 |
On 30/11/2013 02:08, Roy Smith wrote:
> In article <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com>,
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
>> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>>
>> Like most other languages, Python 3.2 fails:
>>
>> py> 'baffle'.upper()
>> 'BAfflE'
>>
>> but Python 3.3 passes:
>>
>> py> 'baffle'.upper()
>> 'BAFFLE'
>
> I disagree.
>
> The whole idea of ligatures like fi is purely typographic. The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" runs into the serif on top of the "l"
> (and similarly for ffl).
>
> There is no such thing as a "FFL" ligature, because the upper case
> letterforms don't run into each other like the lower case ones do.
> Thus, I would argue that it's wrong to say that calling upper() on an
> ffl ligature should yield FFL.
>
> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints. Having
> u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
> len(x.upper()) to be True.
>
http://bugs.python.org/issue19819 talks about these beasties. Please
don't come back to me as I haven't got a clue!!!
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
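The two invariants Roy expects can be tested against the ffl ligature directly. A minimal sketch, assuming Python 3.3 or later (str.casefold() was added in 3.3); the variable names are illustrative only.

```python
lig = "\N{LATIN SMALL LIGATURE FFL}"
upper = lig.upper()

print(upper)                 # FFL
print(len(lig), len(upper))  # 1 3: upper() changes the length

# Round-tripping through upper() does not get the ligature back.
round_trip_equal = lig.lower() == upper.lower()
print(round_trip_equal)      # False

# Case folding, meant for caseless comparison, does treat them as equal.
folded_equal = lig.casefold() == upper.casefold()
print(folded_equal)          # True
```

So both expectations fail for this character in Python 3.3 and later, which is the behaviour the two sides of the thread are arguing about.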
[toc] | [prev] | [next] | [standalone]
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-11-30 11:11 -0800 |
| Message-ID | <b26b96f2-8415-4e58-8c47-243e4605b184@googlegroups.com> |
| In reply to | #60790 |
On Saturday 30 November 2013 03:08:49 UTC+1, Roy Smith wrote:
> The whole idea of ligatures like fi is purely typographic. The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" runs into the serif on top of the "l"
> (and similarly for ffl).
And do you know the origin of this typographical feature?
Because, mechanically, the dot of the "i" broke too often.
I can't prove that's the truth, but I have read this many times in the
literature on typography and on Unicode.
In my opinion, a very plausible explanation.
jmf
[toc] | [prev] | [next] | [standalone]
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
|---|---|
| Date | 2013-12-01 11:37 +1300 |
| Message-ID | <bfv7lbFppgnU1@mid.individual.net> |
| In reply to | #60808 |
wxjmfauth@gmail.com wrote:
> And do you know the origin of this typographical feature?
> Because, mechanically, the dot of the "i" broke too often.
>
> In my opinion, a very plausible explanation.
It doesn't sound very plausible to me, because there
are a lot more stand-alone 'i's in English text than
there are ones following an f. What is there to stop
them from breaking?
It's more likely to be simply a kerning issue. You
want to get the stems of the f and the i close together,
and the only practical way to do that with mechanical
type is to merge them into one piece of metal.
Which makes it even sillier to have an 'ffi' character
in this day and age, when you can simply space the
characters so that they overlap.
--
Greg
[toc] | [prev] | [next] | [standalone]
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-11-30 18:07 -0500 |
| Message-ID | <mailman.3427.1385852868.18130.python-list@python.org> |
| In reply to | #60809 |
On 11/30/13 5:37 PM, Gregory Ewing wrote:
> wxjmfauth@gmail.com wrote:
>> And do you know the origin of this typographical feature?
>> Because, mechanically, the dot of the "i" broke too often.
>>
>> In my opinion, a very plausible explanation.
>
> It doesn't sound very plausible to me, because there
> are a lot more stand-alone 'i's in English text than
> there are ones following an f. What is there to stop
> them from breaking?
>
> It's more likely to be simply a kerning issue. You
> want to get the stems of the f and the i close together,
> and the only practical way to do that with mechanical
> type is to merge them into one piece of metal.
>
> Which makes it even sillier to have an 'ffi' character
> in this day and age, when you can simply space the
> characters so that they overlap.
>
The fi ligature was created because visually, an f and i wouldn't work
well together: the crossbar of the f was near, but not connected to, the
serif of the i, and the terminal bulb of the f was close to, but not
coincident with, the dot of the i.
This article goes into great detail, and has a good illustration of how
an f and i can clash, and how an fi ligature can fix the problem:
http://opentype.info/blog/2012/11/20/whats-a-ligature/ . Note the second
fi illustration, which demonstrates using a ligature to make the letters
appear *less* connected than they would individually!
This is also why "simply spacing the characters" isn't a solution: a
specially designed ligature looks better than a separate f and i, no
matter how minutely kerned.
It's unfortunate that Unicode includes presentation alternatives like the
fi (and ff, fl, ffi, and ffl) ligatures. It was done to be a superset of
existing encodings. Many typefaces have other non-encoded ligatures as
well, especially display faces, which also have alternate glyphs.
Unicode is a funny mix in that it includes some forms of alternates, but
can't include all of them, so we have to put up with both an ad-hoc
Unicode that includes presentational variants, and also some other way to
specify variants because Unicode can't include all of them.
--Ned.
[toc] | [prev] | [next] | [standalone]