Groups > comp.lang.python > #60781 > unrolled thread
| Started by | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| First post | 2013-11-30 00:44 +0000 |
| Last post | 2013-12-04 14:38 +0000 |
| Articles | 16 on this page of 76 — 22 participants |
Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 00:44 +0000
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-30 01:07 +0000
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 21:08 -0500
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-11-30 13:12 +1100
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 21:28 -0500
Re: Python Unicode handling wins again -- mostly Dave Angel <davea@davea.name> - 2013-11-29 22:06 -0500
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 04:21 +0000
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 23:30 -0500
Re: Python Unicode handling wins again -- mostly Zero Piraeus <z@etiol.net> - 2013-11-30 02:05 -0300
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 06:25 +0000
Re: Python Unicode handling wins again -- mostly Gene Heskett <gheskett@wdtv.com> - 2013-11-30 00:25 -0500
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-30 00:37 -0500
Re: Python Unicode handling wins again -- mostly Ian Kelly <ian.g.kelly@gmail.com> - 2013-11-29 23:00 -0700
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 07:11 +0000
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 07:41 +0000
Re: Python Unicode handling wins again -- mostly Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-01 11:41 +1300
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-30 08:07 +0000
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-11-30 11:11 -0800
Re: Python Unicode handling wins again -- mostly Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-01 11:37 +1300
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-11-30 18:07 -0500
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-01 08:57 -0800
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-01 00:22 +0000
Re: Python Unicode handling wins again -- mostly Tim Chase <python.list@tim.thechases.com> - 2013-11-30 18:52 -0600
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-01 00:54 +0000
Re: Python Unicode handling wins again -- mostly Tim Chase <python.list@tim.thechases.com> - 2013-11-30 19:05 -0600
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-01 12:13 +1100
Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-30 20:27 -0500
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-01 12:31 +1100
Re: Python Unicode handling wins again -- mostly Serhiy Storchaka <storchaka@gmail.com> - 2013-12-01 20:00 +0200
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-01 12:15 -0800
Re: Python Unicode handling wins again -- mostly Tim Delaney <timothy.c.delaney@gmail.com> - 2013-12-02 07:54 +1100
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-02 04:39 -0800
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 14:46 +0000
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 10:22 -0500
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 15:45 +0000
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-03 02:49 +1100
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 10:58 -0500
Re: Python Unicode handling wins again -- mostly Terry Reedy <tjreedy@udel.edu> - 2013-12-02 15:26 -0500
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 20:45 +0000
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 16:44 -0500
Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 13:25 -0800
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 22:04 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Roy Smith <roy@panix.com> - 2013-12-02 20:38 -0500
Pythonista Goals [was Re: Code of Conduct, Trolls, and Thankless Jobs] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 17:56 -0800
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Grant Edwards <invalid@invalid.invalid> - 2013-12-03 04:32 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Steven D'Aprano <steve@pearwood.info> - 2013-12-03 05:41 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:14 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:11 +0000
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:23 -0500
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:24 -0500
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 22:32 +0000
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:53 -0500
Re: Code of Conduct, Trolls, and Thankless Jobs Ben Finney <ben+python@benfinney.id.au> - 2013-12-03 10:11 +1100
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 14:41 -0800
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Terry Reedy <tjreedy@udel.edu> - 2013-12-02 22:22 -0500
Re: Code of Conduct, Trolls, and Thankless Jobs Terry Reedy <tjreedy@udel.edu> - 2013-12-02 22:39 -0500
Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 20:11 -0800
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-01 22:06 +0000
Re: Python Unicode handling wins again -- mostly Tim Delaney <timothy.c.delaney@gmail.com> - 2013-12-02 09:29 +1100
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-01 23:10 +0000
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-01 14:50 -0800
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 00:43 +0000
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 12:38 -0800
Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 16:14 -0500
Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve@pearwood.info> - 2013-12-03 05:06 +0000
Re: Python Unicode handling wins again -- mostly joe <joeedh@gmail.com> - 2013-12-02 23:35 -0800
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-03 10:34 -0800
Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-03 08:23 +1100
Re: Python Unicode handling wins again -- mostly MRAB <python@mrabarnett.plus.com> - 2013-12-02 21:27 +0000
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 13:27 -0800
Re: Python Unicode handling wins again -- mostly Ben Finney <ben+python@benfinney.id.au> - 2013-12-03 09:56 +1100
Re: Python Unicode handling wins again -- mostly Neil Cerutti <neilc@norwich.edu> - 2013-12-03 13:47 +0000
Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-03 06:26 -0800
Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-04 05:52 -0800
Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-04 14:07 +0000
Re: Python Unicode handling wins again -- mostly Neil Cerutti <neilc@norwich.edu> - 2013-12-04 14:38 +0000
Page 4 of 4 — ← Prev page 1 2 3 [4]
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2013-12-01 14:50 -0800 |
| Message-ID | <mailman.3452.1385940986.18130.python-list@python.org> |
| In reply to | #60838 |
On 12/01/2013 02:06 PM, Mark Lawrence wrote:
>
> I don't remember him [jmf] ever having a valid point, so FTR can we have a reference please. I do remember Steven D'Aprano
> showing that there was a regression which I flagged up here http://bugs.python.org/issue16061. It was fixed by Serhiy
> Storchaka, who appears to have forgotten more about Python than I'll ever know, grrr!!! :)

The initial complaint came, unsurprisingly, from jmf. But don't worry much, even a stopped clock has a better track record... it's at least right twice a day. ;)

--
~Ethan~
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-12-02 00:43 +0000 |
| Message-ID | <mailman.3453.1385945036.18130.python-list@python.org> |
| In reply to | #60838 |
On 01/12/2013 22:50, Ethan Furman wrote:
> On 12/01/2013 02:06 PM, Mark Lawrence wrote:
>>
>> I don't remember him [jmf] ever having a valid point, so FTR can we
>> have a reference please. I do remember Steven D'Aprano
>> showing that there was a regression which I flagged up here
>> http://bugs.python.org/issue16061. It was fixed by Serhiy
>> Storchaka, who appears to have forgotten more about Python than I'll
>> ever know, grrr!!! :)
>
> The initial complaint came, unsurprisingly, from jmf. But don't worry
> much, even a stopped clock has a better track record... it's at least
> right twice a day. ;)
>
> --
> ~Ethan~

I had to chuckle, "initial complaint" indeed!!! He first started complaining in August 2012 in this thread https://mail.python.org/pipermail/python-list/2012-August/628650.html. Then he continued in September 2012 in this thread https://mail.python.org/pipermail/python-list/2012-September/631613.html, which led to issue 16061. He's been continuing to moan on and off ever since, but funnily enough has *NEVER* produced a single shred of evidence to back his claims. We'll have to wait until the cows come home before he does.

Contrast that to the Victor Stinner statement here http://bugs.python.org/issue16061#msg171413 "Python 3.3 is 2x faster than Python 3.2 to replace a character with another if the string only contains the character 3 times. This is not acceptable, Python 3.3 must be as slow as Python 3.2!"

Thinking about that I really do want the Python 2 code back. Apart from the PEP 393 implementation being faster, using less memory and being correct, it has nothing to offer. Now what Python sketch does that remind me of? :)

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2013-12-02 12:38 -0800 |
| Message-ID | <mailman.3476.1386017902.18130.python-list@python.org> |
| In reply to | #60781 |
On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>
> Out of the nine tests, Python 3.3 passes six, with three tests being
> failures or dubious. If you believe that the native string type should
> operate on code-points, then you'll think that Python does the right
> thing.

I think Python is doing it correctly. If I want to operate on "clusters" I'll normalize the string first.

Thanks for this excellent post.

--
~Ethan~
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-12-02 16:14 -0500 |
| Message-ID | <mailman.3477.1386018871.18130.python-list@python.org> |
| In reply to | #60781 |
On 12/2/13 3:38 PM, Ethan Furman wrote:
> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>
>> Out of the nine tests, Python 3.3 passes six, with three tests being
>> failures or dubious. If you believe that the native string type should
>> operate on code-points, then you'll think that Python does the right
>> thing.
>
> I think Python is doing it correctly. If I want to operate on
> "clusters" I'll normalize the string first.
>
> Thanks for this excellent post.
>
> --
> ~Ethan~

This is where my knowledge about Unicode gets fuzzy. Isn't it the case that some grapheme clusters (or whatever the right word is) can't be normalized down to a single code point? Characters can accept many accents, for example. In that case, you can't always normalize and use the existing string methods, but would need more specialized code.

--Ned.
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2013-12-03 05:06 +0000 |
| Message-ID | <529d66d1$0$11113$c3e8da3@news.astraweb.com> |
| In reply to | #60884 |
On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
> On 12/2/13 3:38 PM, Ethan Furman wrote:
>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>>
>>> Out of the nine tests, Python 3.3 passes six, with three tests being
>>> failures or dubious. If you believe that the native string type should
>>> operate on code-points, then you'll think that Python does the right
>>> thing.
>>
>> I think Python is doing it correctly. If I want to operate on
>> "clusters" I'll normalize the string first.
>>
>> Thanks for this excellent post.
>>
>> --
>> ~Ethan~
>
> This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point? Characters can accept many
> accents, for example. In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.
That is correct.
If Unicode had a distinct code point for every possible combination of
base-character plus an arbitrary number of diacritics or accents, the
0x10FFFF code points wouldn't be anywhere near enough.
I see over 300 diacritics used just in the first 5000 code points. Let's
pretend that's only 100, and that you can use up to a maximum of 5 at a
time. That gives 79375496 combinations per base character, much larger
than the total number of Unicode code points in total.
If anyone wishes to check my logic:
# count distinct combining chars
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)
t = [c for c in s if unicodedata.combining(c)]
len(set(t))

# calculate the number of combinations
def comb(r, n):
    """Combinations nCr"""
    p = 1
    for i in range(r+1, n+1):
        p *= i
    for i in range(1, n-r+1):
        # integer division keeps the result an exact int;
        # p is always divisible by i at this point
        p //= i
    return p

sum(comb(i, 100) for i in range(6))
I'm not suggesting that all of those accents are necessarily in use in
the real world, but there are languages which construct arbitrary
combinations of accents. (Or so I have been led to believe.)
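
The arithmetic above can be cross-checked with exact integer math. This is only a minimal sketch, assuming Python 3.8 or later for `math.comb` (which did not exist when this thread was written); the names `MARKS` and `MAX_PER_BASE` are illustrative and simply encode the "pretend 100 marks, at most 5 at a time" assumption from the post.

```python
import math

MARKS = 100        # pretend pool of distinct combining marks (assumption from the post)
MAX_PER_BASE = 5   # at most this many marks attached to one base character

# Count unordered selections of 0..5 marks out of 100, using exact integers.
combinations = sum(math.comb(MARKS, k) for k in range(MAX_PER_BASE + 1))

# 0x10FFFF is the highest code point, so the code space has 0x110000 slots.
code_space = 0x10FFFF + 1

print(combinations)               # 79375496
print(combinations > code_space)  # True
```

Even under these deliberately conservative assumptions, a single base character would need roughly 70 times more slots than the whole code space offers.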
--
Steven
| From | joe <joeedh@gmail.com> |
|---|---|
| Date | 2013-12-02 23:35 -0800 |
| Message-ID | <mailman.3502.1386056138.18130.python-list@python.org> |
| In reply to | #60918 |
How would a grapheme library work? Basic cluster combination, or would
implementing other algorithms (line break, normalizing to a "canonical"
form) be necessary?
How do people use grapheme clusters in non-rendering situations? Or
perhaps here's a better question: does anyone know any non-Latin (Japanese
and Arabic come to mind) speakers who use Python to process text in their
own language? They could perhaps tell us what bugs them most about Python's
current API and which standard libraries need work.
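
For the "basic cluster combination" option, a rough idea of what such a library might start from is sketched below. This is a deliberate simplification and not the Unicode text segmentation algorithm (UAX #29): it only glues code points with a non-zero canonical combining class onto the preceding code point, so it would mis-handle Hangul jamo, spacing marks, emoji sequences and the like. The function name `naive_clusters` is hypothetical.

```python
import unicodedata

def naive_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified illustration only; real grapheme segmentation follows UAX #29.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"               # 'e' followed by COMBINING ACUTE ACCENT
print(len(word))                  # 5
print(len(naive_clusters(word)))  # 4

# Reversing by cluster keeps the accent on its base letter.
print("".join(reversed(naive_clusters(word))))
```

Operating on the list of clusters rather than on the `str` directly is what lets operations such as reversal or truncation avoid separating an accent from its base.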
On Dec 2, 2013 10:10 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
>
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> >>>
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
> >>> failures or dubious. If you believe that the native string type should
> >>> operate on code-points, then you'll think that Python does the right
> >>> thing.
> >>
> >> I think Python is doing it correctly. If I want to operate on
> >> "clusters" I'll normalize the string first.
> >>
> >> Thanks for this excellent post.
> >>
> >> --
> >> ~Ethan~
> >
> > This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> > that some grapheme clusters (or whatever the right word is) can't be
> > normalized down to a single code point? Characters can accept many
> > accents, for example. In that case, you can't always normalize and use
> > the existing string methods, but would need more specialized code.
>
> That is correct.
>
> If Unicode had a distinct code point for every possible combination of
> base-character plus an arbitrary number of diacritics or accents, the
> 0x10FFFF code points wouldn't be anywhere near enough.
>
> I see over 300 diacritics used just in the first 5000 code points. Let's
> pretend that's only 100, and that you can use up to a maximum of 5 at a
> time. That gives 79375496 combinations per base character, much larger
> than the total number of Unicode code points in total.
>
> If anyone wishes to check my logic:
>
> # count distinct combining chars
> import unicodedata
> s = ''.join(chr(i) for i in range(33, 5000))
> s = unicodedata.normalize('NFD', s)
> t = [c for c in s if unicodedata.combining(c)]
> len(set(t))
>
> # calculate the number of combinations
> def comb(r, n):
> """Combinations nCr"""
> p = 1
> for i in range(r+1, n+1):
> p *= i
> for i in range(1, n-r+1):
> p /= i
> return p
>
> sum(comb(i, 100) for i in range(6))
>
>
> I'm not suggesting that all of those accents are necessarily in use in
> the real world, but there are languages which construct arbitrary
> combinations of accents. (Or so I have been led to believe.)
>
>
> --
> Steven
> --
> https://mail.python.org/mailman/listinfo/python-list
>
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-12-03 10:34 -0800 |
| Message-ID | <e693f9a6-c7d0-428d-84fe-86c59014c6ac@googlegroups.com> |
| In reply to | #60918 |
On Tuesday, 3 December 2013 06:06:26 UTC+1, Steven D'Aprano wrote:
> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
>
>
>
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
>
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>
> >>>
>
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
>
> >>> failures or dubious. If you believe that the native string type should
>
> >>> operate on code-points, then you'll think that Python does the right
>
> >>> thing.
>
> >>
>
> >> I think Python is doing it correctly. If I want to operate on
>
> >> "clusters" I'll normalize the string first.
>
> >>
>
> >> Thanks for this excellent post.
>
> >>
>
> >> --
>
> >> ~Ethan~
>
> >
>
> > This is where my knowledge about Unicode gets fuzzy. Isn't it the case
>
> > that some grapheme clusters (or whatever the right word is) can't be
>
> > normalized down to a single code point? Characters can accept many
>
> > accents, for example. In that case, you can't always normalize and use
>
> > the existing string methods, but would need more specialized code.
>
>
>
> That is correct.
>
>
>
> If Unicode had a distinct code point for every possible combination of
>
> base-character plus an arbitrary number of diacritics or accents, the
>
> 0x10FFFF code points wouldn't be anywhere near enough.
>
>
>
> I see over 300 diacritics used just in the first 5000 code points. Let's
>
> pretend that's only 100, and that you can use up to a maximum of 5 at a
>
> time. That gives 79375496 combinations per base character, much larger
>
> than the total number of Unicode code points in total.
>
>
>
> If anyone wishes to check my logic:
>
>
>
> # count distinct combining chars
>
> import unicodedata
>
> s = ''.join(chr(i) for i in range(33, 5000))
>
> s = unicodedata.normalize('NFD', s)
>
> t = [c for c in s if unicodedata.combining(c)]
>
> len(set(t))
>
>
>
> # calculate the number of combinations
>
> def comb(r, n):
>
> """Combinations nCr"""
>
> p = 1
>
> for i in range(r+1, n+1):
>
> p *= i
>
> for i in range(1, n-r+1):
>
> p /= i
>
> return p
>
>
>
> sum(comb(i, 100) for i in range(6))
>
>
>
>
>
> I'm not suggesting that all of those accents are necessarily in use in
>
> the real world, but there are languages which construct arbitrary
>
> combinations of accents. (Or so I have been led to believe.)
>
>
>
from one of my libs, bmp only
>>> import fourbiunicode5
>>> print(len(fourbiunicode5.AllCombiningMarks))
240
jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-12-03 08:23 +1100 |
| Message-ID | <mailman.3479.1386019386.18130.python-list@python.org> |
| In reply to | #60781 |
On Tue, Dec 3, 2013 at 8:14 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
> This is where my knowledge about Unicode gets fuzzy. Isn't it the case that
> some grapheme clusters (or whatever the right word is) can't be normalized
> down to a single code point? Characters can accept many accents, for
> example.
You can't normalize everything down to a single code point, but you
can normalize the other way by breaking out everything that can be
broken out.
>>> import unicodedata
>>> print(ascii(unicodedata.normalize("NFKC", "ä")))
'\xe4'
>>> print(ascii(unicodedata.normalize("NFKD", "ä")))
'a\u0308'
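
To make the asymmetry concrete, here is a small sketch using the canonical forms NFC and NFD rather than the compatibility forms shown above. It assumes only the standard `unicodedata` module; the variable names are illustrative. The second sequence is chosen because Unicode has no precomposed "q with tilde", so composition has nothing to map it to.

```python
import unicodedata

# a + COMBINING DIAERESIS has a precomposed form (U+00E4), so NFC composes it.
composed = unicodedata.normalize("NFC", "a\u0308")

# q + COMBINING TILDE has no precomposed form, so NFC leaves two code points.
stubborn = unicodedata.normalize("NFC", "q\u0303")

# Going the other way always works: NFD splits the precomposed letter.
split = unicodedata.normalize("NFD", "\u00e4")

print(len(composed))  # 1
print(len(stubborn))  # 2
print(len(split))     # 2
```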
ChrisA
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-12-02 21:27 +0000 |
| Message-ID | <mailman.3481.1386019633.18130.python-list@python.org> |
| In reply to | #60781 |
On 02/12/2013 21:14, Ned Batchelder wrote:
> On 12/2/13 3:38 PM, Ethan Furman wrote:
>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>>
>>> Out of the nine tests, Python 3.3 passes six, with three tests being
>>> failures or dubious. If you believe that the native string type should
>>> operate on code-points, then you'll think that Python does the right
>>> thing.
>>
>> I think Python is doing it correctly. If I want to operate on
>> "clusters" I'll normalize the string first.
>>
>> Thanks for this excellent post.
>>
>> --
>> ~Ethan~
>
> This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point? Characters can accept many
> accents, for example. In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.
>

A better way of saying it is that there are codepoints for some grapheme clusters. Those 'precomposed' codepoints exist because some legacy character sets contained them, and having a one-to-one mapping encouraged Unicode's adoption.
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2013-12-02 13:27 -0800 |
| Message-ID | <mailman.3484.1386020836.18130.python-list@python.org> |
| In reply to | #60781 |
On 12/02/2013 01:23 PM, Chris Angelico wrote:
> On Tue, Dec 3, 2013 at 8:14 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
>> This is where my knowledge about Unicode gets fuzzy. Isn't it the case that
>> some grapheme clusters (or whatever the right word is) can't be normalized
>> down to a single code point? Characters can accept many accents, for
>> example.
>
> You can't normalize everything down to a single code point, but you
> can normalize the other way by breaking out everything that can be
> broken out.
>
> >>> print(ascii(unicodedata.normalize("NFKC", "ä")))
> '\xe4'
> >>> print(ascii(unicodedata.normalize("NFKD", "ä")))
> 'a\u0308'
Well, Steven was right then! There's room for a library to handle this situation. Or is there one already?
--
~Ethan~
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2013-12-03 09:56 +1100 |
| Message-ID | <mailman.3490.1386025030.18130.python-list@python.org> |
| In reply to | #60781 |
Ned Batchelder <ned@nedbatchelder.com> writes:

> This is where my knowledge about Unicode gets fuzzy. Isn't it the
> case that some grapheme clusters (or whatever the right word is) can't
> be normalized down to a single code point? Characters can accept many
> accents, for example.

That's true, but doesn't affect the point being made: that one can have both “sequence of Unicode code points” in Python's ‘unicode’ (now ‘str’) type, and also deal with “sequence of text the reader will see”.

> In that case, you can't always normalize and use the existing string
> methods, but would need more specialized code.

Specialised code may not be needed. It will at least be true that “any two code-point sequences which normalise to the same value will be visually the same for the reader”, which is an important assertion for addressing the complaints from Mortoray's article.

--
 \     “Pray, v. To ask that the laws of the universe be annulled in |
  `\     behalf of a single petitioner confessedly unworthy.” —Ambrose |
_o__)                         Bierce, _The Devil's Dictionary_, 1906 |
Ben Finney
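
The assertion about sequences that normalise to the same value can be illustrated with a short comparison helper. This is a sketch under assumptions, not anything proposed in the thread: the helper name `same_text` and the default choice of NFC are hypothetical.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "caf\u00e9"   # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)           # False
print(same_text(precomposed, decomposed))  # True
```

Plain `==` compares code points, so the two spellings differ until both sides are normalized to a common form.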
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-12-03 13:47 +0000 |
| Message-ID | <mailman.3511.1386078553.18130.python-list@python.org> |
| In reply to | #60781 |
On 2013-12-02, Ethan Furman <ethan@stoneleaf.us> wrote:
> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>> Out of the nine tests, Python 3.3 passes six, with three tests
>> being failures or dubious. If you believe that the native
>> string type should operate on code-points, then you'll think
>> that Python does the right thing.
>
> I think Python is doing it correctly. If I want to operate on
> "clusters" I'll normalize the string first.

Normalizing doesn't resolve the issues the blog brings up; NFC can't condense every multi-code-point sequence into one, and normalizing can lose or mangle information. There are good examples here: http://unicode.org/reports/tr15/

> Thanks for this excellent post.

Agreed.

--
Neil Cerutti
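
Two small cases of normalization changing what was written, as a hedged sketch using only the standard `unicodedata` module. The sample strings are my own choice, not taken from the post or from TR15: the ANGSTROM SIGN is replaced by a different code point even under NFC, and NFKC flattens a superscript digit into an ordinary one.

```python
import unicodedata

angstrom = "\u212b"       # ANGSTROM SIGN
squared = "x\u00b2"       # x followed by SUPERSCRIPT TWO

# NFC maps the Angstrom sign to LATIN CAPITAL LETTER A WITH RING ABOVE,
# so the original code point cannot be recovered afterwards.
nfc_angstrom = unicodedata.normalize("NFC", angstrom)

# NFC leaves the superscript alone, but NFKC turns "x squared" into "x2".
nfc_squared = unicodedata.normalize("NFC", squared)
nfkc_squared = unicodedata.normalize("NFKC", squared)

print(ascii(nfc_angstrom))     # '\xc5'
print(nfc_squared == squared)  # True
print(nfkc_squared)            # x2
```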
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2013-12-03 06:26 -0800 |
| Message-ID | <mailman.3513.1386082416.18130.python-list@python.org> |
| In reply to | #60781 |
On 12/02/2013 12:38 PM, Ethan Furman wrote:
> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>
>> Out of the nine tests, Python 3.3 passes six, with three tests being
>> failures or dubious. If you believe that the native string type should
>> operate on code-points, then you'll think that Python does the right
>> thing.
>
> I think Python is doing it correctly. If I want to operate on "clusters" I'll normalize the string first.

Hrmm, well, after being educated ;) I think I may have to reverse my position. Given that not every cluster can be normalized to a single code point perhaps Python is doing it the best possible way. On the other hand, we have a uni*code* type, not a uni*char* type. Maybe 3.5 can have that. ;)

At any rate, definitely good to be aware of the issue.

--
~Ethan~
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-12-04 05:52 -0800 |
| Message-ID | <ca8ac2b7-f6a1-4a71-8dae-a4667d6a83b7@googlegroups.com> |
| In reply to | #60939 |
On Tuesday, 3 December 2013 15:26:45 UTC+1, Ethan Furman wrote:
> On 12/02/2013 12:38 PM, Ethan Furman wrote:
>
> > On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>
> >>
>
> >> Out of the nine tests, Python 3.3 passes six, with three tests being
>
> >> failures or dubious. If you believe that the native string type should
>
> >> operate on code-points, then you'll think that Python does the right
>
> >> thing.
>
> >
>
> > I think Python is doing it correctly. If I want to operate on "clusters" I'll normalize the string first.
>
>
>
> Hrmm, well, after being educated ;) I think I may have to reverse my position. Given that not every cluster can be
>
> normalized to a single code point perhaps Python is doing it the best possible way. On the other hand, we have a
>
> uni*code* type, not a uni*char* type. Maybe 3.5 can have that. ;)
>
>
------
You intuitively pointed a very important feature
of "unicode". However, it is not necessary, this is
exactly what unicode does (when used properly).

jmf
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-12-04 14:07 +0000 |
| Message-ID | <mailman.3562.1386166062.18130.python-list@python.org> |
| In reply to | #61020 |
On 04/12/2013 13:52, wxjmfauth@gmail.com wrote:

[snip all the double spaced stuff]

>
> You intuitively pointed a very important feature
> of "unicode". However, it is not necessary, this is
> exactly what unicode does (when used properly).
>
> jmf
>

Presumably using unicode correctly prevents messages being sent across the ether with superfluous, extremely irritating double spacing? Or is that down to poor tools in combination with the ignorance of their users?

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2013-12-04 14:38 +0000 |
| Message-ID | <mailman.3564.1386168013.18130.python-list@python.org> |
| In reply to | #61020 |
On 2013-12-04, wxjmfauth@gmail.com <wxjmfauth@gmail.com> wrote:
> You intuitively pointed a very important feature of "unicode".
> However, it is not necessary, this is exactly what unicode does
> (when used properly).

Unicode only provides character sets. It's not a natural language parsing facility.

--
Neil Cerutti