
Python Unicode handling wins again -- mostly

Started by	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post	2013-11-30 00:44 +0000
Last post	2013-12-04 14:38 +0000
Articles	16 on this page of 76 — 22 participants


  Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 00:44 +0000
    Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-30 01:07 +0000
    Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 21:08 -0500
      Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-11-30 13:12 +1100
        Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 21:28 -0500
          Re: Python Unicode handling wins again -- mostly Dave Angel <davea@davea.name> - 2013-11-29 22:06 -0500
      Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 04:21 +0000
        Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-29 23:30 -0500
        Re: Python Unicode handling wins again -- mostly Zero Piraeus <z@etiol.net> - 2013-11-30 02:05 -0300
          Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 06:25 +0000
        Re: Python Unicode handling wins again -- mostly Gene Heskett <gheskett@wdtv.com> - 2013-11-30 00:25 -0500
        Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-30 00:37 -0500
          Re: Python Unicode handling wins again -- mostly Ian Kelly <ian.g.kelly@gmail.com> - 2013-11-29 23:00 -0700
            Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 07:11 +0000
          Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-30 07:41 +0000
            Re: Python Unicode handling wins again -- mostly Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-01 11:41 +1300
      Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-30 08:07 +0000
      Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-11-30 11:11 -0800
        Re: Python Unicode handling wins again -- mostly Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2013-12-01 11:37 +1300
          Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-11-30 18:07 -0500
            Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-01 08:57 -0800
          Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-01 00:22 +0000
            Re: Python Unicode handling wins again -- mostly Tim Chase <python.list@tim.thechases.com> - 2013-11-30 18:52 -0600
              Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-01 00:54 +0000
                Re: Python Unicode handling wins again -- mostly Tim Chase <python.list@tim.thechases.com> - 2013-11-30 19:05 -0600
                Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-01 12:13 +1100
                  Re: Python Unicode handling wins again -- mostly Roy Smith <roy@panix.com> - 2013-11-30 20:27 -0500
                    Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-01 12:31 +1100
    Re: Python Unicode handling wins again -- mostly Serhiy Storchaka <storchaka@gmail.com> - 2013-12-01 20:00 +0200
      Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-01 12:15 -0800
        Re: Python Unicode handling wins again -- mostly Tim Delaney <timothy.c.delaney@gmail.com> - 2013-12-02 07:54 +1100
          Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-02 04:39 -0800
            Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 14:46 +0000
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 10:22 -0500
            Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 15:45 +0000
            Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-03 02:49 +1100
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 10:58 -0500
            Re: Python Unicode handling wins again -- mostly Terry Reedy <tjreedy@udel.edu> - 2013-12-02 15:26 -0500
            Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 20:45 +0000
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 16:44 -0500
            Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 13:25 -0800
            Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 22:04 +0000
              Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Roy Smith <roy@panix.com> - 2013-12-02 20:38 -0500
                Pythonista Goals  [was Re: Code of Conduct, Trolls, and Thankless Jobs] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 17:56 -0800
                Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Grant Edwards <invalid@invalid.invalid> - 2013-12-03 04:32 +0000
                  Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Steven D'Aprano <steve@pearwood.info> - 2013-12-03 05:41 +0000
                  Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:14 +0000
                Re: Code of Conduct, Trolls, and Thankless Jobs [was Re: Python Unicode handling wins again -- mostly] Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-03 12:11 +0000
            Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:23 -0500
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:24 -0500
            Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 22:32 +0000
            Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 17:53 -0500
            Re: Code of Conduct, Trolls, and Thankless Jobs Ben Finney <ben+python@benfinney.id.au> - 2013-12-03 10:11 +1100
            Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 14:41 -0800
            Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Terry Reedy <tjreedy@udel.edu> - 2013-12-02 22:22 -0500
            Re: Code of Conduct, Trolls, and Thankless Jobs Terry Reedy <tjreedy@udel.edu> - 2013-12-02 22:39 -0500
            Re: Code of Conduct, Trolls, and Thankless Jobs  [was Re: Python Unicode handling wins again -- mostly] Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 20:11 -0800
        Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-01 22:06 +0000
        Re: Python Unicode handling wins again -- mostly Tim Delaney <timothy.c.delaney@gmail.com> - 2013-12-02 09:29 +1100
        Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-01 23:10 +0000
        Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-01 14:50 -0800
        Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-02 00:43 +0000
    Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 12:38 -0800
    Re: Python Unicode handling wins again -- mostly Ned Batchelder <ned@nedbatchelder.com> - 2013-12-02 16:14 -0500
      Re: Python Unicode handling wins again -- mostly Steven D'Aprano <steve@pearwood.info> - 2013-12-03 05:06 +0000
        Re: Python Unicode handling wins again -- mostly joe <joeedh@gmail.com> - 2013-12-02 23:35 -0800
        Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-03 10:34 -0800
    Re: Python Unicode handling wins again -- mostly Chris Angelico <rosuav@gmail.com> - 2013-12-03 08:23 +1100
    Re: Python Unicode handling wins again -- mostly MRAB <python@mrabarnett.plus.com> - 2013-12-02 21:27 +0000
    Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-02 13:27 -0800
    Re: Python Unicode handling wins again -- mostly Ben Finney <ben+python@benfinney.id.au> - 2013-12-03 09:56 +1100
    Re: Python Unicode handling wins again -- mostly Neil Cerutti <neilc@norwich.edu> - 2013-12-03 13:47 +0000
    Re: Python Unicode handling wins again -- mostly Ethan Furman <ethan@stoneleaf.us> - 2013-12-03 06:26 -0800
      Re: Python Unicode handling wins again -- mostly wxjmfauth@gmail.com - 2013-12-04 05:52 -0800
        Re: Python Unicode handling wins again -- mostly Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-04 14:07 +0000
        Re: Python Unicode handling wins again -- mostly Neil Cerutti <neilc@norwich.edu> - 2013-12-04 14:38 +0000

Page 4 of 4 — ← Prev page 1 2 3 [4]

#60849

From	Ethan Furman <ethan@stoneleaf.us>
Date	2013-12-01 14:50 -0800
Message-ID	<mailman.3452.1385940986.18130.python-list@python.org>
In reply to	#60838

On 12/01/2013 02:06 PM, Mark Lawrence wrote:
>
> I don't remember him [jmf] ever having a valid point, so FTR can we have a reference please.  I do remember Steven D'Aprano
> showing that there was a regression which I flagged up here http://bugs.python.org/issue16061.  It was fixed by Serhiy
> Storchaka, who appears to have forgotten more about Python than I'll ever know, grrr!!! :)

The initial complaint came, unsurprisingly, from jmf.  But don't worry much, even a stopped clock has a better track 
record... it's at least right twice a day.  ;)

--
~Ethan~


#60850

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-12-02 00:43 +0000
Message-ID	<mailman.3453.1385945036.18130.python-list@python.org>
In reply to	#60838

On 01/12/2013 22:50, Ethan Furman wrote:
> On 12/01/2013 02:06 PM, Mark Lawrence wrote:
>>
>> I don't remember him [jmf] ever having a valid point, so FTR can we
>> have a reference please.  I do remember Steven D'Aprano
>> showing that there was a regression which I flagged up here
>> http://bugs.python.org/issue16061.  It was fixed by Serhiy
>> Storchaka, who appears to have forgotten more about Python than I'll
>> ever know, grrr!!! :)
>
> The initial complaint came, unsurprisingly, from jmf.  But don't worry
> much, even a stopped clock has a better track record... it's at least
> right twice a day.  ;)
>
> --
> ~Ethan~

I had to chuckle, "initial complaint" indeed!!!  He first started 
complaining in August 2012 in this thread 
https://mail.python.org/pipermail/python-list/2012-August/628650.html. 
Then he continued in September 2012 in this thread 
https://mail.python.org/pipermail/python-list/2012-September/631613.html, which 
led to issue 16061.  He's been continuing to moan on and off ever 
since, but funnily enough has *NEVER* produced a single shred of 
evidence to back his claims.  We'll have to wait until the cows come 
home before he does.

Contrast that to the Victor Stinner statement here 
http://bugs.python.org/issue16061#msg171413 "Python 3.3 is 2x faster 
than Python 3.2 to replace a character with another if the string only 
contains the character 3 times. This is not acceptable, Python 3.3 must 
be as slow as Python 3.2!"  Thinking about that I really do want the 
Python 2 code back.  Apart from the PEP 393 implementation being faster, 
using less memory and being correct, it has nothing to offer.  Now what 
Python sketch does that remind me of? :)

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence


#60883

From	Ethan Furman <ethan@stoneleaf.us>
Date	2013-12-02 12:38 -0800
Message-ID	<mailman.3476.1386017902.18130.python-list@python.org>
In reply to	#60781

On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>
> Out of the nine tests, Python 3.3 passes six, with three tests being
> failures or dubious. If you believe that the native string type should
> operate on code-points, then you'll think that Python does the right
> thing.

I think Python is doing it correctly.  If I want to operate on "clusters" I'll normalize the string first.

Thanks for this excellent post.

--
~Ethan~


#60884

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-12-02 16:14 -0500
Message-ID	<mailman.3477.1386018871.18130.python-list@python.org>
In reply to	#60781

On 12/2/13 3:38 PM, Ethan Furman wrote:
> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>
>> Out of the nine tests, Python 3.3 passes six, with three tests being
>> failures or dubious. If you believe that the native string type should
>> operate on code-points, then you'll think that Python does the right
>> thing.
>
> I think Python is doing it correctly.  If I want to operate on
> "clusters" I'll normalize the string first.
>
> Thanks for this excellent post.
>
> --
> ~Ethan~

This is where my knowledge about Unicode gets fuzzy.  Isn't it the case 
that some grapheme clusters (or whatever the right word is) can't be 
normalized down to a single code point?  Characters can accept many 
accents, for example.  In that case, you can't always normalize and use 
the existing string methods, but would need more specialized code.

--Ned.


#60918

From	Steven D'Aprano <steve@pearwood.info>
Date	2013-12-03 05:06 +0000
Message-ID	<529d66d1$0$11113$c3e8da3@news.astraweb.com>
In reply to	#60884

On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:

> On 12/2/13 3:38 PM, Ethan Furman wrote:
>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>>
>>> Out of the nine tests, Python 3.3 passes six, with three tests being
>>> failures or dubious. If you believe that the native string type should
>>> operate on code-points, then you'll think that Python does the right
>>> thing.
>>
>> I think Python is doing it correctly.  If I want to operate on
>> "clusters" I'll normalize the string first.
>>
>> Thanks for this excellent post.
>>
>> --
>> ~Ethan~
> 
> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point?  Characters can accept many
> accents, for example.  In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.

That is correct.

If Unicode had a distinct code point for every possible combination of 
base-character plus an arbitrary number of diacritics or accents, the 
0x10FFFF code points wouldn't be anywhere near enough.

I see over 300 diacritics used just in the first 5000 code points. Let's 
pretend that's only 100, and that you can use up to a maximum of 5 at a 
time. That gives 79375496 combinations per base character, much larger 
than the total number of Unicode code points.

If anyone wishes to check my logic:

# count distinct combining chars
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)
t = [c for c in s if unicodedata.combining(c)]
len(set(t))

# calculate the number of combinations
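def comb(r, n):
    """Combinations nCr: the number of ways to choose r items from n."""
    p = 1
    for i in range(r+1, n+1):
        p *= i
    for i in range(1, n-r+1):
        p //= i  # exact at every step; /= would give a float in Python 3
    return p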

sum(comb(i, 100) for i in range(6))

I'm not suggesting that all of those accents are necessarily in use in 
the real world, but there are languages which construct arbitrary 
combinations of accents. (Or so I have been led to believe.) 

-- 
Steven


#60923

From	joe <joeedh@gmail.com>
Date	2013-12-02 23:35 -0800
Message-ID	<mailman.3502.1386056138.18130.python-list@python.org>
In reply to	#60918


How would a grapheme library work? Basic cluster combination, or would
implementing other algorithms (line break, normalizing to a "canonical"
form) be necessary?

How do people use grapheme clusters in non-rendering situations? Or
perhaps here's a better question: does anyone know any speakers of
non-Latin-script languages (Japanese and Arabic come to mind) who use
Python to process text in their own language? They could perhaps tell us
what bugs them most about Python's current API and which standard
libraries need work.
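
As a rough sketch of the "basic cluster combination" option: attach each
combining mark to the code point before it. The helper name
naive_clusters is made up for illustration, and this is a simplification
rather than the full Unicode text segmentation rules. It ignores things
like Hangul jamo and zero-width joiners, and it misses marks whose
combining class is 0.

```python
import unicodedata


def naive_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration only; not a UAX #29 implementation.
    """
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for most combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(word))                  # 5 code points
print(len(naive_clusters(word)))  # 4 clusters
```

For anything beyond a toy, the third-party regex module's \X pattern is
one existing way to match grapheme clusters.
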
On Dec 2, 2013 10:10 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:

> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
>
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> >>>
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
> >>> failures or dubious. If you believe that the native string type should
> >>> operate on code-points, then you'll think that Python does the right
> >>> thing.
> >>
> >> I think Python is doing it correctly.  If I want to operate on
> >> "clusters" I'll normalize the string first.
> >>
> >> Thanks for this excellent post.
> >>
> >> --
> >> ~Ethan~
> >
> > This is where my knowledge about Unicode gets fuzzy.  Isn't it the case
> > that some grapheme clusters (or whatever the right word is) can't be
> > normalized down to a single code point?  Characters can accept many
> > accents, for example.  In that case, you can't always normalize and use
> > the existing string methods, but would need more specialized code.
>
> That is correct.
>
> If Unicode had a distinct code point for every possible combination of
> base-character plus an arbitrary number of diacritics or accents, the
> 0x10FFFF code points wouldn't be anywhere near enough.
>
> I see over 300 diacritics used just in the first 5000 code points. Let's
> pretend that's only 100, and that you can use up to a maximum of 5 at a
> time. That gives 79375496 combinations per base character, much larger
> than the total number of Unicode code points in total.
>
> If anyone wishes to check my logic:
>
> # count distinct combining chars
> import unicodedata
> s = ''.join(chr(i) for i in range(33, 5000))
> s = unicodedata.normalize('NFD', s)
> t = [c for c in s if unicodedata.combining(c)]
> len(set(t))
>
> # calculate the number of combinations
> def comb(r, n):
>     """Combinations nCr"""
>     p = 1
>     for i in range(r+1, n+1):
>         p *= i
>     for i in range(1, n-r+1):
>         p //= i
>     return p
>
> sum(comb(i, 100) for i in range(6))
>
>
> I'm not suggesting that all of those accents are necessarily in use in
> the real world, but there are languages which construct arbitrary
> combinations of accents. (Or so I have been led to believe.)
>
>
> --
> Steven
> --
> https://mail.python.org/mailman/listinfo/python-list
>


#60960

From	wxjmfauth@gmail.com
Date	2013-12-03 10:34 -0800
Message-ID	<e693f9a6-c7d0-428d-84fe-86c59014c6ac@googlegroups.com>
In reply to	#60918

Le mardi 3 décembre 2013 06:06:26 UTC+1, Steven D'Aprano a écrit :
> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
> 
> 
> 
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
> 
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> 
> >>>
> 
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
> 
> >>> failures or dubious. If you believe that the native string type should
> 
> >>> operate on code-points, then you'll think that Python does the right
> 
> >>> thing.
> 
> >>
> 
> >> I think Python is doing it correctly.  If I want to operate on
> 
> >> "clusters" I'll normalize the string first.
> 
> >>
> 
> >> Thanks for this excellent post.
> 
> >>
> 
> >> --
> 
> >> ~Ethan~
> 
> > 
> 
> > This is where my knowledge about Unicode gets fuzzy.  Isn't it the case
> 
> > that some grapheme clusters (or whatever the right word is) can't be
> 
> > normalized down to a single code point?  Characters can accept many
> 
> > accents, for example.  In that case, you can't always normalize and use
> 
> > the existing string methods, but would need more specialized code.
> 
> 
> 
> That is correct.
> 
> 
> 
> If Unicode had a distinct code point for every possible combination of 
> 
> base-character plus an arbitrary number of diacritics or accents, the 
> 
> 0x10FFFF code points wouldn't be anywhere near enough.
> 
> 
> 
> I see over 300 diacritics used just in the first 5000 code points. Let's 
> 
> pretend that's only 100, and that you can use up to a maximum of 5 at a 
> 
> time. That gives 79375496 combinations per base character, much larger 
> 
> than the total number of Unicode code points in total.
> 
> 
> 
> If anyone wishes to check my logic:
> 
> 
> 
> # count distinct combining chars
> 
> import unicodedata
> 
> s = ''.join(chr(i) for i in range(33, 5000))
> 
> s = unicodedata.normalize('NFD', s)
> 
> t = [c for c in s if unicodedata.combining(c)]
> 
> len(set(t))
> 
> 
> 
> # calculate the number of combinations
> 
> def comb(r, n):
> 
>     """Combinations nCr"""
> 
>     p = 1
> 
>     for i in range(r+1, n+1):
> 
>         p *= i
> 
>     for i in range(1, n-r+1):
> 
>         p /= i
> 
>     return p
> 
> 
> 
> sum(comb(i, 100) for i in range(6))
> 
> 
> 
> 
> 
> I'm not suggesting that all of those accents are necessarily in use in 
> 
> the real world, but there are languages which construct arbitrary 
> 
> combinations of accents. (Or so I have been led to believe.) 
> 
> 
> 

from one of my libs, bmp only

>>> import fourbiunicode5
>>> print(len(fourbiunicode5.AllCombiningMarks))
240


jmf


#60886

From	Chris Angelico <rosuav@gmail.com>
Date	2013-12-03 08:23 +1100
Message-ID	<mailman.3479.1386019386.18130.python-list@python.org>
In reply to	#60781

On Tue, Dec 3, 2013 at 8:14 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case that
> some grapheme clusters (or whatever the right word is) can't be normalized
> down to a single code point?  Characters can accept many accents, for
> example.

You can't normalize everything down to a single code point, but you
can normalize the other way by breaking out everything that can be
broken out.

>>> print(ascii(unicodedata.normalize("NFKC", "ä")))
'\xe4'
>>> print(ascii(unicodedata.normalize("NFKD", "ä")))
'a\u0308'

ChrisA


#60888

From	MRAB <python@mrabarnett.plus.com>
Date	2013-12-02 21:27 +0000
Message-ID	<mailman.3481.1386019633.18130.python-list@python.org>
In reply to	#60781

On 02/12/2013 21:14, Ned Batchelder wrote:
> On 12/2/13 3:38 PM, Ethan Furman wrote:
>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>>
>>> Out of the nine tests, Python 3.3 passes six, with three tests being
>>> failures or dubious. If you believe that the native string type should
>>> operate on code-points, then you'll think that Python does the right
>>> thing.
>>
>> I think Python is doing it correctly.  If I want to operate on
>> "clusters" I'll normalize the string first.
>>
>> Thanks for this excellent post.
>>
>> --
>> ~Ethan~
>
> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point?  Characters can accept many
> accents, for example.  In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.
>
A better way of saying it is that there are codepoints for some grapheme
clusters. Those 'precomposed' codepoints exist because some legacy
character sets contained them, and having a one-to-one mapping
encouraged Unicode's adoption.
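
A small sketch of that distinction, assuming only the standard
unicodedata module: "e" plus a combining acute accent has a precomposed
code point, while "x" plus the same accent does not, so NFC leaves it as
two code points.

```python
import unicodedata

# U+0301 is COMBINING ACUTE ACCENT
e_acute = unicodedata.normalize("NFC", "e\u0301")  # precomposed U+00E9 exists
x_acute = unicodedata.normalize("NFC", "x\u0301")  # no precomposed form

print(len(e_acute))  # 1
print(len(x_acute))  # 2
```
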


#60891

From	Ethan Furman <ethan@stoneleaf.us>
Date	2013-12-02 13:27 -0800
Message-ID	<mailman.3484.1386020836.18130.python-list@python.org>
In reply to	#60781

On 12/02/2013 01:23 PM, Chris Angelico wrote:
> On Tue, Dec 3, 2013 at 8:14 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
>> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case that
>> some grapheme clusters (or whatever the right word is) can't be normalized
>> down to a single code point?  Characters can accept many accents, for
>> example.
>
> You can't normalize everything down to a single code point, but you
> can normalize the other way by breaking out everything that can be
> broken out.
>
>>>> print(ascii(unicodedata.normalize("NFKC", "ä")))
> '\xe4'
>>>> print(ascii(unicodedata.normalize("NFKD", "ä")))
> 'a\u0308'

Well, Steven was right then!  There's room for a library to handle this situation.  Or is there one already?

--
~Ethan~


#60898

From	Ben Finney <ben+python@benfinney.id.au>
Date	2013-12-03 09:56 +1100
Message-ID	<mailman.3490.1386025030.18130.python-list@python.org>
In reply to	#60781

Ned Batchelder <ned@nedbatchelder.com> writes:

> This is where my knowledge about Unicode gets fuzzy.  Isn't it the
> case that some grapheme clusters (or whatever the right word is) can't
> be normalized down to a single code point?  Characters can accept many
> accents, for example.

That's true, but doesn't affect the point being made: that one can have
both “sequence of Unicode code points” in Python's ‘unicode’ (now ‘str’)
type, and also deal with “sequence of text the reader will see”.

> In that case, you can't always normalize and use the existing string
> methods, but would need more specialized code.

Specialised code may not be needed. It will at least be true that “any
two code-point sequences which normalise to the same value will be
visually the same for the reader”, which is an important assertion for
addressing the complaints from Mortoray's article.
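
A minimal sketch of that assertion, assuming NFC as the normalisation
form. The helper name same_text is made up for illustration; comparing
after normalising treats the composed and decomposed spellings as the
same text even though the raw code-point sequences differ.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00e4"     # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"  # 'a' followed by COMBINING DIAERESIS

print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```
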

-- 
 \       “Pray, v. To ask that the laws of the universe be annulled in |
  `\     behalf of a single petitioner confessedly unworthy.” —Ambrose |
_o__)                           Bierce, _The Devil's Dictionary_, 1906 |
Ben Finney


#60934

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-12-03 13:47 +0000
Message-ID	<mailman.3511.1386078553.18130.python-list@python.org>
In reply to	#60781

On 2013-12-02, Ethan Furman <ethan@stoneleaf.us> wrote:
> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>> Out of the nine tests, Python 3.3 passes six, with three tests
>> being failures or dubious. If you believe that the native
>> string type should operate on code-points, then you'll think
>> that Python does the right thing.
>
> I think Python is doing it correctly.  If I want to operate on
> "clusters" I'll normalize the string first.

Normalizing doesn't resolve the issues the blog brings up; NFC
can't condense every multi-code-point sequence into one, and
normalizing can lose or mangle information. There are good
examples here: http://unicode.org/reports/tr15/
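
Two small cases of what "lose or mangle" can look like, as a sketch
using the standard unicodedata module (the variable names are just for
illustration): NFC replaces the Angstrom sign with a different code
point, and the compatibility form NFKC flattens a superscript into a
plain digit.

```python
import unicodedata

angstrom = "\u212b"  # ANGSTROM SIGN
squared = "x\u00b2"  # 'x' followed by SUPERSCRIPT TWO

nfc = unicodedata.normalize("NFC", angstrom)
nfkc = unicodedata.normalize("NFKC", squared)

print(ascii(nfc))   # '\xc5'
print(ascii(nfkc))  # 'x2'
```

Neither change can be undone from the normalised string alone.
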

> Thanks for this excellent post.

Agreed.

-- 
Neil Cerutti


#60939

From	Ethan Furman <ethan@stoneleaf.us>
Date	2013-12-03 06:26 -0800
Message-ID	<mailman.3513.1386082416.18130.python-list@python.org>
In reply to	#60781

On 12/02/2013 12:38 PM, Ethan Furman wrote:
> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>
>> Out of the nine tests, Python 3.3 passes six, with three tests being
>> failures or dubious. If you believe that the native string type should
>> operate on code-points, then you'll think that Python does the right
>> thing.
>
> I think Python is doing it correctly.  If I want to operate on "clusters" I'll normalize the string first.

Hrmm, well, after being educated ;) I think I may have to reverse my position.  Given that not every cluster can be 
normalized to a single code point perhaps Python is doing it the best possible way.  On the other hand, we have a 
uni*code* type, not a uni*char* type.  Maybe 3.5 can have that.  ;)

At any rate, definitely good to be aware of the issue.

--
~Ethan~


#61020

From	wxjmfauth@gmail.com
Date	2013-12-04 05:52 -0800
Message-ID	<ca8ac2b7-f6a1-4a71-8dae-a4667d6a83b7@googlegroups.com>
In reply to	#60939

Le mardi 3 décembre 2013 15:26:45 UTC+1, Ethan Furman a écrit :
> On 12/02/2013 12:38 PM, Ethan Furman wrote:
> 
> > On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> 
> >>
> 
> >> Out of the nine tests, Python 3.3 passes six, with three tests being
> 
> >> failures or dubious. If you believe that the native string type should
> 
> >> operate on code-points, then you'll think that Python does the right
> 
> >> thing.
> 
> >
> 
> > I think Python is doing it correctly.  If I want to operate on "clusters" I'll normalize the string first.
> 
> 
> 
> Hrmm, well, after being educated ;) I think I may have to reverse my position.  Given that not every cluster can be 
> 
> normalized to a single code point perhaps Python is doing it the best possible way.  On the other hand, we have a 
> 
> uni*code* type, not a uni*char* type.  Maybe 3.5 can have that.  ;)
> 
> 

------


You intuitively pointed out a very important feature
of "unicode". However, it is not necessary; this is
exactly what unicode does (when used properly).

jmf


#61021

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-12-04 14:07 +0000
Message-ID	<mailman.3562.1386166062.18130.python-list@python.org>
In reply to	#61020

On 04/12/2013 13:52, wxjmfauth@gmail.com wrote:

[snip all the double spaced stuff]

>
> You intuitively pointed out a very important feature
> of "unicode". However, it is not necessary; this is
> exactly what unicode does (when used properly).
>
> jmf
>

Presumably using unicode correctly prevents messages being sent across 
the ether with superfluous, extremely irritating double spacing?  Or is 
that down to poor tools in combination with the ignorance of their users?

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence


#61023

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-12-04 14:38 +0000
Message-ID	<mailman.3564.1386168013.18130.python-list@python.org>
In reply to	#61020

On 2013-12-04, wxjmfauth@gmail.com <wxjmfauth@gmail.com> wrote:
> You intuitively pointed out a very important feature of "unicode".
> However, it is not necessary; this is exactly what unicode does
> (when used properly).

Unicode only provides character sets. It's not a natural language
parsing facility.

-- 
Neil Cerutti
