From: joe
To: "Steven D'Aprano"
Cc: python-list@python.org
Date: Mon, 2 Dec 2013 23:35:29 -0800
Subject: Re: Python Unicode handling wins again -- mostly
Newsgroups: comp.lang.python
In-Reply-To: <529d66d1$0$11113$c3e8da3@news.astraweb.com>
References: <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com>
 <529CEFB1.2030007@stoneleaf.us>
 <529d66d1$0$11113$c3e8da3@news.astraweb.com>

How would a grapheme library work? Would basic cluster combination be
enough, or would it also need to implement other algorithms (line
breaking, normalizing to a "canonical" form)?

How do people use grapheme clusters in non-rendering situations? Or
perhaps here's a better question: does anyone know any speakers of
non-Latin-script languages (Japanese and Arabic come to mind) who use
Python to process text in their own language? They could perhaps tell us
what bugs them most about Python's current API and which standard
libraries need work.

On Dec 2, 2013 10:10 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
>
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> >>>
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
> >>> failures or dubious. If you believe that the native string type should
> >>> operate on code-points, then you'll think that Python does the right
> >>> thing.
> >>
> >> I think Python is doing it correctly.  If I want to operate on
> >> "clusters" I'll normalize the string first.
> >>
> >> Thanks for this excellent post.
> >>
> >> --
> >> ~Ethan~
> >
> > This is where my knowledge about Unicode gets fuzzy.  Isn't it the case
> > that some grapheme clusters (or whatever the right word is) can't be
> > normalized down to a single code point?  Characters can accept many
> > accents, for example.  In that case, you can't always normalize and use
> > the existing string methods, but would need more specialized code.
>
> That is correct.
>
> If Unicode had a distinct code point for every possible combination of
> base-character plus an arbitrary number of diacritics or accents, the
> 0x10FFFF code points wouldn't be anywhere near enough.
>
> I see over 300 diacritics used just in the first 5000 code points. Let's
> pretend that's only 100, and that you can use up to a maximum of 5 at a
> time. That gives 79375496 combinations per base character, much larger
> than the total number of Unicode code points.
>
> If anyone wishes to check my logic:
>
> # count distinct combining chars
> import unicodedata
> s = ''.join(chr(i) for i in range(33, 5000))
> s = unicodedata.normalize('NFD', s)
> t = [c for c in s if unicodedata.combining(c)]
> len(set(t))
>
> # calculate the number of combinations
> def comb(r, n):
>     """Combinations nCr"""
>     p = 1
>     for i in range(r+1, n+1):
>         p *= i
>     for i in range(1, n-r+1):
>         p //= i  # integer division keeps the result an exact int
>     return p
>
> sum(comb(i, 100) for i in range(6))
>
>
> I'm not suggesting that all of those accents are necessarily in use in
> the real world, but there are languages which construct arbitrary
> combinations of accents. (Or so I have been led to believe.)
>
>
> --
> Steven
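
Ned's point, that some clusters cannot be normalized down to a single
code point, can be checked directly, and it also hints at what "basic
cluster combination" might look like. This is a minimal sketch using
only the standard unicodedata module. rough_clusters is a hypothetical
helper, not an existing library API, and it only approximates grapheme
clusters by attaching mark characters to the preceding base; the real
segmentation rules also cover Hangul syllables, emoji sequences and
more.

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the marks that follow it.

    Hypothetical helper: a simplification, not full grapheme
    cluster segmentation.
    """
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        # General categories Mn, Mc and Me are the "mark" characters.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" + COMBINING ACUTE ACCENT has a precomposed form, so NFC yields
# one code point.
composed = unicodedata.normalize("NFC", "e\u0301")

# "q" + COMBINING ACUTE ACCENT has no precomposed form, so NFC leaves
# two code points.
uncomposed = unicodedata.normalize("NFC", "q\u0301")

print(len(composed), len(uncomposed))
print(len(rough_clusters("e\u0301q\u0301")))
```

After normalization, len() and the other string methods still see two
code points for the accented q, while the cluster view counts it as one
user-perceived character.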

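
The figure of 79375496 quoted above can be re-derived in a couple of
lines. This sketch assumes Python 3.8 or later, where math.comb is
available (it did not exist in the Python 3.3 discussed in this
thread). The numbers 100 and 5 are the same pretend values as in the
quoted estimate, not measured ones, and the variable names are only
illustrative.

```python
import math

MARKS = 100       # pretend number of distinct diacritics
MAX_STACK = 5     # pretend maximum number of marks on one base character

# Ways to choose 0, 1, ..., 5 distinct marks out of 100, ignoring order.
combinations = sum(math.comb(MARKS, k) for k in range(MAX_STACK + 1))

# The Unicode code space runs from 0 to 0x10FFFF inclusive.
code_space = 0x10FFFF + 1

print(combinations)
print(combinations // code_space)
```

Even with these deliberately low assumptions, the combinations for a
single base character come to about 71 times the size of the whole code
space.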