| References | <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com> <529CEFB1.2030007@stoneleaf.us> <mailman.3477.1386018871.18130.python-list@python.org> <529d66d1$0$11113$c3e8da3@news.astraweb.com> |
|---|---|
| Date | 2013-12-02 23:35 -0800 |
| Subject | Re: Python Unicode handling wins again -- mostly |
| From | joe <joeedh@gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.3502.1386056138.18130.python-list@python.org> |
How would a grapheme library work? Basic cluster combination, or would
implementing other algorithms (line break, normalizing to a "canonical"
form) be necessary?
How do people use grapheme clusters in non-rendering situations? Or perhaps
here's a better question: does anyone know any speakers of non-Latin-script
languages (Japanese and Arabic come to mind) who use Python to process text
in their own language? They could perhaps tell us what bugs them most about
Python's current API and which standard libraries need work.
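
For concreteness, here is a minimal sketch of what "basic cluster combination" might mean: attach every combining mark to the base character before it. The function name simple_clusters is made up for illustration, and this is only an approximation. It relies on unicodedata.combining(), which reports the canonical combining class, so it does not implement the full Unicode grapheme cluster rules (Hangul syllables, joiners, and marks with class 0 are not handled).

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Illustrative approximation only, not the full Unicode segmentation rules.
    """
    clusters = []
    for ch in text:
        # A nonzero combining class means ch modifies the preceding character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(word))                   # 5 code points
print(len(simple_clusters(word)))  # 4 clusters
```

Even this naive version makes len() and slicing count what a reader would call characters for simple accented text, which is the kind of non-rendering use I am asking about.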
On Dec 2, 2013 10:10 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:
> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
>
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> >>>
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
> >>> failures or dubious. If you believe that the native string type should
> >>> operate on code-points, then you'll think that Python does the right
> >>> thing.
> >>
> >> I think Python is doing it correctly. If I want to operate on
> >> "clusters" I'll normalize the string first.
> >>
> >> Thanks for this excellent post.
> >>
> >> --
> >> ~Ethan~
> >
> > This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> > that some grapheme clusters (or whatever the right word is) can't be
> > normalized down to a single code point? Characters can accept many
> > accents, for example. In that case, you can't always normalize and use
> > the existing string methods, but would need more specialized code.
>
> That is correct.
>
> If Unicode had a distinct code point for every possible combination of
> base-character plus an arbitrary number of diacritics or accents, the
> 0x10FFFF code points wouldn't be anywhere near enough.
>
> I see over 300 diacritics used just in the first 5000 code points. Let's
> pretend that's only 100, and that you can use up to a maximum of 5 at a
> time. That gives 79375496 combinations per base character, far more
> than the total number of Unicode code points.
>
> If anyone wishes to check my logic:
>
> # count distinct combining chars
> import unicodedata
> s = ''.join(chr(i) for i in range(33, 5000))
> s = unicodedata.normalize('NFD', s)
> t = [c for c in s if unicodedata.combining(c)]
> len(set(t))
>
> # calculate the number of combinations
> def comb(r, n):
>     """Combinations nCr: ways to choose r items from n."""
>     p = 1
>     for i in range(r+1, n+1):
>         p *= i
>     for i in range(1, n-r+1):
>         p //= i  # exact: p stays an integer at every step
>     return p
>
> sum(comb(i, 100) for i in range(6))
>
>
> I'm not suggesting that all of those accents are necessarily in use in
> the real world, but there are languages which construct arbitrary
> combinations of accents. (Or so I have been led to believe.)
>
>
> --
> Steven
> --
> https://mail.python.org/mailman/listinfo/python-list
>
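
The two points quoted above can be checked with a short sketch. It assumes Python 3.8 or later for math.comb; the sample strings are my own choice and do not come from the thread. NFC composes 'e' plus an acute accent into one code point, but 'q' plus a tilde has no precomposed form, so it stays as two.

```python
import math
import unicodedata

# 'e' + COMBINING ACUTE ACCENT has a precomposed form (U+00E9).
composed = unicodedata.normalize("NFC", "e\u0301")
print(len(composed))  # 1

# 'q' + COMBINING TILDE has no precomposed form, so NFC leaves two code points.
stacked = unicodedata.normalize("NFC", "q\u0303")
print(len(stacked))  # 2

# Ways to pick 0 to 5 accents out of 100, as in the quoted estimate.
total = sum(math.comb(100, i) for i in range(6))
print(total)             # 79375496
print(total > 0x10FFFF)  # True
```

So normalizing first only helps for the combinations that happen to have a precomposed code point.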