Date: Mon, 02 Dec 2013 13:27:08 -0800
From: Ethan Furman <ethan@stoneleaf.us>
To: python-list@python.org
Subject: Re: Python Unicode handling wins again -- mostly
References: <529934dc$0$29993$c3e8da3$5496439d@news.astraweb.com> <529CEFB1.2030007@stoneleaf.us> <l7it6r$e6e$1@ger.gmane.org> <CAPTjJmoD9+Ka6S6YRGgGdPu2A6Lx1bKebkKNUN+mOWjg7JMGEw@mail.gmail.com>
In-Reply-To: <CAPTjJmoD9+Ka6S6YRGgGdPu2A6Lx1bKebkKNUN+mOWjg7JMGEw@mail.gmail.com>
Newsgroups: comp.lang.python
Message-ID: <mailman.3484.1386020836.18130.python-list@python.org>

On 12/02/2013 01:23 PM, Chris Angelico wrote:
> On Tue, Dec 3, 2013 at 8:14 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:
>> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case that
>> some grapheme clusters (or whatever the right word is) can't be normalized
>> down to a single code point?  Characters can accept many accents, for
>> example.
>
> You can't normalize everything down to a single code point, but you
> can normalize the other way by breaking out everything that can be
> broken out.
>
> >>> print(ascii(unicodedata.normalize("NFKC", "ä")))
> '\xe4'
> >>> print(ascii(unicodedata.normalize("NFKD", "ä")))
> 'a\u0308'

Well, Stephen was right then!  There's room for a library to handle this situation.  Or is there one already?
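
To make Ned's point concrete, here is a rough sketch using only the standard unicodedata module. The helper names nfc_length and naive_clusters are made up for illustration. Grouping on unicodedata.combining() is a simplification that should cover ordinary accents; it is not the full Unicode grapheme cluster algorithm (UAX #29), so treat it as a starting point rather than the library being asked about.

```python
import unicodedata


def nfc_length(text):
    """Return how many code points remain after NFC composition."""
    return len(unicodedata.normalize("NFC", text))


def naive_clusters(text):
    """Group each character with the combining marks that follow it.

    Hypothetical helper: it only looks at the canonical combining class,
    so it ignores ZWJ sequences, Hangul jamo, and other UAX #29 rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Non-zero combining class: attach to the previous cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "a" + COMBINING DIAERESIS has a precomposed form (U+00E4).
print(nfc_length("a\u0308"))  # 1
# "q" + COMBINING DIAERESIS has no precomposed form, so it stays at two.
print(nfc_length("q\u0308"))  # 2
# Two user-visible characters, four code points.
print(len(naive_clusters("q\u0308a\u0308")))  # 2
```

So normalizing to NFC shortens some sequences but cannot promise one code point per visible character; counting or slicing by "character" needs some kind of cluster grouping on top.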

--
~Ethan~