Date: Sat, 26 Oct 2013 21:17:29 -0500
From: Tim Chase <python.list@tim.thechases.com>
To: Roy Smith <roy@panix.com>
Subject: Re: trying to strip out non ascii.. or rather convert non ascii
Cc: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.1632.1382840149.18130.python-list@python.org>

On 2013-10-26 21:54, Roy Smith wrote:
> In article <mailman.1628.1382838024.18130.python-list@python.org>,
>  Tim Chase <python.list@tim.thechases.com> wrote:
>> I'd be just as happy if Python provided a "sloppy string compare"
>> that ignored case, diacritical marks, and the like.
> 
> The problem with putting fuzzy matching in the core language is
> that there is no general agreement on how it's supposed to work.
> 
> There are, however, third-party libraries which do fuzzy matching.
> One popular one is jellyfish
> (https://pypi.python.org/pypi/jellyfish/0.1.2).

Bookmarking and archiving your email for future reference.

> Don't expect you can just download and use it right out of the box,
> however. You'll need to do a little thinking about which of the
> several algorithms it includes makes sense for your application.

I'd be content with a baseline that decomposes the string (NFKD) and
then strips out combining diacritical marks, something akin to MRAB's

  from unicodedata import normalize
  # s is the input str: decompose it, then keep only ASCII code points
  "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

and tweaking it if that was insufficient.
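
As a rough sketch of what that baseline and one possible tweak might look
like (the function names strip_to_ascii, strip_marks and sloppy_equal are
made up for illustration; only unicodedata.normalize,
unicodedata.combining and str.casefold come from the standard library,
and casefold assumes Python 3.3 or later):

```python
from unicodedata import combining, normalize


def strip_to_ascii(s):
    """Decompose with NFKD, then keep only ASCII code points.

    This drops the combining marks, but it also drops any character
    that has no ASCII decomposition (for example the German sharp s).
    """
    return "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)


def strip_marks(s):
    """Decompose with NFKD, then drop only the combining marks.

    Characters without a decomposition are kept unchanged.
    """
    return "".join(c for c in normalize("NFKD", s) if not combining(c))


def sloppy_equal(a, b):
    """Hypothetical 'sloppy compare': ignore diacritics and case."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()


print(strip_to_ascii("café"))        # cafe
print(strip_marks("naïve"))          # naive
print(sloppy_equal("Café", "CAFE"))  # True
```

The ASCII filter is the bluntest version. Filtering on combining() instead
is one tweak if losing undecomposable letters turns out to matter.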

Thanks for the link to Jellyfish.

-tkc