Date: Sat, 26 Oct 2013 20:41:58 -0500
From: Tim Chase <python.list@tim.thechases.com>
To: python-list@python.org
Subject: Re: trying to strip out non ascii.. or rather convert non ascii
In-Reply-To: <526c412a$0$29972$c3e8da3$5496439d@news.astraweb.com>
References: <mailman.1604.1382818293.18130.python-list@python.org> <526c412a$0$29972$c3e8da3$5496439d@news.astraweb.com>
Newsgroups: comp.lang.python
Message-ID: <mailman.1628.1382838024.18130.python-list@python.org>

On 2013-10-26 22:24, Steven D'Aprano wrote:
> Why on earth would you want to throw away perfectly good
> information?

The main reason I've needed to do it in the past is for normalization
of search queries.  When a user wants to find something containing
"pingüino", I want to have those results come back even if they type
"pinguino" in the search box.

For the same reason searches are often normalized to ignore case.
The difference between "Polish" and "polish" is visually just
capitalization, but most folks don't think twice about

  if term.upper() in datum.upper():
    it_matches()

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

  unicode_haystack1 = u"pingüino"
  unicode_haystack2 = u"¡Miré un pingüino!"
  needle = u"pinguino"
  if unicode_haystack1.sloppy_equals(needle):
    it_matches()
  if unicode_haystack2.sloppy_contains(needle):
    it_contains()

As a matter of fact, I'd even be happier if Python did the heavy
lifting, since I wouldn't have to think about whether I want my code
to force upper-vs-lower for the comparison. :-)
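
For what it's worth, something close to this can be approximated with
the standard library today.  The following is only a minimal sketch,
assuming Python 3.3+ (for str.casefold); sloppy_key, sloppy_equals and
sloppy_contains are hypothetical helper functions, not methods that
str actually provides.  It decomposes each character with NFKD, drops
the combining marks, and case-folds the result:

```python
import unicodedata


def sloppy_key(text):
    """Return text with combining marks removed and case folded."""
    # NFKD splits an accented letter such as "ü" into its base letter
    # "u" followed by a combining mark (U+0308 COMBINING DIAERESIS).
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() is non-zero for combining marks; drop them.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more aggressive lower(), so there is no need to
    # choose between upper() and lower() for the comparison.
    return stripped.casefold()


def sloppy_equals(haystack, needle):
    # Hypothetical helper: equality ignoring case and diacritics.
    return sloppy_key(haystack) == sloppy_key(needle)


def sloppy_contains(haystack, needle):
    # Hypothetical helper: substring test ignoring case and diacritics.
    return sloppy_key(needle) in sloppy_key(haystack)


if sloppy_equals("pingüino", "pinguino"):
    print("it matches")
if sloppy_contains("¡Miré un pingüino!", "PINGUINO"):
    print("it contains")
```

This only handles letters that Unicode decomposes into a base letter
plus combining marks.  Letters such as "ø" or "ł" have no such
decomposition and pass through unchanged, so a real search
normalizer would likely need an extra mapping table or a proper
collation library.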

-tkc