Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!feeds.phibee-telecom.net!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <873871fgxk.fsf@elektro.pacujo.net>
References: <54C27E13.5090808@ntlworld.com> <mailman.18046.1422035592.18130.python-list@python.org> <873871fgxk.fsf@elektro.pacujo.net>
Date: Sat, 24 Jan 2015 06:56:19 +1100
Subject: Re: Case-insensitive sorting of strings (Python newbie)
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.18057.1422042982.18130.python-list@python.org>
Lines: 53
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:84392

On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
>     >>> print("re\u0301sume\u0301")
>     résumé
>     >>> print("r\u00e9sum\u00e9")
>     résumé
>     >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
>     False
>     >>> print("\ufb01nd")
>     find
>     >>> print("find")
>     find
>     >>> print("\ufb01nd" == "find")
>     False
>
> If equality can't be determined, words really can't be sorted.

Ah, that's a bit easier to deal with. Just use Unicode normalization.

>>> import unicodedata
>>> print(unicodedata.normalize("NFC","re\u0301sume\u0301") == unicodedata.normalize("NFC","r\u00e9sum\u00e9"))
True

It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims excess spaces.
    """
    return unicodedata.normalize("NFC",s).strip().casefold()

Then it's much tidier:

>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True
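
Since the thread is about sorting, here is a minimal sketch of the same key function passed to sorted(). The word list is made up for illustration. Note that this orders by the code points of the folded strings, not by any locale's collation rules, so it may not match what a dictionary would do with accented letters.

```python
import unicodedata

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims excess spaces.
    """
    return unicodedata.normalize("NFC", s).strip().casefold()

# Hypothetical sample data: mixed case, composed and decomposed
# accents, the "fi" ligature (U+FB01), and a trailing space.
words = ["Zebra", "r\u00e9sum\u00e9", "apple",
         "Re\u0301sume\u0301", "\ufb01nd", "Find "]

# sorted() is stable, so strings whose keys compare equal keep
# their original relative order.
ordered = sorted(words, key=key)
print(ordered)
```

The original strings come back untouched; only the comparison uses the folded form.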

You may want to go further, too; for search comparisons, you'll want
to use NFKC normalization, and probably translate all runs of
Unicode whitespace into single U+0020s, or completely strip out
zero-width non-breaking spaces (U+FEFF), and maybe zero-width breaking
spaces (U+200B) too, etc, etc. It all depends on what you mean by
"equality". But certainly a basic NFC or NFD normalization is safe for
general work.
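
A rough sketch of what such a search-oriented key might look like. The name search_key and the exact set of zero-width characters to drop are my assumptions here, not anything standard; adjust them to whatever "equality" means for your data.

```python
import re
import unicodedata

# Characters to delete outright (an assumed, minimal set):
# U+200B ZERO WIDTH SPACE and U+FEFF ZERO WIDTH NO-BREAK SPACE.
_ZERO_WIDTH = dict.fromkeys([0x200B, 0xFEFF])

def search_key(s):
    """Aggressively normalize a string for search comparisons."""
    # Compatibility composition: folds ligatures, fullwidth forms,
    # no-break spaces and the like into their plain equivalents.
    s = unicodedata.normalize("NFKC", s)
    # Mapping a code point to None in translate() deletes it.
    s = s.translate(_ZERO_WIDTH)
    # For str patterns, \s matches Unicode whitespace, so any run
    # of it collapses to a single U+0020.
    s = re.sub(r"\s+", " ", s)
    return s.strip().casefold()

print(search_key("  Hello\u00a0\u2003 World\u200b "))
```

This is lossy on purpose, so use it only to build comparison keys and keep the original text for display.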

ChrisA