Re: Case-insensitive sorting of strings (Python newbie)

References <54C27E13.5090808@ntlworld.com> <mailman.18046.1422035592.18130.python-list@python.org> <873871fgxk.fsf@elektro.pacujo.net>
Date 2015-01-24 06:56 +1100
Subject Re: Case-insensitive sorting of strings (Python newbie)
From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.18057.1422042982.18130.python-list@python.org> (permalink)



On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
>     >>> print("re\u0301sume\u0301")
>     résumé
>     >>> print("r\u00e9sum\u00e9")
>     résumé
>     >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
>     False
>     >>> print("\ufb01nd")
>     find
>     >>> print("find")
>     find
>     >>> print("\ufb01nd" == "find")
>     False
>
> If equality can't be determined, words really can't be sorted.

Ah, that's a bit easier to deal with. Just use Unicode normalization.

>>> import unicodedata
>>> print(unicodedata.normalize("NFC", "re\u0301sume\u0301") == unicodedata.normalize("NFC", "r\u00e9sum\u00e9"))
True

It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:

import unicodedata

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes (NFC), trims leading and trailing whitespace, and case-folds.
    """
    return unicodedata.normalize("NFC", s).strip().casefold()

Then it's much tidier:

>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True
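
Since the thread is about sorting, here is a minimal sketch of passing that
function to sorted() via key=. The word list is made up for illustration.
Note the ordering within the result is plain code-point order of the folded
keys, not locale-aware collation, so this only addresses the "which strings
count as equal" part of the problem.

```python
import unicodedata

def key(s):
    """Compose (NFC), trim surrounding whitespace, and case-fold."""
    return unicodedata.normalize("NFC", s).strip().casefold()

# Hypothetical sample data: two spellings of "résumé" and two of "find".
words = ["Zebra", "re\u0301sume\u0301", "apple", "R\u00e9sum\u00e9", "\ufb01nd", "Find"]

# sorted() is stable, so strings with equal keys keep their input order.
result = sorted(words, key=key)
print(result)
```

Strings that key() treats as equal end up adjacent, in their original
relative order.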

You may want to go further, too. For search comparisons, you'll want
to use NFKC normalization, and probably collapse every run of Unicode
whitespace into a single U+0020, or completely strip out zero-width
no-break spaces (U+FEFF), and maybe zero-width spaces (U+200B) too,
and so on. It all depends on what you mean by "equality". But
certainly a basic NFC or NFD normalization is safe for general work.
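
A rough sketch of what such a search-oriented key might look like. The name
search_key and the choice of exactly which zero-width characters to drop
(U+200B and U+FEFF here) are assumptions for illustration; what belongs in
that set depends on your definition of "equality".

```python
import re
import unicodedata

# Assumption: only these two zero-width code points are stripped.
# dict.fromkeys maps each ordinal to None, which str.translate deletes.
ZERO_WIDTH = dict.fromkeys([0x200B, 0xFEFF])

def search_key(s):
    """Fold a string aggressively for search-style comparisons."""
    s = s.translate(ZERO_WIDTH)            # drop zero-width characters
    s = unicodedata.normalize("NFKC", s)   # compatibility composition
    s = re.sub(r"\s+", " ", s).strip()     # whitespace runs -> one U+0020
    return s.casefold()

print(search_key("\ufeff  Re\u0301sume\u0301\u00a0\u2003 \ufb01le "))
print(search_key("fi\u200bnd") == search_key("\ufb01nd"))
```

NFKC is lossy (it flattens ligatures, width variants and the like), so a key
like this is for matching only; keep the original string for display.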

ChrisA
