| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | Sat, 24 Jan 2015 06:56:19 +1100 |
| Subject | Re: Case-insensitive sorting of strings (Python newbie) |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.18057.1422042982.18130.python-list@python.org> |
| In-Reply-To | <873871fgxk.fsf@elektro.pacujo.net> |
| References | <54C27E13.5090808@ntlworld.com> <mailman.18046.1422035592.18130.python-list@python.org> <873871fgxk.fsf@elektro.pacujo.net> |
On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
> >>> print("re\u0301sume\u0301")
> résumé
> >>> print("r\u00e9sum\u00e9")
> résumé
> >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
> False
> >>> print("\ufb01nd")
> find
> >>> print("find")
> find
> >>> print("\ufb01nd" == "find")
> False
>
> If equality can't be determined, words really can't be sorted.
Ah, that's a bit easier to deal with. Just use Unicode normalization.
>>> import unicodedata
>>> print(unicodedata.normalize("NFC", "re\u0301sume\u0301") == unicodedata.normalize("NFC", "r\u00e9sum\u00e9"))
True
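To see what the normalization is actually doing, here's a small sketch (the variable names are just mine, for illustration). It compares the two spellings of the word by length, then normalizes in each direction:

```python
import unicodedata

decomposed = "re\u0301sume\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "r\u00e9sum\u00e9"      # precomposed U+00E9

# The two strings render the same but hold different numbers of code points.
print(len(decomposed), len(composed))  # 8 6

# NFC composes, NFD decomposes; either one gives a common form to compare.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```

Either form works for equality, as long as both sides go through the same one.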
It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:
def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims excess spaces.
    """
    return unicodedata.normalize("NFC", s).strip().casefold()
Then it's much tidier:
>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True
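Since the thread is about sorting, here's a rough sketch of passing that same function to sorted(). The word list is made up for illustration. Note that the ligature case works because casefold() expands U+FB01 to "fi"; NFC on its own leaves the ligature alone.

```python
import unicodedata

def key(s):
    # Same folding as the key function in this post: compose, trim, case-fold.
    return unicodedata.normalize("NFC", s).strip().casefold()

# Made-up sample data, purely for illustration.
words = ["Zebra", "r\u00e9sum\u00e9", "apple", "RE\u0301SUME\u0301", "\ufb01nd", "Find"]

plain = sorted(words)            # raw code-point order: capitals sort before lowercase
folded = sorted(words, key=key)  # order by the folded form instead

print(plain)
print(folded)

# NFC does not touch the ligature; casefold() is what turns it into "fi".
print(unicodedata.normalize("NFC", "\ufb01nd") == "find")  # False
print("\ufb01nd".casefold() == "find")                     # True
```

Strings whose keys compare equal keep their original relative order, because sorted() is stable. This is still code-point ordering of the folded keys, not language-aware collation, so accented letters will not necessarily land where a dictionary would put them.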
You may want to go further, too; for search comparisons, you'll want
to use NFKC normalization, and probably translate all runs of
Unicode whitespace into single U+0020s, or completely strip out
zero-width no-break spaces (U+FEFF) (and maybe zero-width spaces,
U+200B, too), etc, etc. It all depends on what you mean by "equality". But
certainly a basic NFC or NFD normalization is safe for general work.
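A more aggressive key for search might look something like the sketch below. Treat it as one possible reading of the paragraph above, not a definitive recipe: the name search_key and the exact set of characters to drop (U+FEFF and U+200B) are my assumptions, and you'd adjust them to whatever "equality" means for your data.

```python
import unicodedata

# Assumed set of zero-width characters to delete; extend as needed.
# Mapping a code point to None makes str.translate() drop it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\ufeff\u200b"))

def search_key(s):
    """Fold a string for loose, search-style comparison (illustrative only)."""
    s = unicodedata.normalize("NFKC", s)  # compatibility forms, e.g. the fi ligature
    s = s.translate(ZERO_WIDTH)           # strip the zero-width characters
    s = " ".join(s.split())               # whitespace runs -> one U+0020, ends trimmed
    return s.casefold()

messy = "  \ufb01nd\u2003 the\u200b \ufeffR\u00e9sum\u00e9 "
clean = "find the re\u0301sume\u0301"
print(search_key(messy) == search_key(clean))
```

If you need to be strict about it, case folding can in some cases produce text that is no longer normalized, so a second normalize() pass after casefold() may be worth adding.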
ChrisA