From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Case-insensitive sorting of strings (Python newbie)
Date: Sat, 24 Jan 2015 06:56:19 +1100

On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
>    >>> print("re\u0301sume\u0301")
>    résumé
>    >>> print("r\u00e9sum\u00e9")
>    résumé
>    >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
>    False
>    >>> print("\ufb01nd")
>    find
>    >>> print("find")
>    find
>    >>> print("\ufb01nd" == "find")
>    False
>
> If equality can't be determined, words really can't be sorted.

Ah, that's a bit easier to deal with. Just use Unicode normalization.

>>> import unicodedata
>>> print(unicodedata.normalize("NFC","re\u0301sume\u0301") == unicodedata.normalize("NFC","r\u00e9sum\u00e9"))
True

It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims excess spaces.
    """
    return unicodedata.normalize("NFC",s).strip().casefold()

Then it's much tidier:

>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True

You may want to go further, too; for search comparisons, you'll want
to use NFKC normalization, and probably translate all strings of
Unicode whitespace into single U+0020s, or completely strip out
zero-width non-breaking spaces (and maybe zero-width breaking spaces,
too), etc, etc. It all depends on what you mean by "equality". But
certainly a basic NFC or NFD normalization is safe for general work.

ChrisA
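
For anyone who wants the "go further" version spelled out, here is a
rough sketch. The name search_key and the choice of which zero-width
characters to strip are illustrative assumptions, not something from the
post: I'm taking "zero-width non-breaking space" to mean U+FEFF and
"zero-width breaking space" to mean U+200B. Adjust to whatever your own
idea of "equality" is.

```python
import re
import unicodedata

# Assumption: these are the zero-width characters worth discarding.
# Mapping a code point to None in str.translate() deletes it.
_ZERO_WIDTH = dict.fromkeys([0x200B, 0xFEFF])


def search_key(s):
    """Fold a string aggressively for search-style comparison.

    Hypothetical helper: NFKC-normalizes, drops zero-width characters,
    collapses runs of Unicode whitespace to one U+0020, trims, and
    case-folds.
    """
    s = unicodedata.normalize("NFKC", s)  # compatibility composition
    s = s.translate(_ZERO_WIDTH)          # remove U+200B and U+FEFF
    s = re.sub(r"\s+", " ", s)            # any whitespace run -> one space
    return s.strip().casefold()


if __name__ == "__main__":
    print(search_key("\ufb01nd") == search_key("find"))
    print(search_key("re\u0301sume\u0301") == search_key("R\u00c9SUM\u00c9"))
```

It can also be passed as a sort key, e.g. sorted(words, key=search_key),
but note that this sorts the folded strings by code point, which is not
the same thing as locale-aware collation.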