Date: Sat, 26 Oct 2013 20:41:58 -0500
From: Tim Chase
To: python-list@python.org
Subject: Re: trying to strip out non ascii.. or rather convert non ascii
In-Reply-To: <526c412a$0$29972$c3e8da3$5496439d@news.astraweb.com>
Newsgroups: comp.lang.python

On 2013-10-26 22:24, Steven D'Aprano wrote:
> Why on earth would you want to throw away perfectly good
> information?

The main reason I've needed to do it in the past is for normalization
of search queries. When a user wants to find something containing
"pingüino", I want to have those results come back even if they type
"pinguino" in the search box.

For the same reason searches are often normalized to ignore case.
The difference between "Polish" and "polish" is visually just
capitalization, but most folks don't think twice about

  if term.upper() in datum.upper():
    it_matches()

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

  unicode_haystack1 = u"pingüino"
  unicode_haystack2 = u"¡Miré un pingüino!"
  needle = u"pinguino"
  if unicode_haystack1.sloppy_equals(needle):
    it_matches()
  if unicode_haystack2.sloppy_contains(needle):
    it_contains()

As a matter of fact, I'd even be happier if Python did the heavy
lifting, since I wouldn't have to think about whether I want my code
to force upper-vs-lower for the comparison. :-)

-tkc
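
For illustration, a "sloppy string compare" along these lines can be approximated with the standard library. This is only a rough sketch, assuming Python 3: sloppy_key, sloppy_equals and sloppy_contains are hypothetical free functions named after the wished-for methods (str has no such methods), and stripping combining marks after NFKD decomposition is one possible approach, not an official recipe.

```python
import unicodedata


def sloppy_key(text):
    """Return a case- and diacritic-insensitive form of text (hypothetical helper)."""
    # NFKD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() meant for caseless matching.
    return stripped.casefold()


def sloppy_equals(haystack, needle):
    """True if the two strings match once case and diacritics are ignored."""
    return sloppy_key(haystack) == sloppy_key(needle)


def sloppy_contains(haystack, needle):
    """True if needle occurs in haystack once case and diacritics are ignored."""
    return sloppy_key(needle) in sloppy_key(haystack)


print(sloppy_equals("pingüino", "pinguino"))              # True
print(sloppy_contains("¡Miré un pingüino!", "PINGUINO"))  # True
```

Using casefold() sidesteps the upper-vs-lower decision. One known gap: letters with no Unicode decomposition, such as "ø", pass through unchanged, so a real search feature would likely need a mapping table or a collation library on top of this.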