From: Vlastimil Brom
Newsgroups: comp.lang.python
Subject: Re: [SOLVED] Least-lossy string.encode to us-ascii?
Date: Fri, 14 Sep 2012 09:38:33 +0200
References: <50524F6F.6070604@tim.thechases.com> <505258E1.7020500@tim.thechases.com>

2012/9/14 Tim Chase:
> On 09/13/12 16:44, Vlastimil Brom wrote:
>> >>> import unicodedata
>> >>> unicodedata.normalize("NFD", u"serviço móvil").encode("ascii", "ignore").decode("ascii")
>> u'servico movil'
>
> Works well for all the test-cases I threw at it. Thanks!
>
> -tkc

Hi,
I am glad it works, but I agree with the other comments that it would be
preferable to keep the original accented text throughout the whole
processing, if at all possible.

The above works by decomposing the accented characters into "basic"
characters and the bare accents (combining diacritics) using normalize(),
and then just stripping anything outside ASCII in encode("...", "ignore").

This works for "combinable" accents, and most of the Portuguese characters
outside of ASCII appear to fall into this category, but there are others
as well. E.g. according to
http://tlt.its.psu.edu/suggestions/international/bylanguage/portuguese.html
there are at least ºª«»€, which would be lost completely in such a
conversion.
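
To make the loss concrete, here is a minimal sketch, assuming Python 3 (the u"" prefixes are kept only to match the session above). The helper name strip_to_ascii is made up for illustration; it just wraps the one-liner quoted earlier.

```python
import unicodedata

def strip_to_ascii(text):
    # Hypothetical helper name; wraps the NFD + encode("ascii", "ignore")
    # one-liner quoted earlier in this message.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# "ç" decomposes into "c" plus a combining cedilla, so the base letter
# survives once the non-ASCII combining mark is dropped.
print([unicodedata.name(c) for c in unicodedata.normalize("NFD", u"ç")])

# None of these characters has a canonical decomposition, so NFD leaves
# each one unchanged and encode(..., "ignore") drops it entirely.
for ch in u"ºª«»€":
    print(unicodedata.name(ch), repr(strip_to_ascii(ch)))
```

The accented letters keep their base character, while each of the five characters listed above comes back as an empty string.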
ª (dec.: 170) (hex.: 0xaa) # FEMININE ORDINAL INDICATOR
º (dec.: 186) (hex.: 0xba) # MASCULINE ORDINAL INDICATOR

You can preprocess such cases as appropriate before doing the conversion,
e.g. just:

>>> u"ºª«»€".replace(u"º", u"o").replace(u"ª", u"a").replace(u"«", u'"').replace(u"»", u'"').replace(u"€", u"EUR")
u'oa""EUR'
>>>

or use a more elegant function with replacement lists (possibly handling
other cases as well).

regards,
vbr
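
One possible shape for such a function, as a rough sketch only: the table is applied first, then the NFD/ignore step handles the combinable accents. The names REPLACEMENTS and to_ascii are invented for this example, and the substitutions are just the ones from the replace() chain above, to be adapted as appropriate.

```python
import unicodedata

# Hypothetical replacement table (single characters only); the entries are
# the same substitutions used in the replace() chain above.
REPLACEMENTS = {
    u"º": u"o",
    u"ª": u"a",
    u"«": u'"',
    u"»": u'"',
    u"€": u"EUR",
}

# translate() expects code points as keys, hence ord() on each character.
_TABLE = {ord(src): dst for src, dst in REPLACEMENTS.items()}

def to_ascii(text):
    # 1. Substitute the characters that have no combining-accent form.
    text = text.translate(_TABLE)
    # 2. Decompose accented letters and drop whatever is still non-ASCII.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii(u"ºª«»€"))
print(to_ascii(u"serviço móvil"))
```

Anything that is neither in the table nor decomposable is still dropped silently, so the table needs an entry for every character you care about keeping.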