


Re: [SOLVED] Least-lossy string.encode to us-ascii?

Date Fri, 14 Sep 2012 09:38:33 +0200
Subject Re: [SOLVED] Least-lossy string.encode to us-ascii?
From Vlastimil Brom <vlastimil.brom@gmail.com>
To Python <python-list@python.org>
Newsgroups comp.lang.python
Message-ID <mailman.683.1347608316.27098.python-list@python.org>



2012/9/14 Tim Chase <python.list@tim.thechases.com>:
> On 09/13/12 16:44, Vlastimil Brom wrote:
>> >>> import unicodedata
>> >>> unicodedata.normalize("NFD", u"serviço móvil").encode("ascii", "ignore").decode("ascii")
>> u'servico movil'
>
> Works well for all the test-cases I threw at it.  Thanks!
>
> -tkc
>
>

Hi,
I am glad it works, but I agree with the other comments that it
would be preferable to keep the original accented text throughout
the whole processing, if at all possible.
The above works by decomposing the accented characters into "basic"
characters and the bare accents (combining diacritics) using
normalize(), and then just stripping anything outside ASCII in
encode("...", "ignore").
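
To make the two steps visible, here is a small sketch. It assumes Python 3 (the session quoted above is Python 2, hence the u"" prefixes), and the variable names are only illustrative.

```python
import unicodedata

text = "serviço móvil"

# Step 1: NFD splits each accented letter into a base letter
# followed by a combining mark.
decomposed = unicodedata.normalize("NFD", text)

# List the combining marks that the decomposition produced.
marks = [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]
print(marks)

# Step 2: encoding with errors="ignore" silently drops every non-ASCII
# code point, which here means exactly those combining marks.
ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_text)
```

The decomposed string is two code points longer than the original, one for the cedilla and one for the acute accent, and those are the only things removed by the encode step.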
This works for "combinable" accents, and most of the Portuguese
characters outside of ascii appear to fall into this category, but
there are others as well.
E.g. according to
http://tlt.its.psu.edu/suggestions/international/bylanguage/portuguese.html
there are at least ºª«»€, which would be lost completely in such a conversion.
ª (dec.: 170)  (hex.: 0xaa) # FEMININE ORDINAL INDICATOR
º (dec.: 186)  (hex.: 0xba) # MASCULINE ORDINAL INDICATOR
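
A quick check of that loss, again as a Python 3 sketch; strip_to_ascii is a hypothetical helper name, not something from the thread. As a side note that goes beyond the approach discussed above, the compatibility form "NFKD" would map the two ordinal indicators to plain letters, but it still drops the quotation marks and the euro sign, and it changes other characters too, so treat it as something to test rather than a recommendation.

```python
import unicodedata


def strip_to_ascii(text, form="NFD"):
    """Normalize text with the given form, then drop everything non-ASCII."""
    normalized = unicodedata.normalize(form, text)
    return normalized.encode("ascii", "ignore").decode("ascii")


symbols = "ºª«»€"

# None of these characters has a canonical decomposition, so NFD leaves
# them untouched and the encode step discards all of them.
print(repr(strip_to_ascii(symbols, "NFD")))

# NFKD applies compatibility decompositions as well, which covers the
# ordinal indicators but not the guillemets or the euro sign.
print(repr(strip_to_ascii(symbols, "NFKD")))
```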

You can preprocess such cases as appropriate before doing the
conversion, e.g. just:

>>> u"ºª«»€".replace(u"º", u"o").replace(u"ª", u"a").replace(u"«", u'"').replace(u"»", u'"').replace(u"€", u"EUR")
u'oa""EUR'
>>>
or use a more elegant function with replacement lists (possibly
handling other cases as well).
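
One possible shape for such a function, as a Python 3 sketch: the names REPLACEMENTS and to_ascii are hypothetical, and the table only holds the five characters mentioned above, so it would need extending for real data.

```python
import unicodedata

# Hypothetical replacement table for characters that have no canonical
# decomposition; extend it for whatever your data contains.
REPLACEMENTS = {
    "º": "o",
    "ª": "a",
    "«": '"',
    "»": '"',
    "€": "EUR",
}

# str.translate() expects code points (integers) as keys.
_TABLE = {ord(char): repl for char, repl in REPLACEMENTS.items()}


def to_ascii(text):
    """Return a lossy ASCII approximation of text."""
    # Handle the special cases first, then strip combinable accents.
    text = text.translate(_TABLE)
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("serviço móvil"))
print(to_ascii("«1º lugar» 5€"))
```

Doing the explicit replacements before the normalization keeps the table in charge of those characters; anything not listed and not decomposable is still dropped silently.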

regards,
   vbr
