


Re: [SOLVED] Least-lossy string.encode to us-ascii?

Started by: Vlastimil Brom <vlastimil.brom@gmail.com>
First post: 2012-09-14 09:38 +0200
Last post: 2012-09-14 09:38 +0200
Articles: 1 (1 participant)


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.



#29136 — Re: [SOLVED] Least-lossy string.encode to us-ascii?

From: Vlastimil Brom <vlastimil.brom@gmail.com>
Date: 2012-09-14 09:38 +0200
Subject: Re: [SOLVED] Least-lossy string.encode to us-ascii?
Message-ID: <mailman.683.1347608316.27098.python-list@python.org>

2012/9/14 Tim Chase <python.list@tim.thechases.com>:
> On 09/13/12 16:44, Vlastimil Brom wrote:
>> >>> import unicodedata
>> >>> unicodedata.normalize("NFD", u"serviço móvil").encode("ascii", "ignore").decode("ascii")
>> u'servico movil'
>
> Works well for all the test-cases I threw at it.  Thanks!
>
> -tkc
>
>

Hi,
I am glad it works, but I agree with the other comments that it
would be preferable to keep the original accented text throughout
the whole processing, if at all possible.
The above works by using normalize() to decompose the accented
characters into "basic" characters and the bare accents (combining
diacritics), and then stripping anything outside ASCII with
encode("...", "ignore").
This works for "combinable" accents, and most of the Portuguese
characters outside of ASCII appear to fall into this category, but
there are others as well.
E.g. according to
http://tlt.its.psu.edu/suggestions/international/bylanguage/portuguese.html
there are at least ºª«»€, which would be lost completely in such a conversion.
ª (dec.: 170)  (hex.: 0xaa) # FEMININE ORDINAL INDICATOR
º (dec.: 186)  (hex.: 0xba) # MASCULINE ORDINAL INDICATOR
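
To make the difference between the two groups concrete, here is a minimal sketch, assuming Python 3 (where the u prefix is optional; it is kept only for consistency with the session above). The helper name strip_accents is made up for illustration and just wraps the one-liner quoted earlier.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: decompose to NFD, then drop all non-ASCII code points."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# "ç" has a canonical decomposition: a base letter plus a combining mark.
# Only the combining mark is non-ASCII, so the base letter survives.
print([unicodedata.name(ch) for ch in unicodedata.normalize("NFD", u"ç")])

# The ordinal indicators, guillemets and euro sign have no canonical
# decomposition, so NFD leaves them untouched and "ignore" drops them whole.
print(repr(strip_accents(u"serviço móvil")))
print(repr(strip_accents(u"ºª«»€")))
```

The ordinal indicators do have a compatibility decomposition, so normalizing with "NFKD" instead of "NFD" would turn º and ª into o and a, but the guillemets and the euro sign would still be dropped.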

You can preprocess such cases as appropriate before doing the
conversion, e.g. just:

>>> u"ºª«»€".replace(u"º", u"o").replace(u"ª", u"a").replace(u"«", u'"').replace(u"»", u'"').replace(u"€", u"EUR")
u'oa""EUR'
>>>
or use a more elegant function with replacement lists (possibly
handling other cases as well).
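
One way such a function might look is sketched below, again assuming Python 3. The names REPLACEMENTS and to_ascii are hypothetical, and the table only covers the five characters mentioned above; it would need extending for other input.

```python
import unicodedata

# Hypothetical replacement table for characters that NFD cannot decompose.
# Extend it as needed for other characters in the data.
REPLACEMENTS = {
    u"º": u"o",
    u"ª": u"a",
    u"«": u'"',
    u"»": u'"',
    u"€": u"EUR",
}


def to_ascii(text, replacements=REPLACEMENTS):
    """Apply the explicit replacements first, then strip combining accents."""
    for char, substitute in replacements.items():
        text = text.replace(char, substitute)
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"1º serviço «móvil» 5€"))
```

The explicit replacements run before the normalization step so that the listed characters are mapped to something readable instead of being silently dropped by "ignore".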

regards,
   vbr
