


Re: trying to strip out non ascii.. or rather convert non ascii

Date 2013-10-26 22:07 +0100
From MRAB <python@mrabarnett.plus.com>
Subject Re: trying to strip out non ascii.. or rather convert non ascii
References <CAP16ngos=CSQuN8+dTK1Kh0d=DzQXeFRG6sMmt+AC0d3=r=Tzw@mail.gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.1606.1382821674.18130.python-list@python.org> (permalink)



On 26/10/2013 21:11, bruce wrote:
> hi..
>
> getting some files via curl, and want to convert them from what i'm
> guessing to be unicode.
>
> I'd like to convert a string like this::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcántar,
> Iliana</a></div>
>
> to::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
> Iliana</a></div>
>
> where I convert the
> " á " to " a"
>
> which appears to be a shift of 128, but I'm not sure how to accomplish this..
>
> I've tested using the different decode/encode functions using
> utf-8/ascii with no luck.
>
> I've reviewed stack overflow, as well as a few other sites, but
> haven't hit the aha moment.
>
> pointers/comments would be welcome.
>
Why do you want to do that?

The short answer is that these days you should be using Unicode, not
ASCII.

The longer answer is that you could normalise the Unicode codepoints to
the NFKD form and then discard any codepoints outside the ASCII range:

>>> import unicodedata
>>> t = unicodedata.normalize("NFKD", "Alcántar")
>>> "".join(c for c in t if ord(c) < 0x80)
'Alcantar'

The disadvantage, of course, is that it'll silently throw away every
codepoint that has no ASCII decomposition and so can't be 'converted'
(ß, ø and Ł, for example, as well as all non-Latin scripts).
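
For what it's worth, here is a minimal sketch of the same idea wrapped in a
function. The name to_ascii is made up for illustration (it is not from any
library), and it assumes Python 3 and that the bytes fetched by curl have
already been decoded to str:

```python
import unicodedata

def to_ascii(text):
    """Decompose to NFKD, then drop every codepoint outside ASCII.

    Hypothetical helper for illustration; anything without an ASCII
    decomposition is discarded rather than transliterated.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" drops the combining accents and any other
    # non-ASCII codepoints left over after decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Alcántar, Iliana"))  # Alcantar, Iliana
print(to_ascii("Straße"))            # Strae (ß has no decomposition)
```

The second call shows the data loss: the ß simply disappears instead of
becoming "ss".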

Have a look at Unidecode:

http://pypi.python.org/pypi/Unidecode
