


Re: trying to strip out non ascii.. or rather convert non ascii

Date Sat, 26 Oct 2013 22:07:58 +0100
From MRAB <python@mrabarnett.plus.com>
Subject Re: trying to strip out non ascii.. or rather convert non ascii
Newsgroups comp.lang.python
Message-ID <mailman.1606.1382821674.18130.python-list@python.org>



On 26/10/2013 21:11, bruce wrote:
> hi..
>
> getting some files via curl, and want to convert them from what i'm
> guessing to be unicode.
>
> I'd like to convert a string like this::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcántar,
> Iliana</a></div>
>
> to::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
> Iliana</a></div>
>
> where I convert the
> " á " to " a"
>
> which appears to be a shift of 128, but I'm not sure how to accomplish this..
>
> I've tested using the different decode/encode functions using
> utf-8/ascii with no luck.
>
> I've reviewed stack overflow, as well as a few other sites, but
> haven't hit the aha moment.
>
> pointers/comments would be welcome.
>
Why do you want to do that?

The short answer is that these days you should be using Unicode, not
ASCII.

The longer answer is that you could normalise the Unicode codepoints to
the NFKD form and then discard any codepoints outside the ASCII range:

>>> import unicodedata
>>> t = unicodedata.normalize("NFKD", "Alcántar")
>>> "".join(c for c in t if ord(c) < 0x80)
'Alcantar'

The disadvantage, of course, is that it'll throw away a whole lot of
codepoints that can't be 'converted'.
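
To make that drawback concrete, here is a minimal sketch, assuming
Python 3 (where string literals are Unicode). The helper names to_ascii
and dropped are made up for illustration; only unicodedata.normalize
comes from the standard library. Letters such as "Ł" and "ß" have no
decomposition into an ASCII base letter plus a combining mark, so they
vanish instead of being converted:

```python
import unicodedata

def to_ascii(text):
    """Decompose to NFKD, then drop every code point outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if ord(c) < 0x80)

def dropped(text):
    """Return the original characters that leave no ASCII trace at all."""
    return [c for c in text if not to_ascii(c)]

print(to_ascii("Alcántar, Iliana"))   # accent removed, base letter kept
print(to_ascii("Łódź, Straße"))       # 'Ł' and 'ß' are lost entirely
print(dropped("Łódź, Straße"))        # lists the characters that were lost
```

Running dropped() over the input first is one way to see how much would
be lost before committing to the conversion.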

Have a look at Unidecode:

http://pypi.python.org/pypi/Unidecode


