Re: trying to strip out non ascii.. or rather convert non ascii

Date	Sat, 26 Oct 2013 22:07:58 +0100
From	MRAB <python@mrabarnett.plus.com>
Subject	Re: trying to strip out non ascii.. or rather convert non ascii
Newsgroups	comp.lang.python
Message-ID	<mailman.1606.1382821674.18130.python-list@python.org> (permalink)

On 26/10/2013 21:11, bruce wrote:
> hi..
>
> getting some files via curl, and want to convert them from what i'm
> guessing to be unicode.
>
> I'd like to convert a string like this::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcántar,
> Iliana</a></div>
>
> to::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
> Iliana</a></div>
>
> where I convert the
> " á " to " a"
>
> which appears to be a shift of 128, but I'm not sure how to accomplish this..
>
> I've tested using the different decode/encode functions using
> utf-8/ascii with no luck.
>
> I've reviewed stack overflow, as well as a few other sites, but
> haven't hit the aha moment.
>
> pointers/comments would be welcome.
>
Why do you want to do that?

The short answer is that these days you should be using Unicode, not
ASCII.

The longer answer is that you could normalise the Unicode codepoints to
the NFKD form and then discard any codepoints outside the ASCII range:

>>> import unicodedata
>>> t = unicodedata.normalize("NFKD", "Alcántar")
>>> "".join(c for c in t if ord(c) < 0x80)
'Alcantar'

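The disadvantage, of course, is that it'll silently throw away every
codepoint that can't be 'converted', i.e. anything that doesn't
decompose to an ASCII base character (for example "ø" or "ß", or any
text in a non-Latin script).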
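
To make that loss concrete, here is a minimal sketch that wraps the two steps above in a function and runs it on a few names. It assumes Python 3 (where string literals are Unicode), and the name to_ascii is only an illustration, not part of unicodedata:

```python
import unicodedata

def to_ascii(text):
    """Normalise to NFKD, then drop every codepoint outside the ASCII range."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if ord(c) < 0x80)

# "á" decomposes to "a" plus a combining acute accent, so the "a" survives.
print(to_ascii("Alcántar, Iliana"))  # Alcantar, Iliana

# "ø" and "ß" have no decomposition, so they are dropped entirely.
print(to_ascii("Søren"))   # Sren
print(to_ascii("Straße"))  # Strae
```

Note that the characters are dropped without any error or warning, so it's worth checking the output if the input might contain more than accented Latin letters.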

Have a look at Unidecode:

http://pypi.python.org/pypi/Unidecode
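
A rough sketch of how it is used, assuming the package has been installed from PyPI (it is third-party, not part of the standard library). The import is guarded here only so that the snippet still runs where the package is missing:

```python
try:
    # Third-party package; install with: pip install Unidecode
    from unidecode import unidecode
except ImportError:
    unidecode = None  # package not available in this environment

if unidecode is not None:
    # Transliterates to the closest ASCII representation it knows of.
    print(unidecode("Alcántar, Iliana"))  # Alcantar, Iliana
```

Unlike the NFKD approach, it works from transliteration tables, so characters without a decomposition are replaced with an approximation rather than dropped.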
