From: MRAB
Date: Sat, 26 Oct 2013 22:07:58 +0100
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: trying to strip out non ascii.. or rather convert non ascii

On 26/10/2013 21:11, bruce wrote:
> hi..
>
> getting some files via curl, and want to convert them from what i'm
> guessing to be unicode.
>
> I'd like to convert a string like this::
>

>
> to::
>

>
> where I convert the
> " á " to " a"
>
> which appears to be a shift of 128, but I'm not sure how to accomplish this..
>
> I've tested using the different decode/encode functions using
> utf-8/ascii with no luck.
>
> I've reviewed stack overflow, as well as a few other sites, but
> haven't hit the aha moment.
>
> pointers/comments would be welcome.
>
Why do you want to do that?

The short answer is that you should accept that these days you should
be using Unicode, not ASCII.

The longer answer is that you could normalise the Unicode codepoints to
the NFKD form and then discard any codepoints outside the ASCII range:

>>> import unicodedata
>>> t = unicodedata.normalize("NFKD", "Alcántar")
>>> "".join(c for c in t if ord(c) < 0x80)
'Alcantar'

The disadvantage, of course, is that it'll throw away a whole lot of
codepoints that can't be 'converted'.

Have a look at Unidecode:

http://pypi.python.org/pypi/Unidecode
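
As a rough sketch of the normalise-and-discard approach, here it is wrapped
in a function, together with one case where it loses data. This assumes
Python 3, and it assumes the bytes fetched with curl are UTF-8 (check what
the server actually declares). The name strip_to_ascii is made up for
illustration; it is not part of any library.

```python
import unicodedata


def strip_to_ascii(text):
    """Decompose characters to NFKD, then drop every non-ASCII codepoint."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if ord(c) < 0x80)


# Bytes from the network have to be decoded to text first.
# UTF-8 is an assumption here, not something the original post states.
raw = "Alcántar".encode("utf-8")
print(strip_to_ascii(raw.decode("utf-8")))  # Alcantar

# "ø" has no decomposition into "o" plus a combining mark,
# so it is discarded outright rather than converted.
print(strip_to_ascii("Søren"))  # Sren
```

The second call shows the disadvantage in practice: the letter simply
disappears, which is the sort of case a transliteration table such as
Unidecode is meant to cover.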