From: Dave Angel
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: UnicodeDecodeError issue
Date: Mon, 2 Sep 2013 11:38:04 +0000 (UTC)

On 2/9/2013 00:16, Ferrous Cranus wrote:

>> Have you tried to decode those bytes in various encodings other than
>> utf-8 ?
>
> No, because i wasn't aware of what string/variable they were pertaining at.

http://pypi.python.org/pypi/chardet is a package which tries to 'guess' an encoding for a string of bytes. I happen to have the 2.7 version installed, but not the 3.x version, so the following is in 2.7. The same thing should work in 3.3, with print written as a function call.

>>> import chardet
>>> chardet.detect(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2')
{'confidence': 0.9638983132261467, 'encoding': 'windows-1253'}
>>> print b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2'.decode('windows-1253')
¶γνωστοόνομα συστήματος

I don't have a clue what it says; it's not English, and I don't know what language it is in. Does that string make any sense to you? You may want to try it on your own machine, since the email may obscure the encoding.

Or you might want to do the decode using whatever the default encoding is for that server. The Linux 'file' utility thinks this string is in ISO-8859, so you might want to try a decode('ISO-8859-1') as well (and maybe ISO-8859-2, -3, -4, and -5).

-- 
DaveA
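
The suggestion above to try several encodings by hand can be sketched as a small loop. This is only an illustration for Python 3, not something from the original exchange: try_decodings, RAW and CANDIDATES are hypothetical names, and the candidate list is just the encodings mentioned in this post. It uses only the standard library, so chardet is not needed to run it.

```python
# Sketch only: try_decodings, RAW and CANDIDATES are hypothetical names
# chosen for this example. Written for Python 3.

# The byte string from the post, split across two literals for readability.
RAW = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 '
       b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2')

# The encodings named in the post: chardet's guess plus the ISO-8859 family.
CANDIDATES = [
    'windows-1253',
    'iso-8859-1',
    'iso-8859-2',
    'iso-8859-3',
    'iso-8859-4',
    'iso-8859-5',
]


def try_decodings(data, encodings):
    """Return {encoding name: decoded text, or None if decoding failed}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: a byte has no meaning in this encoding.
            # LookupError: Python does not know an encoding by this name.
            results[name] = None
    return results


results = try_decodings(RAW, CANDIDATES)

# Print only the encoding names (plain ASCII), so this line cannot itself
# fail on a terminal with a limited output encoding.
print([name for name in CANDIDATES if results[name] is not None])
```

A successful decode does not prove the encoding is the right one: a single-byte codec such as ISO-8859-1 accepts every byte value, so the only real check is to look at results[name] for each candidate and see which one reads as sensible text.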