From: Rickard Lindberg
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: Unicode error in sax parser
Date: Wed, 9 Feb 2011 09:32:14 +0100

On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert wrote:
>> Here is a bash script to reproduce my error:
>
> Including the error message and traceback is still helpful, for future
> reference.

Thanks for pointing it out.

>>    #!/bin/sh
>>
>>    cat > å.timeline <<EOF
>>    [...]
>>    EOF
>>
>>    python <<EOF
>>    # encoding: utf-8
>>    from xml.sax import parse
>>    from xml.sax.handler import ContentHandler
>>    parse(u"å.timeline", ContentHandler())
>>    EOF
>>
>> If I instead do
>>
>>    parse(u"å.timeline".encode("utf-8"), ContentHandler())
>>
>> the script runs without errors.
>>
>> Is this a bug or expected behavior?
>
> Bug; open() figures out the filesystem encoding just fine.
> Bug tracker to report the issue to: http://bugs.python.org/
>
> Workaround:
> parse(open(u"å.timeline", 'r'), ContentHandler())

When I tried your workaround, I still got this error:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py", line 31, in parse
    parser.parse(filename_or_stream)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 121, in prepareParser
    self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)

The open(..) part works fine, but there still seems to be a problem inside
the sax parser.

-- 
Rickard Lindberg
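
A possible way around the remaining failure, as a rough sketch only. It assumes the error comes solely from the system id that the parser derives from the open file's name, which is what the SetBase(source.getSystemId()) frame in the traceback suggests. The idea is to wrap the open file in an InputSource and leave its system id unset, so the non-ASCII name never reaches the parser. The handler class, the helper names, and the <timeline> file contents are made up for illustration, and this is untested against the _xmlplus package shown in the traceback.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile
from xml.sax import parse
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


class ElementNameHandler(ContentHandler):
    """Hypothetical handler: records the names of the elements it sees."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_without_system_id(path, handler):
    """Parse the file at `path` without passing its name to the SAX parser."""
    with io.open(path, "rb") as stream:
        source = InputSource()        # system id stays None
        source.setByteStream(stream)  # parser reads bytes from the open file
        parse(source, handler)


def demo():
    """Create a file with a non-ASCII name, parse it, return element names."""
    directory = tempfile.mkdtemp()
    try:
        path = os.path.join(directory, u"\xe5.timeline")
        with io.open(path, "wb") as f:
            # Assumed contents; the original file body is not in the post.
            f.write(b"<timeline></timeline>")
        handler = ElementNameHandler()
        parse_without_system_id(path, handler)
        return handler.names
    finally:
        shutil.rmtree(directory)
```

Calling demo() writes the file to a temporary directory, parses it, and returns the element names collected by the handler. One trade-off: with no system id set, the parser has no base URI for resolving relative external entities, so this only fits documents that do not rely on them.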