From: Philipp Hagemeister
Newsgroups: comp.lang.python
Subject: Re: catch UnicodeDecodeError
Date: Thu, 26 Jul 2012 14:17:44 +0200
To: Stefan Behnel
Cc: python-list@python.org

On 07/26/2012 01:15 PM, Stefan Behnel wrote:
>> exits with a UnicodeDecodeError.
> ... that tells you the exact code line where the error occurred.

Which property of a UnicodeDecodeError includes that information? On
CPython 2.7 and 3.2, I see only start and end, both of which are byte
offsets into the input (the number of bytes read so far), not line
numbers.

I used the following test script:

    e = None
    try:
        b'a\xc3\xa4\nb\xff0'.decode('utf-8')
    except UnicodeDecodeError as ude:
        e = ude
    print(e.start)  # 5 for this input, 3 for the input b'a\nb\xff0'
    print(dir(e))

But even if you could somehow determine a line number, this would only
work if the actual encoding uses 0x0a for newline. Most encodings (101
out of 108 applicable ones in CPython 3.2) do include 0x0a in their
representation of '\n', but multi-byte encodings routinely include 0x0a
bytes in their representation of non-newline characters. Therefore, the
most you can do is calculate an upper bound for the line number.

- Philipp
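
A minimal sketch of the upper-bound calculation described above, not a
definitive implementation. The helper name error_line_upper_bound is
hypothetical. It assumes the encoding represents '\n' with at least one
0x0a byte, which does not hold for every codec, and it ignores bare '\r'
line endings. It counts the 0x0a bytes in front of the start offset of
the UnicodeDecodeError:

```python
def error_line_upper_bound(data, encoding="utf-8"):
    """Return None if data decodes cleanly, else (byte_offset, max_line).

    max_line is only an upper bound on the 1-based line number of the
    failure: every 0x0a byte before the error is counted, whether or not
    it really belongs to a newline character.
    Assumption: the encoding emits a 0x0a byte for each '\\n'.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as ude:
        # ude.object is the input being decoded; ude.start is the byte
        # offset of the first undecodable byte.
        newline_bytes = ude.object.count(b"\n", 0, ude.start)
        return ude.start, newline_bytes + 1
    return None


# The two inputs from the test script above.
print(error_line_upper_bound(b"a\xc3\xa4\nb\xff0"))  # (5, 2)
print(error_line_upper_bound(b"a\nb\xff0"))          # (3, 2)

# UTF-16-LE: U+0A0A encodes as the bytes 0a 0a, so the bound over-counts.
# The trailing lone high surrogate (00 d8) triggers the decode error.
broken = "\u0a0a".encode("utf-16-le") + b"\x00\xd8"
print(error_line_upper_bound(broken, "utf-16-le"))   # (2, 3), real line is 1
```

For UTF-8 the bound is exact, because 0x0a never occurs inside a
multi-byte sequence. For the UTF-16 input it reports line 3 even though
the data contains no newline at all.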