From: Mark Lawrence
To: python-list@python.org
Subject: Re: Opaque error message on UTF-8 decode
Date: Sun, 08 Mar 2015 21:23:50 +0000
Newsgroups: comp.lang.python

On 08/03/2015 21:15, Chris Angelico wrote:
> >>> b"\xed\xb4\x80".decode()
> Traceback (most recent call last):
>   File "", line 1, in
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position
> 0: invalid continuation byte
>
> But 0xED is not a continuation byte, it's a start byte. And it's a
> perfectly valid one:
>
> >>> b"\xed\x9f\xbf".decode()
> '\ud7ff'
>
> Pike is more explicit about what the problem is:
>
> > utf8_to_string("\xed\xb4\x80");
> UTF-8 sequence beginning with 0xed 0xb4 at index 0 would decode to a
> UTF-16 surrogate character.
>
> Is this something where Python's error message could do with
> improvement, or is it not worth the hassle? Should I raise a tracker
> issue about this?
>
> ChrisA
>

I'd raise an issue so there's a formal record that we can refer to in
the future. Besides, what's one issue like this compared to the "Python
can't do decimal sums properly" one, which gets raised every few months
by newbies :)

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
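
For anyone following along, here is a rough sketch of the kind of more explicit message being asked for. It is not what CPython does, and `explain_utf8_failure` is a hypothetical helper name invented for this illustration. It relies only on the `start` and `reason` attributes of `UnicodeDecodeError` and on the fact that a 0xED lead byte followed by a byte in 0xA0..0xBF would encode a code point in the surrogate range U+D800..U+DFFF.

```python
def explain_utf8_failure(data: bytes) -> str:
    """Hypothetical helper: say why `data` fails strict UTF-8 decoding."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        start = exc.start
        lead = data[start]
        has_next = start + 1 < len(data)
        # 0xED followed by 0xA0..0xBF would encode U+D800..U+DFFF,
        # which is reserved for UTF-16 surrogates and so is not valid UTF-8.
        if lead == 0xED and has_next and 0xA0 <= data[start + 1] <= 0xBF:
            second = data[start + 1]
            return (
                f"sequence beginning with 0x{lead:02x} 0x{second:02x} "
                f"at index {start} would decode to a UTF-16 surrogate"
            )
        # Otherwise fall back to the codec's own description.
        return f"{exc.reason} at index {start}"
    return "valid UTF-8"


print(explain_utf8_failure(b"\xed\x9f\xbf"))  # the valid case from the post
print(explain_utf8_failure(b"\xed\xb4\x80"))  # the surrogate case

# The standard "surrogatepass" error handler shows what the bytes
# would have decoded to if surrogates were allowed through.
print(ascii(b"\xed\xb4\x80".decode("utf-8", "surrogatepass")))
```

The last line uses `ascii()` so the lone surrogate is shown as an escape rather than written raw to the terminal, which can itself fail to encode.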