From: Chris Angelico
Date: Mon, 11 May 2015 02:06:56 +1000
Subject: Re: Why does unicode-escape decode escape symbols that are already escaped?
Newsgroups: comp.lang.python

On Mon, May 11, 2015 at 1:53 AM, Somelauw . wrote:
> In Python 3, decoding "€" with unicode-escape returns 'â\x82¬' which in my
> opinion doesn't make sense.
> The € already is decoded; if it were encoded it would look like this:
> '\u20ac'.
> So why is it doing this?
>
> In Python 2 the behaviour is similar, but slightly different.
>
> $ python3 -S
> Python 3.3.3 (default, Nov 27 2013, 17:12:35)
> [GCC 4.8.2] on linux
> >>> import codecs
> >>> codecs.decode('€', 'unicode-escape')
> 'â\x82¬'
> >>> codecs.encode('€', 'unicode-escape')
> b'\\u20ac'
> >>>

Whenever you start encoding and decoding, you need to know whether you're working with bytes->text, text->bytes, or something else. In the case of unicode-escape, it expects to encode text into bytes, as you can see with your second example: you give it a Unicode string, and get back a byte string.

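To make the direction concrete, here is a small sketch (mine, not from the original question) of the round trip using the str.encode() and bytes.decode() methods. The variable names are purely illustrative.

```python
# Encoding goes text -> bytes; decoding goes bytes -> text.
text = '\u20ac'  # the one-character string containing the euro sign

# str.encode() with unicode-escape produces an ASCII-only byte string.
escaped = text.encode('unicode-escape')
print(type(escaped).__name__, escaped)

# bytes.decode() with the same codec turns the escape back into text.
restored = escaped.decode('unicode-escape')
print(type(restored).__name__, restored == text)
```

Each step has an unambiguous input type, which is what makes the direction easy to keep track of.
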
When you attempt to *decode* a Unicode string, that doesn't actually make sense, so it first gets *encoded* to bytes before being decoded. What you're actually seeing there is that the one-character string is being encoded into a three-byte UTF-8 sequence, and then the unicode-escape decode takes those bytes and, since none of them is part of a backslash escape, interprets each one as a character; as it happens, that's equivalent to a Latin-1 decode:

>>> '€'.encode('utf-8').decode('latin-1')
'â\x82¬'

I strongly suggest leaving the codecs module aside, and working exclusively with the str.encode() and bytes.decode() methods, if you possibly can. If you can't, at the very least keep track in your head of what is text and what is bytes, and which way things change in every transformation.

ChrisA
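
For anyone who wants to check the equivalence described above, a minimal sketch follows. It assumes the input contains no backslashes, so the codec has no escape sequences to interpret; the variable names are illustrative only.

```python
import codecs

text = '\u20ac'  # the one-character string containing the euro sign

# Step 1: the implicit encode, spelled out. UTF-8 needs three bytes here.
utf8_bytes = text.encode('utf-8')

# Step 2: each byte becomes one character, as a Latin-1 decode would do.
via_latin1 = utf8_bytes.decode('latin-1')

# The call from the question, handed a str instead of bytes.
via_codecs = codecs.decode(text, 'unicode-escape')

print(utf8_bytes)
print(ascii(via_latin1))
print(via_codecs == via_latin1)
```

Spelling out the two steps with explicit methods shows where the bytes come from, which the single codecs.decode() call hides.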