Re: Why does unicode-escape decode escape symbols that are already escaped?

References <CA+gt_a82WGXHUZhcdbTUWG+TRV1Ys1ZSrkGjOxgavZGjAh9FiQ@mail.gmail.com>
Date 2015-05-11 02:06 +1000
Subject Re: Why does unicode-escape decode escape symbols that are already escaped?
From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.313.1431274025.12865.python-list@python.org> (permalink)



On Mon, May 11, 2015 at 1:53 AM, Somelauw . <somelauw@gmail.com> wrote:
> In Python 3, decoding "€" with unicode-escape returns 'â\x82¬' which in my
> opinion doesn't make sense.
> The € already is decoded; if it were encoded it would look like this:
> '\u20ac'.
> So why is it doing this?
>
> In Python 2 the behaviour is similar, but slightly different.
>
> $ python3 -S
> Python 3.3.3 (default, Nov 27 2013, 17:12:35)
> [GCC 4.8.2] on linux
> >>> import codecs
> >>> codecs.decode('€', 'unicode-escape')
> 'â\x82¬'
> >>> codecs.encode('€', 'unicode-escape')
> b'\\u20ac'
> >>>

Whenever you start encoding and decoding, you need to know whether
you're working with bytes->text, text->bytes, or something else. In
the case of unicode-escape, it expects to encode text into bytes, as
you can see with your second example - you give it a Unicode string,
and get back a byte string. When you attempt to *decode* a Unicode
string, that doesn't actually make sense, so it first gets *encoded*
to bytes, before being decoded. What you're actually seeing there is
that the one-character string is being encoded into a three-byte UTF-8
sequence, and then the unicode-escape decode takes those bytes and
interprets them as characters; as it happens, that's equivalent to a
Latin-1 decode:

>>> '€'.encode('utf-8').decode('latin-1')
'â\x82¬'
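
To make the hidden step visible, here is a small sketch that performs the UTF-8 encode explicitly and compares the results. The variable names are illustrative and not from the original post, and it assumes a Python 3 where codecs.decode() still accepts a str for this codec, as in the session quoted above.

```python
import codecs

text = "€"

# Step 1: the implicit conversion, done explicitly.
utf8_bytes = text.encode("utf-8")  # b'\xe2\x82\xac'

# Step 2: decode those bytes three ways.
via_codecs = codecs.decode(text, "unicode-escape")    # str in, encoded behind the scenes
via_bytes = utf8_bytes.decode("unicode-escape")       # bytes in, no hidden step
via_latin1 = utf8_bytes.decode("latin-1")             # plain byte-to-code-point mapping

print(repr(via_codecs), repr(via_bytes), repr(via_latin1))
```

All three results should be the same three-character string, because the bytes contain no backslash escapes for unicode-escape to interpret.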

I strongly suggest leaving the codecs module aside, and working
exclusively with the str.encode() and bytes.decode() methods, if you
possibly can. If you can't, at the very least keep track in your head of
what is text and what is bytes, and which way things change in every
transformation.
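
As a minimal sketch of that advice, assuming the goal is simply to go from text to its backslash-escaped ASCII form and back, the round trip with the two methods looks like this (names are illustrative):

```python
text = "€"

# text -> bytes: str.encode() always returns bytes.
escaped = text.encode("unicode-escape")   # b'\\u20ac'

# bytes -> text: bytes.decode() always returns str.
restored = escaped.decode("unicode-escape")

print(type(escaped).__name__, escaped)
print(type(restored).__name__, restored)
```

Because each method exists on only one of the two types, it is harder to apply a transformation in the wrong direction by accident.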

ChrisA
