Re: "More About Unicode in Python 2 and 3"

Started by	Chris Angelico <rosuav@gmail.com>
First post	2014-01-06 13:55 +1100
Last post	2014-01-07 10:06 +1100
Articles	14 — 6 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-06 13:55 +1100
    Re: "More About Unicode in Python 2 and 3" Roy Smith <roy@panix.com> - 2014-01-05 23:24 -0500
      Re: "More About Unicode in Python 2 and 3" Tim Chase <python.list@tim.thechases.com> - 2014-01-05 22:41 -0600
        Re: "More About Unicode in Python 2 and 3" Roy Smith <roy@panix.com> - 2014-01-05 23:49 -0500
          Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-06 15:59 +1100
      Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-06 15:51 +1100
      Re: "More About Unicode in Python 2 and 3" Tim Chase <python.list@tim.thechases.com> - 2014-01-06 05:49 -0600
      Re: "More About Unicode in Python 2 and 3" Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-07 03:24 +1100
        Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-07 03:30 +1100
      Re: "More About Unicode in Python 2 and 3" Serhiy Storchaka <storchaka@gmail.com> - 2014-01-06 22:20 +0200
      Re: "More About Unicode in Python 2 and 3" Serhiy Storchaka <storchaka@gmail.com> - 2014-01-06 22:21 +0200
      Re: "More About Unicode in Python 2 and 3" Tim Chase <python.list@tim.thechases.com> - 2014-01-06 14:42 -0600
      Re: "More About Unicode in Python 2 and 3" Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-06 20:47 +0000
      Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-07 10:06 +1100

#63262 — Re: "More About Unicode in Python 2 and 3"

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-06 13:55 +1100
Subject	Re: "More About Unicode in Python 2 and 3"
Message-ID	<mailman.5001.1388976943.18130.python-list@python.org>

On Mon, Jan 6, 2014 at 1:23 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
> The metadata fields are simple ascii, and in Py2 something like `if
> header[FIELD_TYPE] == 'C'` did the job just fine.  In Py3 that compares an
> int (67) to the unicode letter 'C' and returns False.  For me this is simply
> a major annoyance, but I only have a handful of places where I have to deal
> with this.  Dealing with protocols where bytes is the norm and embedded
> ascii is prevalent -- well, I can easily imagine the nightmare.

It can't be both things. It's either bytes or it's text. If it's text,
then decoding it as ascii will give you a Unicode string; if it's
small unsigned integers that just happen to correspond to ASCII
values, then I would say the right thing to do is integer constants -
or, in Python 3.4, an integer enumeration:

>>> import socket
>>> socket.AF_INET
<AddressFamily.AF_INET: 2>
>>> socket.AF_INET == 2
True

I'm not sure what FIELD_TYPE of 'C' means, but my guess is that it's a
CHAR field. I'd just have that as the name, something like:

CHAR = b'C'[0]

if header[FIELD_TYPE] == CHAR:
    # handle char field

If nothing else, this would reduce the number of places where you
actually have to handle this. Plus, the code above will work on many
versions of Python (I'm not sure how far back the b'' prefix is
allowed - probably 2.6).
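
A minimal sketch of that idea, assuming a made-up header (the header value and the FIELD_TYPE offset below are illustrative, not Ethan's actual layout). It also shows slicing as a second spelling, since a one-byte slice is still bytes on every version:

```python
# Hypothetical header bytes; the layout and the FIELD_TYPE offset are assumptions.
header = b'\x00\x01\x02C'
FIELD_TYPE = 3

# Indexing bytes yields an int on Python 3 (and a 1-char str on Python 2),
# so building the constant the same way keeps the comparison valid on both.
CHAR = b'C'[0]
is_char = header[FIELD_TYPE] == CHAR

# Slicing always yields bytes, so it can be compared to a bytes literal directly.
is_char_slice = header[FIELD_TYPE:FIELD_TYPE + 1] == b'C'
```

Either form keeps the "what does 'C' mean" knowledge in one named place.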

ChrisA

#63265

From	Roy Smith <roy@panix.com>
Date	2014-01-05 23:24 -0500
Message-ID	<roy-7ED5DF.23241105012014@news.panix.com>
In reply to	#63262

In article <mailman.5001.1388976943.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> It can't be both things. It's either bytes or it's text. 

I've never used Python 3, so forgive me if these are naive questions.  
Let's say you had an input stream which contained the following hex 
values:

$ hexdump data
0000000 d7 a8 a3 88 96 95

That's EBCDIC for "Python".  What would I write in Python 3 to read that 
file and print it back out as utf-8 encoded Unicode?

Or, how about a slightly different example:

$ hexdump data
0000000 43 6c 67 75 62 61

That's "Python" in rot-13 encoded ascii.  How would I turn that into 
cleartext Unicode in Python 3?

#63267

From	Tim Chase <python.list@tim.thechases.com>
Date	2014-01-05 22:41 -0600
Message-ID	<mailman.5004.1388983234.18130.python-list@python.org>
In reply to	#63265

On 2014-01-05 23:24, Roy Smith wrote:
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
> 
> That's EBCDIC for "Python".  What would I write in Python 3 to read
> that file and print it back out as utf-8 encoded Unicode?
> 
> Or, how about a slightly different example:
> 
> $ hexdump data
> 0000000 43 6c 67 75 62 61
> 
> That's "Python" in rot-13 encoded ascii.  How would I turn that
> into cleartext Unicode in Python 3?


tim@laptop$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s1 = b'\xd7\xa8\xa3\x88\x96\x95'
>>> s1.decode('ebcdic-cp-be')
'Python'
>>> s2 = b'\x43\x6c\x67\x75\x62\x61'
>>> from codecs import getencoder
>>> getencoder("rot-13")(s2.decode('utf-8'))[0]
'Python'

-tkc

#63268

From	Roy Smith <roy@panix.com>
Date	2014-01-05 23:49 -0500
Message-ID	<roy-13C7CE.23494705012014@news.panix.com>
In reply to	#63267

In article <mailman.5004.1388983234.18130.python-list@python.org>,
 Tim Chase <python.list@tim.thechases.com> wrote:

> On 2014-01-05 23:24, Roy Smith wrote:
> > $ hexdump data
> > 0000000 d7 a8 a3 88 96 95
> > 
> > That's EBCDIC for "Python".  What would I write in Python 3 to read
> > that file and print it back out as utf-8 encoded Unicode?
> > 
> > Or, how about a slightly different example:
> > 
> > $ hexdump data
> > 0000000 43 6c 67 75 62 61
> > 
> > That's "Python" in rot-13 encoded ascii.  How would I turn that
> > into cleartext Unicode in Python 3?
> 
> 
> tim@laptop$ python3
> Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
> [GCC 4.7.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s1 = b'\xd7\xa8\xa3\x88\x96\x95'
> >>> s1.decode('ebcdic-cp-be')
> 'Python'
> >>> s2 = b'\x43\x6c\x67\x75\x62\x61'
> >>> from codecs import getencoder
> >>> getencoder("rot-13")(s2.decode('utf-8'))[0]
> 'Python'
> 
> -tkc

Thanks.  But, I see I didn't formulate my problem statement well.  I was 
(naively) assuming there wouldn't be a built-in codec for rot-13.  Let's 
assume there isn't; I was trying to find a case where you had to treat 
the data as integers in one place and text in another.  How would you do 
that?

#63270

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-06 15:59 +1100
Message-ID	<mailman.5006.1388984378.18130.python-list@python.org>
In reply to	#63268

On Mon, Jan 6, 2014 at 3:49 PM, Roy Smith <roy@panix.com> wrote:
> Thanks.  But, I see I didn't formulate my problem statement well.  I was
> (naively) assuming there wouldn't be a built-in codec for rot-13.  Let's
> assume there isn't; I was trying to find a case where you had to treat
> the data as integers in one place and text in another.  How would you do
> that?

I assumed that you would have checked that one, and answered
accordingly :) Though I did dig into the EBCDIC part of the question.

My thinking is that, if you're working with integers, you probably
mean either bytes (so encode it before you do stuff - typical for
crypto) or codepoints / Unicode ordinals (so use ord()/chr()). In
other languages there are ways to treat strings as though they were
arrays of integers (lots of C-derived languages treat 'a' as 97 and
"a"[0] as 97 also; some extend this to the full Unicode range), and
even there, I almost never actually use that identity much. There's
only one case that I can think of where I did a lot of
string<->integer-array transmutation, and that was using a diff
function that expected an integer array - if the transformation to and
from strings hadn't been really easy, that function would probably
have been written to take strings.

The Py2 str.translate() method was a little clunky to use, but
presumably fast to execute - you build up a lookup table and translate
through that. The Py3 equivalent takes a dict mapping the from and to
values. Pretty easy to use. And it lets you work with codepoints or
strings, as you please.
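
A small sketch of the Py3 form, with an arbitrary made-up mapping. The keys of the table are code points; the values can be code points, strings, or None (which deletes the character):

```python
# Keys must be code points (ints); values may be code points, strings, or None.
table = {
    ord('a'): ord('4'),   # code point -> code point
    ord('s'): 'ss',       # code point -> string (may be longer than one char)
    ord('-'): None,       # None deletes the character
}
result = 'pass-a-test'.translate(table)

# str.maketrans builds the same kind of table from one-character string keys.
same = 'pass-a-test'.translate(str.maketrans({'a': '4', 's': 'ss', '-': None}))
```

Here result and same both come out as 'p4ssss4tesst'.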

ChrisA

#63269

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-06 15:51 +1100
Message-ID	<mailman.5005.1388983873.18130.python-list@python.org>
In reply to	#63265

On Mon, Jan 6, 2014 at 3:24 PM, Roy Smith <roy@panix.com> wrote:
> I've never used Python 3, so forgive me if these are naive questions.
> Let's say you had an input stream which contained the following hex
> values:
>
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
>
> That's EBCDIC for "Python".  What would I write in Python 3 to read that
> file and print it back out as utf-8 encoded Unicode?

*deletes the two paragraphs that used to be here* Turns out Python 3
_does_ have an EBCDIC decoder... but it's not called EBCDIC.

>>> b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500")
'Python'

This sounds like a good one for getting an alias, either "ebcdic" or
"EBCDIC". I didn't know that this was possible till I googled the
problem and saw someone else's solution.

To print that out as UTF-8, just decode and then encode:

>>> b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500").encode("utf-8")
b'Python'

In the specific case of files on the disk, you could open them with
encodings specified, in which case you don't need to worry about the
details.

with open("data",encoding="cp500") as infile:
    with open("data_utf8","w",encoding="utf-8") as outfile:
        outfile.write(infile.read())

Of course, this is assuming that Unicode has a perfect mapping for
every EBCDIC character. I'm not familiar enough with EBCDIC to be sure
that that's true, but I strongly suspect it is. And if it's not,
you'll get an exception somewhere along the way, so you'll know
something's gone wrong. (In theory, a "transcode" function might be
able to give you a warning before it even sees your data -
transcode("utf-8", "iso-8859-3") could alert you to the possibility
that not everything in the source character set can be encoded. But
that's a pretty esoteric requirement.)
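
To illustrate the "you'll get an exception" part, here is a rough sketch. The snowman character is just an arbitrary example of something ISO-8859-3 cannot represent:

```python
text = b'\xd7\xa8\xa3\x88\x96\x95'.decode('cp500')   # 'Python'

# Every character here is ASCII, so re-encoding to UTF-8 cannot fail.
utf8 = text.encode('utf-8')

# A character outside the target charset raises instead of being silently lost.
try:
    'Python \u2603'.encode('iso-8859-3')   # U+2603 SNOWMAN is not in Latin-3
    failed = False
except UnicodeEncodeError:
    failed = True
```

The default error handler is 'strict'; passing errors='replace' to encode() would substitute '?' instead of raising.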

> Or, how about a slightly different example:
>
> $ hexdump data
> 0000000 43 6c 67 75 62 61
>
> That's "Python" in rot-13 encoded ascii.  How would I turn that into
> cleartext Unicode in Python 3?

That's one of the points that's under dispute. Is rot13 a
bytes<->bytes encoding, or is it str<->str, or is it bytes<->str? The
issue isn't clear. Personally, I think it makes good sense as a
str<->str translation, which would mean that the process would be
somewhat thus:

>>> rot13={}
>>> for i in range(13):
        rot13[65+i]=65+i+13
        rot13[65+i+13]=65+i
        rot13[97+i]=97+i+13
        rot13[97+i+13]=97+i

>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to turn a hex dump into a bytes literal?
>>> data.decode().translate(rot13)
'Python'

This is treating rot13 as a translation of Unicode codepoints to other
Unicode codepoints, which is different from an encode operation (which
takes abstract Unicode data and produces concrete bytes) or a decode
operation (which does the reverse). But this is definitely a grey
area. It's common for cryptographic algorithms to work with bytes,
meaning that their "decoded" text is still bytes. (Or even less than
bytes. The famous Enigma machines from World War II worked with the 26
letters as their domain and range.) Should the Python codecs module
restrict itself to the job of translating between bytes and str, or is
it a tidy place to put those other translations as well?
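
For reference, a sketch of how this looks from the codecs side, assuming Python 3.4 or later (earlier 3.x releases behave differently here). The rot_13 codec is str-to-str and is reached through the codecs module functions, while the str.encode() method only accepts text encodings:

```python
import codecs

# rot_13 is a str-to-str codec, reachable through the codecs module functions.
plain = codecs.decode('Clguba', 'rot_13')

# str.encode() is reserved for str -> bytes text encodings, so on
# Python 3.4 and later it rejects rot_13 with LookupError.
try:
    'Python'.encode('rot_13')
    rejected = False
except LookupError:
    rejected = True
```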

ChrisA

#63276

From	Tim Chase <python.list@tim.thechases.com>
Date	2014-01-06 05:49 -0600
Message-ID	<mailman.5012.1389008911.18130.python-list@python.org>
In reply to	#63265

On 2014-01-06 15:51, Chris Angelico wrote:
> >>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to
> >>> turn a hex dump into a bytes literal?

Depends on how you source them:

# space separated:
>>> s1 = "43 6c 67 75 62 61"
>>> ''.join(chr(int(pair, 16)) for pair in s1.split())
'Clguba'

# all smooshed together:
>>> s2 = s1.replace(' ','')
>>> s2
'436c67756261'
>>> ''.join(chr(int(s2[i*2:(i+1)*2], 16)) for i in range(len(s2)//2))
'Clguba'

# as \xHH escaped:
>>> s3 = ''.join('\\x'+s2[i*2:(i+1)*2] for i in range(len(s2)//2))
>>> print(s3)
\x43\x6c\x67\x75\x62\x61
>>> b3 = s3.encode('ascii')
>>> print(b3)
b'\\x43\\x6c\\x67\\x75\\x62\\x61'
>>> b3.decode('unicode_escape')
'Clguba'

It might get more complex if you're not just dealing with bytes, or
if you have some other encoding scheme, but "s1" (space-separated, or
some other delimiter such as colon-separated that can be passed
to the .split() call) and "s2" (all smooshed together) are the two I
encounter most frequently.
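
Both shapes can also be folded into one helper that returns real bytes rather than str. This is only a sketch: hex_to_bytes is a made-up name, and it leans on binascii.unhexlify from the standard library, which is available on 2.x as well as 3.x:

```python
import binascii

def hex_to_bytes(dump, sep=None):
    """Turn a delimited hex dump into bytes; sep=None splits on whitespace."""
    pairs = dump.split(sep)
    # unhexlify wants an even number of hex digits with no separators.
    return binascii.unhexlify(''.join(pairs).encode('ascii'))

spaced = hex_to_bytes('43 6c 67 75 62 61')
coloned = hex_to_bytes('43:6c:67:75:62:61', sep=':')
smooshed = hex_to_bytes('436c67756261')
```

All three calls give b'Clguba'.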

-tkc

#63293

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-01-07 03:24 +1100
Message-ID	<52cad8b4$0$29984$c3e8da3$5496439d@news.astraweb.com>
In reply to	#63265

Roy Smith wrote:

> In article <mailman.5001.1388976943.18130.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
> 
>> It can't be both things. It's either bytes or it's text.
> 
> I've never used Python 3, so forgive me if these are naive questions.
> Let's say you had an input stream which contained the following hex
> values:
> 
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
> 
> That's EBCDIC for "Python".  What would I write in Python 3 to read that
> file and print it back out as utf-8 encoded Unicode?

There's no one EBCDIC encoding. Like the so-called "extended ASCII"
or "ANSI" encodings that followed, IBM had many different versions of
EBCDIC customised for different machines and markets -- only even more
poorly documented. But since the characters in that are all US English
letters, any EBCDIC dialect ought to do it:

py> b = b'\xd7\xa8\xa3\x88\x96\x95'
py> b.decode('CP500')
'Python'

To read it from a file:

text = open("somefile", encoding='CP500').read()

And to print out the UTF-8 encoded bytes:

print(text.encode('utf-8'))

> Or, how about a slightly different example:
> 
> $ hexdump data
> 0000000 43 6c 67 75 62 61
> 
> That's "Python" in rot-13 encoded ascii.  How would I turn that into
> cleartext Unicode in Python 3?

In Python 3.3, you can do this:

py> b = b'\x43\x6c\x67\x75\x62\x61'
py> s = b.decode('ascii')
py> print(s)
Clguba
py> import codecs
py> codecs.decode(s, 'rot-13')
'Python'

(This may not work in Python 3.1 or 3.2, since rot13 and assorted other
string-to-string and byte-to-byte codecs were mistakenly removed. I say
mistakenly, not in the sense of "by accident", but in the sense of "it was
an error of judgement". Somebody was under the misapprehension that the
codec machinery could only work on Unicode <-> bytes.)
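
As a sketch of the byte-to-byte side of the same machinery, hex_codec is one of the codecs that never touches str at all. This assumes a Python 3 release where those codecs are present:

```python
import codecs

# hex_codec maps bytes to bytes in both directions.
hexed = codecs.encode(b'Python', 'hex_codec')
raw = codecs.decode(hexed, 'hex_codec')
```

Here hexed is b'507974686f6e' and raw round-trips back to b'Python'.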

If you don't want to use the codec, you can do it by hand:

def rot13(astring):
    result = []
    for c in astring:
        i = ord(c)
        if ord('a') <= i <= ord('m') or ord('A') <= i <= ord('M'):
            i += 13
        elif ord('n') <= i <= ord('z') or ord('N') <= i <= ord('Z'):
            i -= 13
        result.append(chr(i))
    return ''.join(result)

But why would you want to do it the slow way?

-- 
Steven

#63295

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-07 03:30 +1100
Message-ID	<mailman.5027.1389025821.18130.python-list@python.org>
In reply to	#63293

On Tue, Jan 7, 2014 at 3:24 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> If you don't want to use the codec, you can do it by hand:
>
> def rot13(astring):
>     result = []
>     for c in astring:
>         i = ord(c)
>         if ord('a') <= i <= ord('m') or ord('A') <= i <= ord('M'):
>             i += 13
>         elif ord('n') <= i <= ord('z') or ord('N') <= i <= ord('Z'):
>             i -= 13
>         result.append(chr(i))
>     return ''.join(result)
>
> But why would you want to do it the slow way?

Eww. I'd much rather use .translate() than that :)

ChrisA

#63333

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2014-01-06 22:20 +0200
Message-ID	<mailman.5053.1389039628.18130.python-list@python.org>
In reply to	#63265

06.01.14 06:51, Chris Angelico wrote:
>>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to turn a hex dump into a bytes literal?

 >>> bytes.fromhex('43 6c 67 75 62 61')
b'Clguba'

#63334

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2014-01-06 22:21 +0200
Message-ID	<mailman.5054.1389039907.18130.python-list@python.org>
In reply to	#63265

06.01.14 06:41, Tim Chase wrote:
>>>> from codecs import getencoder
>>>> getencoder("rot-13")(s2.decode('utf-8'))[0]
> 'Python'

codecs.decode(s2.decode(), 'rot13')

#63337

From	Tim Chase <python.list@tim.thechases.com>
Date	2014-01-06 14:42 -0600
Message-ID	<mailman.5057.1389040883.18130.python-list@python.org>
In reply to	#63265

On 2014-01-06 22:20, Serhiy Storchaka wrote:
> >>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to
> >>>> turn a hex dump into a bytes literal?  
> 
>  >>> bytes.fromhex('43 6c 67 75 62 61')  
> b'Clguba'

Very nice new functionality in Py3k, but 2.x doesn't seem to have such
a method. :-(

-tkc

#63339

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-01-06 20:47 +0000
Message-ID	<mailman.5059.1389041257.18130.python-list@python.org>
In reply to	#63265

On 06/01/2014 20:42, Tim Chase wrote:
> On 2014-01-06 22:20, Serhiy Storchaka wrote:
>>>>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to
>>>>>> turn a hex dump into a bytes literal?
>>
>>   >>> bytes.fromhex('43 6c 67 75 62 61')
>> b'Clguba'
>
> Very nice new functionality in Py3k, but 2.x doesn't seem to have such
> a method. :-(
>
> -tkc
>

Seems like another mistake, that'll have to be regressed to make sure 
there is Python 2 and Python 3 compatibility, which can then be 
reintroduced into Python 2.8 so that it gets back into Python 3.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence

#63370

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-07 10:06 +1100
Message-ID	<mailman.5088.1389049571.18130.python-list@python.org>
In reply to	#63265

On Tue, Jan 7, 2014 at 7:42 AM, Tim Chase <python.list@tim.thechases.com> wrote:
> On 2014-01-06 22:20, Serhiy Storchaka wrote:
>> >>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to
>> >>>> turn a hex dump into a bytes literal?
>>
>>  >>> bytes.fromhex('43 6c 67 75 62 61')
>> b'Clguba'
>
> Very nice new functionality in Py3k, but 2.x doesn't seem to have such
> a method. :-(

Thanks, Serhiy. Very nice new functionality indeed, and not having it
in 2.x isn't a problem to me. That's exactly what I was looking for -
it doesn't insist on (or complain about) separators between bytes.
(Though the error from putting a space _inside_ a byte is a little
confusing. But that's trivial.)
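
A quick sketch of both behaviours, using '4 3' as an arbitrary example of a byte split by a space:

```python
with_spaces = bytes.fromhex('43 6c 67 75 62 61')
without = bytes.fromhex('436c67756261')

# A space that splits the two hex digits of one byte is rejected.
try:
    bytes.fromhex('4 3')
    split_ok = True
except ValueError:
    split_ok = False
```

The first two calls agree, and the third raises ValueError.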

ChrisA
