| Started by | Chris Angelico <rosuav@gmail.com> |
|---|---|
| First post | 2014-01-06 13:55 +1100 |
| Last post | 2014-01-07 10:06 +1100 |
| Articles | 14 — 6 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-06 13:55 +1100
Re: "More About Unicode in Python 2 and 3" Roy Smith <roy@panix.com> - 2014-01-05 23:24 -0500
Re: "More About Unicode in Python 2 and 3" Tim Chase <python.list@tim.thechases.com> - 2014-01-05 22:41 -0600
Re: "More About Unicode in Python 2 and 3" Roy Smith <roy@panix.com> - 2014-01-05 23:49 -0500
Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-06 15:59 +1100
Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-06 15:51 +1100
Re: "More About Unicode in Python 2 and 3" Tim Chase <python.list@tim.thechases.com> - 2014-01-06 05:49 -0600
Re: "More About Unicode in Python 2 and 3" Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-07 03:24 +1100
Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-07 03:30 +1100
Re: "More About Unicode in Python 2 and 3" Serhiy Storchaka <storchaka@gmail.com> - 2014-01-06 22:20 +0200
Re: "More About Unicode in Python 2 and 3" Serhiy Storchaka <storchaka@gmail.com> - 2014-01-06 22:21 +0200
Re: "More About Unicode in Python 2 and 3" Tim Chase <python.list@tim.thechases.com> - 2014-01-06 14:42 -0600
Re: "More About Unicode in Python 2 and 3" Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-06 20:47 +0000
Re: "More About Unicode in Python 2 and 3" Chris Angelico <rosuav@gmail.com> - 2014-01-07 10:06 +1100
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-06 13:55 +1100 |
| Subject | Re: "More About Unicode in Python 2 and 3" |
| Message-ID | <mailman.5001.1388976943.18130.python-list@python.org> |
On Mon, Jan 6, 2014 at 1:23 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
> The metadata fields are simple ascii, and in Py2 something like `if
> header[FIELD_TYPE] == 'C'` did the job just fine. In Py3 that compares an
> int (67) to the unicode letter 'C' and returns False. For me this is simply
> a major annoyance, but I only have a handful of places where I have to deal
> with this. Dealing with protocols where bytes is the norm and embedded
> ascii is prevalent -- well, I can easily imagine the nightmare.
It can't be both things. It's either bytes or it's text. If it's text,
then decoding it as ascii will give you a Unicode string; if it's
small unsigned integers that just happen to correspond to ASCII
values, then I would say the right thing to do is integer constants -
or, in Python 3.4, an integer enumeration:
>>> socket.AF_INET
<AddressFamily.AF_INET: 2>
>>> socket.AF_INET == 2
True
I'm not sure what FIELD_TYPE of 'C' means, but my guess is that it's a
CHAR field. I'd just have that as the name, something like:
CHAR = b'C'[0]
if header[FIELD_TYPE] == CHAR:
    # handle char field
If nothing else, this would reduce the number of places where you
actually have to handle this. Plus, the code above will work on many
versions of Python (I'm not sure how far back the b'' prefix is
allowed - probably 2.6).
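A minimal sketch of the integer-constant approach described above, assuming a made-up five-byte header whose type code sits at offset 4. `FieldType`, `header`, and the offset are illustrative names and values, not part of any real file format, and Python 3.4 or later is assumed for `enum`.

```python
from enum import IntEnum


class FieldType(IntEnum):
    """Hypothetical one-byte type codes that happen to be ASCII letters."""
    CHAR = ord("C")      # 67
    NUMERIC = ord("N")   # 78


FIELD_TYPE = 4           # hypothetical offset of the type byte
header = b"NAMEC"        # made-up header record

# Indexing bytes in Python 3 yields an int, so compare against an int.
type_code = header[FIELD_TYPE]
is_char = type_code == FieldType.CHAR

# Slicing yields bytes, so a bytes literal also compares equal.
is_char_by_slice = header[FIELD_TYPE:FIELD_TYPE + 1] == b"C"

# Converting the raw int gives a self-describing enum member.
field_type = FieldType(type_code)
```

Either comparison keeps the "is this an int or a letter?" decision in one named place instead of scattering `'C'` literals through the code.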
ChrisA
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-01-05 23:24 -0500 |
| Message-ID | <roy-7ED5DF.23241105012014@news.panix.com> |
| In reply to | #63262 |
In article <mailman.5001.1388976943.18130.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> It can't be both things. It's either bytes or it's text.
I've never used Python 3, so forgive me if these are naive questions.
Let's say you had an input stream which contained the following hex
values:
$ hexdump data
0000000 d7 a8 a3 88 96 95
That's EBCDIC for "Python". What would I write in Python 3 to read that
file and print it back out as utf-8 encoded Unicode?
Or, how about a slightly different example:
$ hexdump data
0000000 43 6c 67 75 62 61
That's "Python" in rot-13 encoded ascii. How would I turn that into
cleartext Unicode in Python 3?
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-01-05 22:41 -0600 |
| Message-ID | <mailman.5004.1388983234.18130.python-list@python.org> |
| In reply to | #63265 |
On 2014-01-05 23:24, Roy Smith wrote:
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
>
> That's EBCDIC for "Python". What would I write in Python 3 to read
> that file and print it back out as utf-8 encoded Unicode?
>
> Or, how about a slightly different example:
>
> $ hexdump data
> 0000000 43 6c 67 75 62 61
>
> That's "Python" in rot-13 encoded ascii. How would I turn that
> into cleartext Unicode in Python 3?
tim@laptop$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s1 = b'\xd7\xa8\xa3\x88\x96\x95'
>>> s1.decode('ebcdic-cp-be')
'Python'
>>> s2 = b'\x43\x6c\x67\x75\x62\x61'
>>> from codecs import getencoder
>>> getencoder("rot-13")(s2.decode('utf-8'))[0]
'Python'
-tkc
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-01-05 23:49 -0500 |
| Message-ID | <roy-13C7CE.23494705012014@news.panix.com> |
| In reply to | #63267 |
In article <mailman.5004.1388983234.18130.python-list@python.org>,
Tim Chase <python.list@tim.thechases.com> wrote:
> On 2014-01-05 23:24, Roy Smith wrote:
> > $ hexdump data
> > 0000000 d7 a8 a3 88 96 95
> >
> > That's EBCDIC for "Python". What would I write in Python 3 to read
> > that file and print it back out as utf-8 encoded Unicode?
> >
> > Or, how about a slightly different example:
> >
> > $ hexdump data
> > 0000000 43 6c 67 75 62 61
> >
> > That's "Python" in rot-13 encoded ascii. How would I turn that
> > into cleartext Unicode in Python 3?
>
>
> tim@laptop$ python3
> Python 3.2.3 (default, Feb 20 2013, 14:44:27)
> [GCC 4.7.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s1 = b'\xd7\xa8\xa3\x88\x96\x95'
> >>> s1.decode('ebcdic-cp-be')
> 'Python'
> >>> s2 = b'\x43\x6c\x67\x75\x62\x61'
> >>> from codecs import getencoder
> >>> getencoder("rot-13")(s2.decode('utf-8'))[0]
> 'Python'
>
> -tkc
Thanks. But, I see I didn't formulate my problem statement well. I was
(naively) assuming there wouldn't be a built-in codec for rot-13. Let's
assume there isn't; I was trying to find a case where you had to treat
the data as integers in one place and text in another. How would you do
that?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-06 15:59 +1100 |
| Message-ID | <mailman.5006.1388984378.18130.python-list@python.org> |
| In reply to | #63268 |
On Mon, Jan 6, 2014 at 3:49 PM, Roy Smith <roy@panix.com> wrote:
> Thanks. But, I see I didn't formulate my problem statement well. I was
> (naively) assuming there wouldn't be a built-in codec for rot-13. Let's
> assume there isn't; I was trying to find a case where you had to treat
> the data as integers in one place and text in another. How would you do
> that?
I assumed that you would have checked that one, and answered
accordingly :) Though I did dig into the EBCDIC part of the question.
My thinking is that, if you're working with integers, you probably
mean either bytes (so encode it before you do stuff - typical for
crypto) or codepoints / Unicode ordinals (so use ord()/chr()). In
other languages there are ways to treat strings as though they were
arrays of integers (lots of C-derived languages treat 'a' as 97 and
"a"[0] as 97 also; some extend this to the full Unicode range), and
even there, I almost never actually use that identity much. There's
only one case that I can think of where I did a lot of
string<->integer-array transmutation, and that was using a diff
function that expected an integer array - if the transformation to and
from strings hadn't been really easy, that function would probably
have been written to take strings.
The Py2 str.translate() method was a little clunky to use, but
presumably fast to execute - you build up a lookup table and translate
through that. The Py3 equivalent takes a dict mapping the from and to
values. Pretty easy to use. And it lets you work with codepoints or
strings, as you please.
ChrisA
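One way to read the advice above, as a hedged sketch: keep the data as bytes while doing arithmetic on it, then decode once when text is needed. The `checksum` helper and the sample payload are invented for illustration.

```python
def checksum(payload: bytes) -> int:
    """Sum the byte values modulo 256; iterating bytes yields ints in Python 3."""
    return sum(payload) % 256


payload = b"Clguba"   # raw bytes, e.g. read from a socket or a binary file

# Integer view: arithmetic directly on the byte values.
check = checksum(payload)

# Text view: decode once, then work with code points via ord()/chr().
text = payload.decode("ascii")
code_points = [ord(ch) for ch in text]
round_trip = "".join(chr(cp) for cp in code_points)
```

The point of the split is that each value has exactly one type at any moment: `payload` is bytes, `text` is str, and the conversion between them is explicit.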
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-06 15:51 +1100 |
| Message-ID | <mailman.5005.1388983873.18130.python-list@python.org> |
| In reply to | #63265 |
On Mon, Jan 6, 2014 at 3:24 PM, Roy Smith <roy@panix.com> wrote:
> I've never used Python 3, so forgive me if these are naive questions.
> Let's say you had an input stream which contained the following hex
> values:
>
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
>
> That's EBCDIC for "Python". What would I write in Python 3 to read that
> file and print it back out as utf-8 encoded Unicode?
*deletes the two paragraphs that used to be here* Turns out Python 3
_does_ have an EBCDIC decoder... but it's not called EBCDIC.
>>> b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500")
'Python'
This sounds like a good one for getting an alias, either "ebcdic" or
"EBCDIC". I didn't know that this was possible till I googled the
problem and saw someone else's solution.
To print that out as UTF-8, just decode and then encode:
>>> b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500").encode("utf-8")
b'Python'
In the specific case of files on the disk, you could open them with
encodings specified, in which case you don't need to worry about the
details.
with open("data", encoding="cp500") as infile:
    with open("data_utf8", "w", encoding="utf-8") as outfile:
        outfile.write(infile.read())
Of course, this is assuming that Unicode has a perfect mapping for
every EBCDIC character. I'm not familiar enough with EBCDIC to be sure
that that's true, but I strongly suspect it is. And if it's not,
you'll get an exception somewhere along the way, so you'll know
something's gone wrong. (In theory, a "transcode" function might be
able to give you a warning before it even sees your data -
transcode("utf-8", "iso-8859-3") could alert you to the possibility
that not everything in the source character set can be encoded. But
that's a pretty esoteric requirement.)
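A small sketch of the "you'll get an exception" behaviour, assuming a hypothetical `transcode` helper (the name is borrowed from the paragraph above; no such function exists in the standard library). It relies on the default `errors="strict"` handling of `decode()` and `encode()`.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode from one encoding and re-encode in another, failing loudly."""
    return data.decode(source).encode(target)  # errors="strict" is the default


ebcdic = b"\xd7\xa8\xa3\x88\x96\x95"
as_utf8 = transcode(ebcdic, "cp500", "utf-8")

# The euro sign has no slot in ISO 8859-3, so this direction raises.
try:
    transcode("\u20ac".encode("utf-8"), "utf-8", "iso-8859-3")
    failed = False
except UnicodeEncodeError:
    failed = True
```

This only reports the problem once offending data is actually seen; the up-front warning imagined above would need knowledge of both character repertoires.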
> Or, how about a slightly different example:
>
> $ hexdump data
> 0000000 43 6c 67 75 62 61
>
> That's "Python" in rot-13 encoded ascii. How would I turn that into
> cleartext Unicode in Python 3?
That's one of the points that's under dispute. Is rot13 a
bytes<->bytes encoding, or is it str<->str, or is it bytes<->str? The
issue isn't clear. Personally, I think it makes good sense as a
str<->str translation, which would mean that the process would be
somewhat thus:
>>> rot13={}
>>> for i in range(13):
...     rot13[65+i]=65+i+13
...     rot13[65+i+13]=65+i
...     rot13[97+i]=97+i+13
...     rot13[97+i+13]=97+i
...
>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to turn a hex dump into a bytes literal?
>>> data.decode().translate(rot13)
'Python'
This is treating rot13 as a translation of Unicode codepoints to other
Unicode codepoints, which is different from an encode operation (which
takes abstract Unicode data and produces concrete bytes) or a decode
operation (which does the reverse). But this is definitely a grey
area. It's common for cryptographic algorithms to work with bytes,
meaning that their "decoded" text is still bytes. (Or even less than
bytes. The famous Enigma machines from World War II worked with the 26
letters as their domain and range.) Should the Python codecs module
restrict itself to the job of translating between bytes and str, or is
it a tidy place to put those other translations as well?
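For contrast with the str<->str version, here is a sketch of the same rotation treated as a bytes<->bytes operation, using `bytes.maketrans` and `bytes.translate`. The names `ROT13_BYTES` and `rot13_bytes` are made up for this example.

```python
import string

lower = string.ascii_lowercase.encode("ascii")
upper = string.ascii_uppercase.encode("ascii")

# bytes.maketrans builds a 256-byte lookup table for bytes.translate.
ROT13_BYTES = bytes.maketrans(
    lower + upper,
    lower[13:] + lower[:13] + upper[13:] + upper[:13],
)


def rot13_bytes(data: bytes) -> bytes:
    """Rotate ASCII letters by 13 places, leaving every other byte alone."""
    return data.translate(ROT13_BYTES)


plain = rot13_bytes(b"\x43\x6c\x67\x75\x62\x61")
```

Both readings give the same letters for ASCII input, which is part of why the question of where rot13 belongs has no obvious answer.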
ChrisA
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-01-06 05:49 -0600 |
| Message-ID | <mailman.5012.1389008911.18130.python-list@python.org> |
| In reply to | #63265 |
On 2014-01-06 15:51, Chris Angelico wrote:
> >>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to
> >>> turn a hex dump into a bytes literal?
Depends on how you source them:
# space separated:
>>> s1 = "43 6c 67 75 62 61"
>>> ''.join(chr(int(pair, 16)) for pair in s1.split())
'Clguba'
# all smooshed together:
>>> s2 = s1.replace(' ','')
>>> s2
'436c67756261'
>>> ''.join(chr(int(s2[i*2:(i+1)*2], 16)) for i in range(len(s2)//2))
'Clguba'
# as \xHH escaped:
>>> s3 = ''.join('\\x'+s2[i*2:(i+1)*2] for i in range(len(s2)//2))
>>> print(s3)
\x43\x6c\x67\x75\x62\x61
>>> b3 = s3.encode('ascii')
>>> print(b3)
b'\\x43\\x6c\\x67\\x75\\x62\\x61'
>>> b3.decode('unicode_escape')
'Clguba'
It might get more complex if you're not just dealing with bytes, or
if you have some other encoding scheme, but "s1" (space-separated, or
some other delimiter such as colon-separated that can be passed
to the .split() call) and "s2" (all smooshed together) are the two I
encounter most frequently.
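The snippets above produce str; if the goal is a bytes object, as in the original question, a Python 3 sketch along the same lines could look like this. `parse_hex` and `parse_packed_hex` are hypothetical helper names.

```python
def parse_hex(dump: str, sep=None) -> bytes:
    """Turn delimited hex pairs into bytes; sep=None splits on whitespace."""
    return bytes(int(pair, 16) for pair in dump.split(sep))


def parse_packed_hex(dump: str) -> bytes:
    """Turn an undelimited run of hex digits into bytes, two digits at a time."""
    if len(dump) % 2:
        raise ValueError("odd number of hex digits")
    return bytes(int(dump[i:i + 2], 16) for i in range(0, len(dump), 2))


spaced = parse_hex("43 6c 67 75 62 61")
coloned = parse_hex("43:6c:67:75:62:61", ":")
packed = parse_packed_hex("436c67756261")
```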
-tkc
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-07 03:24 +1100 |
| Message-ID | <52cad8b4$0$29984$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #63265 |
Roy Smith wrote:
> In article <mailman.5001.1388976943.18130.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
>> It can't be both things. It's either bytes or it's text.
>
> I've never used Python 3, so forgive me if these are naive questions.
> Let's say you had an input stream which contained the following hex
> values:
>
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
>
> That's EBCDIC for "Python". What would I write in Python 3 to read that
> file and print it back out as utf-8 encoded Unicode?
There's no one EBCDIC encoding. Like the so-called "extended ASCII"
or "ANSI" encodings that followed, IBM had many different versions of
EBCDIC customised for different machines and markets -- only even more
poorly documented. But since the characters in that are all US English
letters, any EBCDIC dialect ought to do it:
py> b = b'\xd7\xa8\xa3\x88\x96\x95'
py> b.decode('CP500')
'Python'
To read it from a file:
text = open("somefile", encoding='CP500').read()
And to print out the UTF-8 encoded bytes:
print(text.encode('utf-8'))
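To illustrate the "no one EBCDIC" point, this sketch compares two of the EBCDIC code pages that ship with Python, `cp037` and `cp500`. It checks that the sample letters decode identically and collects the byte values where the two tables disagree; the variable names are only illustrative.

```python
sample = b"\xd7\xa8\xa3\x88\x96\x95"

# The letters decode identically under both code pages.
same_letters = sample.decode("cp037") == sample.decode("cp500")

# Walk all 256 byte values and collect those the two code pages disagree on.
differing = [
    byte for byte in range(256)
    if bytes([byte]).decode("cp037") != bytes([byte]).decode("cp500")
]
```

If `differing` is non-empty, picking the wrong dialect silently changes some punctuation even though plain letters look fine, so it is worth knowing which code page the data actually uses.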
> Or, how about a slightly different example:
>
> $ hexdump data
> 0000000 43 6c 67 75 62 61
>
> That's "Python" in rot-13 encoded ascii. How would I turn that into
> cleartext Unicode in Python 3?
In Python 3.3, you can do this:
py> b = b'\x43\x6c\x67\x75\x62\x61'
py> s = b.decode('ascii')
py> print(s)
Clguba
py> import codecs
py> codecs.decode(s, 'rot-13')
'Python'
(This may not work in Python 3.1 or 3.2, since rot13 and assorted other
string-to-string and byte-to-byte codecs were mistakenly removed. I say
mistakenly, not in the sense of "by accident", but in the sense of "it was
an error of judgement". Somebody was under the misapprehension that the
codec machinery could only work on Unicode <-> bytes.)
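A short sketch of how this looks on Python 3.4 and later (earlier 3.x releases behave differently, as noted above): the `codecs` module functions accept the str-to-str transform, while `str.encode()` refuses it because it only handles text encodings.

```python
import codecs

scrambled = codecs.encode("Python", "rot_13")   # str -> str transform
restored = codecs.decode(scrambled, "rot_13")

# str.encode() only accepts text encodings (str -> bytes), so it refuses rot_13.
try:
    "Python".encode("rot_13")
    refused = False
except LookupError:
    refused = True
```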
If you don't want to use the codec, you can do it by hand:
def rot13(astring):
    result = []
    for c in astring:
        i = ord(c)
        if ord('a') <= i <= ord('m') or ord('A') <= i <= ord('M'):
            i += 13
        elif ord('n') <= i <= ord('z') or ord('N') <= i <= ord('Z'):
            i -= 13
        result.append(chr(i))
    return ''.join(result)
But why would you want to do it the slow way?
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-07 03:30 +1100 |
| Message-ID | <mailman.5027.1389025821.18130.python-list@python.org> |
| In reply to | #63293 |
On Tue, Jan 7, 2014 at 3:24 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> If you don't want to use the codec, you can do it by hand:
>
> def rot13(astring):
>     result = []
>     for c in astring:
>         i = ord(c)
>         if ord('a') <= i <= ord('m') or ord('A') <= i <= ord('M'):
>             i += 13
>         elif ord('n') <= i <= ord('z') or ord('N') <= i <= ord('Z'):
>             i -= 13
>         result.append(chr(i))
>     return ''.join(result)
>
> But why would you want to do it the slow way?
Eww. I'd much rather use .translate() than that :)
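For reference, a sketch of what the `.translate()` route could look like with `str.maketrans` doing the table building; the names `ROT13` and `rot13` are chosen for this example only.

```python
import string

_lower = string.ascii_lowercase
_upper = string.ascii_uppercase

# str.maketrans returns a dict mapping code points to code points.
ROT13 = str.maketrans(
    _lower + _upper,
    _lower[13:] + _lower[:13] + _upper[13:] + _upper[:13],
)


def rot13(text: str) -> str:
    """Rotate ASCII letters by 13 places; other characters pass through."""
    return text.translate(ROT13)
```

The table is built once at import time, so each call is a single pass over the string with no per-character Python code.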
ChrisA
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2014-01-06 22:20 +0200 |
| Message-ID | <mailman.5053.1389039628.18130.python-list@python.org> |
| In reply to | #63265 |
06.01.14 06:51, Chris Angelico wrote:
>>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to turn a hex dump into a bytes literal?
>>> bytes.fromhex('43 6c 67 75 62 61')
b'Clguba'
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2014-01-06 22:21 +0200 |
| Message-ID | <mailman.5054.1389039907.18130.python-list@python.org> |
| In reply to | #63265 |
06.01.14 06:41, Tim Chase wrote:
>>>> from codecs import getencoder
>>>> getencoder("rot-13")(s2.decode('utf-8'))[0]
> 'Python'
codecs.decode(s2.decode(), 'rot13')
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-01-06 14:42 -0600 |
| Message-ID | <mailman.5057.1389040883.18130.python-list@python.org> |
| In reply to | #63265 |
On 2014-01-06 22:20, Serhiy Storchaka wrote:
> >>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to
> >>>> turn a hex dump into a bytes literal?
>
> >>> bytes.fromhex('43 6c 67 75 62 61')
> b'Clguba'
Very nice new functionality in Py3k, but 2.x doesn't seem to have such
a method. :-(
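For code that has to run on 2.x as well, `binascii.unhexlify` is one option; it does not skip whitespace, so this sketch strips spaces first. `hex_to_bytes` is a hypothetical helper name, and Python 2.6+ or 3.3+ is assumed so that a plain string argument is accepted.

```python
import binascii


def hex_to_bytes(dump):
    """Decode a hex dump, ignoring spaces, on both Python 2.6+ and Python 3.3+."""
    return binascii.unhexlify(dump.replace(" ", ""))


data = hex_to_bytes("43 6c 67 75 62 61")
```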
-tkc
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2014-01-06 20:47 +0000 |
| Message-ID | <mailman.5059.1389041257.18130.python-list@python.org> |
| In reply to | #63265 |
On 06/01/2014 20:42, Tim Chase wrote:
> On 2014-01-06 22:20, Serhiy Storchaka wrote:
>>>>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to
>>>>>> turn a hex dump into a bytes literal?
>>
>> >>> bytes.fromhex('43 6c 67 75 62 61')
>> b'Clguba'
>
> Very nice new functionality in Py3k, but 2.x doesn't seem to have such
> a method. :-(
>
> -tkc
>
Seems like another mistake, that'll have to be regressed to make sure
there is Python 2 and Python 3 compatibility, which can then be
reintroduced into Python 2.8 so that it gets back into Python 3.
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-07 10:06 +1100 |
| Message-ID | <mailman.5088.1389049571.18130.python-list@python.org> |
| In reply to | #63265 |
On Tue, Jan 7, 2014 at 7:42 AM, Tim Chase <python.list@tim.thechases.com> wrote:
> On 2014-01-06 22:20, Serhiy Storchaka wrote:
>> >>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to
>> >>>> turn a hex dump into a bytes literal?
>>
>> >>> bytes.fromhex('43 6c 67 75 62 61')
>> b'Clguba'
>
> Very nice new functionality in Py3k, but 2.x doesn't seem to have such
> a method. :-(
Thanks, Serhiy. Very nice new functionality indeed, and not having it
in 2.x isn't a problem to me. That's exactly what I was looking for -
it doesn't insist on (or complain about) separators between bytes.
(Though the error from putting a space _inside_ a byte is a little
confusing. But that's trivial.)
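A quick sketch of the behaviour described here: `bytes.fromhex` accepts the digits with or without spaces between bytes, and raises `ValueError` when a space splits a two-digit pair.

```python
spaced = bytes.fromhex("43 6c 67 75 62 61")
packed = bytes.fromhex("436c67756261")

# A space that splits a two-digit pair is rejected with ValueError.
try:
    bytes.fromhex("4 3")
    rejected = False
except ValueError:
    rejected = True
```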
ChrisA