Groups > comp.lang.python > #26656 > unrolled thread
| Started by | Mok-Kong Shen <mok-kong.shen@t-online.de> |
|---|---|
| First post | 2012-08-06 22:46 +0200 |
| Last post | 2012-08-07 13:17 -0700 |
| Articles | 8 — 6 participants |
[newbie] String to binary conversion Mok-Kong Shen <mok-kong.shen@t-online.de> - 2012-08-06 22:46 +0200
Re: [newbie] String to binary conversion Tobiah <toby@tobiah.org> - 2012-08-06 13:59 -0700
Re: [newbie] String to binary conversion Tobiah <toby@tobiah.org> - 2012-08-06 14:01 -0700
Re: [newbie] String to binary conversion Mok-Kong Shen <mok-kong.shen@t-online.de> - 2012-08-06 23:33 +0200
Re: [newbie] String to binary conversion MRAB <python@mrabarnett.plus.com> - 2012-08-06 22:56 +0100
Re: [newbie] String to binary conversion Emile van Sebille <emile@fenx.com> - 2012-08-06 15:45 -0700
Re: [newbie] String to binary conversion Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-07 02:01 +0000
Re: [newbie] String to binary conversion 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-07 13:17 -0700
| From | Mok-Kong Shen <mok-kong.shen@t-online.de> |
|---|---|
| Date | 2012-08-06 22:46 +0200 |
| Subject | [newbie] String to binary conversion |
| Message-ID | <jvpafd$vig$1@news.albasani.net> |
If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.
M. K. Shen
| From | Tobiah <toby@tobiah.org> |
|---|---|
| Date | 2012-08-06 13:59 -0700 |
| Message-ID | <0fWTr.70$Bw1.65@newsfe05.iad> |
| In reply to | #26656 |
The binascii module looks like it might have
something for you. I've never used it.
Tobiah
http://docs.python.org/library/binascii.html
On 08/06/2012 01:46 PM, Mok-Kong Shen wrote:
>
> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best
> obtain that integer and from that integer backwards again obtain the
> original string? Thanks in advance.
>
> M. K. Shen
| From | Tobiah <toby@tobiah.org> |
|---|---|
| Date | 2012-08-06 14:01 -0700 |
| Message-ID | <igWTr.71$Bw1.43@newsfe05.iad> |
| In reply to | #26658 |
On 08/06/2012 01:59 PM, Tobiah wrote:
> The binascii module looks like it might have
> something for you. I've never used it.
Having actually read some of that doc, I see it's
not what you want at all. Sorry.
| From | Mok-Kong Shen <mok-kong.shen@t-online.de> |
|---|---|
| Date | 2012-08-06 23:33 +0200 |
| Message-ID | <jvpd6o$59p$1@news.albasani.net> |
| In reply to | #26658 |
On 06.08.2012 22:59, Tobiah wrote:
> The binascii module looks like it might have
> something for you. I've never used it.
Thanks for the hint, but if I don't err, the module binascii doesn't
seem to work. I typed:
import binascii
and a line that's given as example in the document:
crc = binascii.crc32("hello")
but got the following error message:
TypeError: 'str' does not support the buffer interface.
The same error message appeared when I tried the other functions.
M. K. Shen
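A note on the error above: in Python 3 the binascii functions expect bytes, and a plain "hello" is a text string, which is most likely why every call failed the same way. The following is only a sketch, assuming Python 3 and an 8-bit encoding such as Latin-1; the helper names text_to_int and int_to_text are made up for illustration and are not part of binascii.

```python
import binascii

def text_to_int(text, encoding="latin-1"):
    """Read the encoded bytes of text as one big-endian integer (text must be non-empty)."""
    data = text.encode(encoding)              # str -> bytes
    return int(binascii.hexlify(data), 16)    # hex digits -> int

def int_to_text(value, length, encoding="latin-1"):
    """Inverse of text_to_int; length is the number of bytes expected."""
    hex_digits = "%0*x" % (2 * length, value)  # zero-pad to two hex digits per byte
    return binascii.unhexlify(hex_digits.encode("ascii")).decode(encoding)

# Passing bytes instead of str avoids the TypeError.
crc = binascii.crc32("hello".encode("latin-1"))

n = text_to_int("abcd")
print(hex(n))
print(int_to_text(n, 4))
```

The length argument is needed because leading zero bytes cannot be recovered from the integer alone.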
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-08-06 22:56 +0100 |
| Message-ID | <mailman.3033.1344290169.4697.python-list@python.org> |
| In reply to | #26656 |
On 06/08/2012 21:46, Mok-Kong Shen wrote:
>
> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best
> obtain that integer and from that integer backwards again obtain the
> original string? Thanks in advance.
>
Try this (Python 3, in which strings are Unicode):
>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd".encode("latin-1"))[0]
1684234849
>>> hex(_)
'0x64636261'
or this (Python 2, in which strings are bytestrings):
>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd")[0]
1684234849
>>> hex(_)
'0x64636261'
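The examples above cover only the string-to-integer direction. For the way back, a sketch along the following lines should work, assuming Python 3.2 or later (for int.from_bytes and int.to_bytes) and Latin-1 as the 8-bit encoding. The function names are illustrative only.

```python
import struct

def pack_roundtrip(text, encoding="latin-1"):
    """Little-endian 32-bit round trip with struct; text must encode to exactly 4 bytes."""
    data = text.encode(encoding)
    (value,) = struct.unpack("<I", data)
    restored = struct.pack("<I", value).decode(encoding)
    return value, restored

def to_int(text, byteorder="big", encoding="latin-1"):
    """Any length: interpret the encoded bytes as one unsigned integer."""
    return int.from_bytes(text.encode(encoding), byteorder)

def to_text(value, length, byteorder="big", encoding="latin-1"):
    """Inverse of to_int; length is the number of bytes to produce."""
    return value.to_bytes(length, byteorder).decode(encoding)

print(pack_roundtrip("abcd"))
print(to_text(to_int("longer"), 6))
```

struct is limited to fixed sizes such as 4 bytes for "I", while the int methods handle strings of any length, at the cost of having to remember the length yourself.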
| From | Emile van Sebille <emile@fenx.com> |
|---|---|
| Date | 2012-08-06 15:45 -0700 |
| Message-ID | <mailman.3038.1344293171.4697.python-list@python.org> |
| In reply to | #26656 |
On 8/6/2012 1:46 PM Mok-Kong Shen said...
>
> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best
> obtain that integer and from that integer backwards again obtain the
> original string? Thanks in advance.
It's easy to write one:
def str2val(s, _val=0):
    # Fold the characters into one integer, most significant character first.
    if len(s) > 1: return str2val(s[1:], 256*_val + ord(s[0]))
    return 256*_val + ord(s[0])
def val2str(val, _str=""):
    # Peel off the low byte; >= so that 256 itself splits into two characters.
    if val >= 256: return val2str(val//256, _str) + chr(val%256)
    return _str + chr(val)
print(str2val("abcd"))
print(val2str(str2val("abcd")))
print(val2str(str2val("good")))
print(val2str(str2val("longer")))
print(val2str(str2val("verymuchlonger")))
Flavor to taste.
Emile
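For what it's worth, the same idea can be written without recursion. This is only a sketch, assuming Python 3 and Latin-1 as the 8-bit encoding; it reuses the names str2val and val2str from the post above but is not Emile's code.

```python
def str2val(text, encoding="latin-1"):
    """Accumulate the encoded bytes into one integer, most significant byte first."""
    value = 0
    for byte in text.encode(encoding):   # iterating over bytes yields ints in Python 3
        value = value * 256 + byte
    return value

def val2str(value, encoding="latin-1"):
    """Split the integer back into bytes, least significant byte first, then reverse."""
    out = bytearray()
    while value > 0:
        value, byte = divmod(value, 256)
        out.append(byte)
    out.reverse()
    return out.decode(encoding)

print(str2val("abcd"))
print(val2str(str2val("verymuchlonger")))
```

Like the recursive version, this drops leading NUL characters, since they contribute nothing to the integer's value.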
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-07 02:01 +0000 |
| Message-ID | <502076e1$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #26656 |
On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:
> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best obtain
> that integer and from that integer backwards again obtain the original
> string? Thanks in advance.
First you have to know the encoding, as that will define the integers you
get. There are many 8-bit encodings, but of course they can't all encode
arbitrary 4-character strings. Since there are tens of thousands of
different characters, and an 8-bit encoding can only code for 256 of
them, there are many strings that an encoding cannot handle.
For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
Sticking to one-byte encodings: since most of them are compatible with
ASCII, examples with "abcd" aren't very interesting:
py> 'abcd'.encode('latin1')
b'abcd'
Even though the bytes object b'abcd' is printed as if it were a string,
it is actually treated as an array of one-byte ints:
py> b'abcd'[0]
97
Here's a more interesting example, using Python 3: it uses at least one
character (the Greek letter π) which cannot be encoded in Latin1, and two
which cannot be encoded in ASCII:
py> "aπ©d".encode('iso-8859-7')
b'a\xf0\xa9d'
Most encodings will round-trip successfully:
py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('iso-8859-7') == text
True
(although whether a round trip succeeds depends on the particular
encoding and text, not on the encode/decode machinery).
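To make the round-trip remark concrete, here is a small sketch, assuming Python 3. The helper names text_round_trips and bytes_round_trip are invented for this illustration.

```python
def text_round_trips(text, encoding):
    """True if text survives encode followed by decode in the given encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False

def bytes_round_trip(data, encoding):
    """True if arbitrary bytes survive decode followed by encode."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeDecodeError:
        return False

all_bytes = bytes(range(256))

print(text_round_trips("aπ©Z!", "iso-8859-7"))  # both π and © exist in this encoding
print(text_round_trips("aπ©Z!", "latin-1"))     # π has no Latin-1 code
print(bytes_round_trip(all_bytes, "latin-1"))   # every byte value maps to a character
print(bytes_round_trip(all_bytes, "utf-8"))     # not every byte sequence is valid UTF-8
```

Latin-1 is convenient for the original question precisely because every one of the 256 byte values decodes to some character.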
Naturally if you encode with one encoding, and then decode with another,
you are likely to get different strings:
py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('latin1')
'að©Z!'
py> data.decode('iso-8859-14')
'aŵ©Z!'
Both the encode and decode methods take an optional argument, errors,
which specifies the error handling scheme. The default is errors='strict',
which raises an exception when a character cannot be converted. Others
include 'ignore' and 'replace'.
py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
b'aZ!'
py> 'aŵðπ©Z!'.encode('ascii', 'replace')
b'a????Z!'
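A short sketch of the default 'strict' behaviour and two further handlers, assuming Python 3. 'backslashreplace' and 'xmlcharrefreplace' are standard handlers for encoding; the variable names are just for illustration.

```python
text = "aŵðπ©Z!"

# 'strict' (the default) raises UnicodeEncodeError at the first bad character.
try:
    text.encode("ascii")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start          # index of the first character that failed

# These handlers keep the information instead of discarding it.
escaped = text.encode("ascii", "backslashreplace")
as_xml = text.encode("ascii", "xmlcharrefreplace")

# On decoding, 'replace' substitutes U+FFFD for undecodable bytes.
decoded = b"a\xf0Z".decode("ascii", "replace")

print(failed_at)
print(escaped)
print(as_xml)
print(ascii(decoded))
```

Unlike 'ignore' and 'replace', the two escaping handlers are reversible in principle, because the original code points stay visible in the output.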
--
Steven
| From | 88888 Dihedral <dihedral88888@googlemail.com> |
|---|---|
| Date | 2012-08-07 13:17 -0700 |
| Message-ID | <4ce1aafc-7cf2-4687-ab0a-7aa42d01173b@googlegroups.com> |
| In reply to | #26673 |
On Tuesday, 7 August 2012 at 10:01:05 AM UTC+8, Steven D'Aprano wrote:
> On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:
> > If I have a string "abcd" then, with 8-bit encoding of each character,
> > there is a corresponding 32-bit binary integer. How could I best obtain
> > that integer and from that integer backwards again obtain the original
> > string? Thanks in advance.
> First you have to know the encoding, as that will define the integers you
> get. There are many 8-bit encodings, but of course they can't all encode
> arbitrary 4-character strings. Since there are tens of thousands of
> different characters, and an 8-bit encoding can only code for 256 of
> them, there are many strings that an encoding cannot handle.
> For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
> Sticking to one-byte encodings: since most of them are compatible with
> ASCII, examples with "abcd" aren't very interesting:
> py> 'abcd'.encode('latin1')
> b'abcd'
> Even though the bytes object b'abcd' is printed as if it were a string,
> it is actually treated as an array of one-byte ints:
> py> b'abcd'[0]
> 97
> Here's a more interesting example, using Python 3: it uses at least one
> character (the Greek letter π) which cannot be encoded in Latin1, and two
> which cannot be encoded in ASCII:
> py> "aπ©d".encode('iso-8859-7')
> b'a\xf0\xa9d'
> Most encodings will round-trip successfully:
> py> text = 'aπ©Z!'
> py> data = text.encode('iso-8859-7')
> py> data.decode('iso-8859-7') == text
> True
> (although the ability to round-trip is a property of the encoding itself,
> not of the encoding system).
> Naturally if you encode with one encoding, and then decode with another,
> you are likely to get different strings:
> py> text = 'aπ©Z!'
> py> data = text.encode('iso-8859-7')
> py> data.decode('latin1')
> 'að©Z!'
> py> data.decode('iso-8859-14')
> 'aŵ©Z!'
> Both the encode and decode methods take an optional argument, errors,
> which specify the error handling scheme. The default is errors='strict',
> which raises an exception. Others include 'ignore' and 'replace'.
> py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
> b'aZ!'
> py> 'aŵðπ©Z!'.encode('ascii', 'replace')
> b'a????Z!'
> --
> Steven
I think a UTF-8 or UTF-16 codec is necessary; just recall the MS encodings
of Win98 and NT that collected taxes all over the world.
In practice, for each character encoding there should be a codec
to and from UTF-8 or UTF-16.
That way one can convert between any two of the supported
character sets.
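In Python 3 that pivot role is played by the str type: decode from one encoding to text, then encode to the other. A minimal sketch, with a hypothetical helper named transcode:

```python
def transcode(data, source, target, errors="strict"):
    """Convert bytes from one encoding to another by way of a Unicode str."""
    return data.decode(source).encode(target, errors)

greek = "aπ©Z!".encode("iso-8859-7")

utf8 = transcode(greek, "iso-8859-7", "utf-8")
back = transcode(utf8, "utf-8", "iso-8859-7")

# Latin-1 has no π, so a lossy handler is needed for this direction.
lossy = transcode(greek, "iso-8859-7", "latin-1", errors="replace")

print(utf8)
print(back == greek)
print(lossy)
```

The conversion is lossless only when the target encoding contains every character of the text, which is why UTF-8 or UTF-16 is the safe target.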