
[newbie] String to binary conversion

Started by	Mok-Kong Shen <mok-kong.shen@t-online.de>
First post	2012-08-06 22:46 +0200
Last post	2012-08-07 13:17 -0700
Articles	8 — 6 participants


  [newbie] String to binary conversion Mok-Kong Shen <mok-kong.shen@t-online.de> - 2012-08-06 22:46 +0200
    Re: [newbie] String to binary conversion Tobiah <toby@tobiah.org> - 2012-08-06 13:59 -0700
      Re: [newbie] String to binary conversion Tobiah <toby@tobiah.org> - 2012-08-06 14:01 -0700
      Re: [newbie] String to binary conversion Mok-Kong Shen <mok-kong.shen@t-online.de> - 2012-08-06 23:33 +0200
    Re: [newbie] String to binary conversion MRAB <python@mrabarnett.plus.com> - 2012-08-06 22:56 +0100
    Re: [newbie] String to binary conversion Emile van Sebille <emile@fenx.com> - 2012-08-06 15:45 -0700
    Re: [newbie] String to binary conversion Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-07 02:01 +0000
      Re: [newbie] String to binary conversion 88888 Dihedral <dihedral88888@googlemail.com> - 2012-08-07 13:17 -0700

#26656 — [newbie] String to binary conversion

From	Mok-Kong Shen <mok-kong.shen@t-online.de>
Date	2012-08-06 22:46 +0200
Subject	[newbie] String to binary conversion
Message-ID	<jvpafd$vig$1@news.albasani.net>

If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.

M. K. Shen


#26658

From	Tobiah <toby@tobiah.org>
Date	2012-08-06 13:59 -0700
Message-ID	<0fWTr.70$Bw1.65@newsfe05.iad>
In reply to	#26656

The binascii module looks like it might have
something for you.  I've never used it.

Tobiah

http://docs.python.org/library/binascii.html

On 08/06/2012 01:46 PM, Mok-Kong Shen wrote:
>
> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best
> obtain that integer and from that integer backwards again obtain the
> original string? Thanks in advance.
>
> M. K. Shen


#26660

From	Tobiah <toby@tobiah.org>
Date	2012-08-06 14:01 -0700
Message-ID	<igWTr.71$Bw1.43@newsfe05.iad>
In reply to	#26658

On 08/06/2012 01:59 PM, Tobiah wrote:
> The binascii module looks like it might have
> something for you. I've never used it.

Having actually read some of that doc, I see
it's not what you want at all.  Sorry.


#26661

From	Mok-Kong Shen <mok-kong.shen@t-online.de>
Date	2012-08-06 23:33 +0200
Message-ID	<jvpd6o$59p$1@news.albasani.net>
In reply to	#26658

Am 06.08.2012 22:59, schrieb Tobiah:
> The binascii module looks like it might have
> something for you.  I've never used it.

Thanks for the hint, but if I'm not mistaken, the module binascii doesn't
seem to work. I typed:

import binascii

and a line that's given as example in the document:

crc = binascii.crc32("hello")

but got the following error message:

TypeError: 'str' does not support the buffer interface.

The same error message appeared when I tried the other functions.
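That TypeError is typical of Python 3, where the binascii functions expect bytes rather than str. A minimal sketch of how the call might be made to work, assuming Python 3 and an ASCII-only string (the variable names are only for illustration):

```python
import binascii

text = "hello"
data = text.encode("ascii")      # str -> bytes; binascii wants bytes

crc = binascii.crc32(data)       # no TypeError once the argument is bytes
print(crc)

# hexlify returns the hex digits of the bytes, which int() can parse,
# so this is one roundabout way to get an integer out of "abcd"
value = int(binascii.hexlify("abcd".encode("ascii")), 16)
print(hex(value))                # 0x61626364
```

Note that the CRC is a checksum, not a reversible conversion; only the hexlify route relates to the original question.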

M. K. Shen


#26662

From	MRAB <python@mrabarnett.plus.com>
Date	2012-08-06 22:56 +0100
Message-ID	<mailman.3033.1344290169.4697.python-list@python.org>
In reply to	#26656

On 06/08/2012 21:46, Mok-Kong Shen wrote:
>
> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best
> obtain that integer and from that integer backwards again obtain the
> original string? Thanks in advance.
>
Try this (Python 3, in which strings are Unicode):
>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd".encode("latin-1"))[0]
1684234849
>>> hex(_)
'0x64636261'

or this (Python 2, in which strings are bytestrings):
>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd")[0]
1684234849
>>> hex(_)
'0x64636261'
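The question also asked for the way back from the integer to the string. A possible sketch of the reverse direction, assuming Python 3 and the same latin-1 encoding as above; struct.pack is the counterpart of struct.unpack, and the variable names are illustrative:

```python
import struct

n = struct.unpack("<I", "abcd".encode("latin-1"))[0]   # 1684234849

# pack the integer back into 4 little-endian bytes, then decode them
s = struct.pack("<I", n).decode("latin-1")
print(s)            # abcd

# big-endian variant: the first character lands in the most significant byte
n_be = struct.unpack(">I", "abcd".encode("latin-1"))[0]
print(hex(n_be))    # 0x61626364
```

The "I" format is fixed at 4 bytes, so this only covers strings of exactly four one-byte characters.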


#26668

From	Emile van Sebille <emile@fenx.com>
Date	2012-08-06 15:45 -0700
Message-ID	<mailman.3038.1344293171.4697.python-list@python.org>
In reply to	#26656

On 8/6/2012 1:46 PM Mok-Kong Shen said...
>
> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best
> obtain that integer and from that integer backwards again obtain the
> original string? Thanks in advance.

It's easy to write one:

def str2val(s, _val=0):
     # Fold the characters into one integer, most significant byte first.
     if len(s) > 1: return str2val(s[1:], 256*_val + ord(s[0]))
     return 256*_val + ord(s[0])


def val2str(val, _str=""):
     # Peel off the low byte; >= (not >) so that 256 itself is split in two.
     if val >= 256: return val2str(val // 256, _str) + chr(val % 256)
     return _str + chr(val)


print(str2val("abcd"))
print(val2str(str2val("abcd")))
print(val2str(str2val("good")))
print(val2str(str2val("longer")))
print(val2str(str2val("verymuchlonger")))

Flavor to taste.
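For comparison, a non-recursive sketch of the same idea, assuming Python 3.2 or later, where int.from_bytes and int.to_bytes are available. The function names str_to_int and int_to_str, and the choice of latin-1, are assumptions made for illustration:

```python
def str_to_int(text, encoding="latin-1"):
    """Interpret the encoded bytes of text as one big-endian integer."""
    return int.from_bytes(text.encode(encoding), "big")


def int_to_str(value, encoding="latin-1"):
    """Inverse of str_to_int; leading NUL characters cannot be recovered."""
    length = (value.bit_length() + 7) // 8   # number of bytes needed for value
    return value.to_bytes(length, "big").decode(encoding)


print(str_to_int("abcd"))                          # 1633837924
print(int_to_str(str_to_int("verymuchlonger")))    # verymuchlonger
```

Because the byte length is derived from the integer, a string that starts with "\x00" comes back without that leading character; passing the original length explicitly would avoid this.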

Emile


#26673

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-07 02:01 +0000
Message-ID	<502076e1$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#26656

On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:

> If I have a string "abcd" then, with 8-bit encoding of each character,
> there is a corresponding 32-bit binary integer. How could I best obtain
> that integer and from that integer backwards again obtain the original
> string? Thanks in advance.

First you have to know the encoding, as that will define the integers you 
get. There are many 8-bit encodings, but of course they can't all encode 
arbitrary 4-character strings. Since there are tens of thousands of 
different characters, and an 8-bit encoding can only code for 256 of 
them, there are many strings that an encoding cannot handle.

For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
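A small sketch of that limitation, assuming Python 3: latin-1 refuses the Greek letter, while UTF-8 accepts it at the cost of using more than one byte per character.

```python
text = "aπ©d"

try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    # exc.object[exc.start:exc.end] is the part that could not be encoded
    print("latin-1 cannot encode:", exc.object[exc.start:exc.end])

utf8 = text.encode("utf-8")
print(utf8)                      # b'a\xcf\x80\xc2\xa9d'
print(len(text), len(utf8))      # 4 6
```

With a multi-byte encoding, four characters no longer correspond to a 32-bit integer, which is why the encoding has to be fixed first.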

Sticking to one-byte encodings: since most of them are compatible with 
ASCII, examples with "abcd" aren't very interesting:

py> 'abcd'.encode('latin1')
b'abcd'

Even though the bytes object b'abcd' is printed as if it were a string, 
it is actually treated as an array of one-byte ints:

py> b'abcd'[0]
97
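Since the bytes behave like small ints, one way to arrive at the single integer asked about is to shift them together. A minimal sketch, assuming Python 3 and big-endian order (first character in the most significant byte):

```python
data = "abcd".encode("latin-1")
print(list(data))           # [97, 98, 99, 100]

value = 0
for byte in data:           # iterating over bytes yields ints in Python 3
    value = (value << 8) | byte
print(hex(value))           # 0x61626364
```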

Here's a more interesting example, using Python 3: it uses at least one 
character (the Greek letter π) which cannot be encoded in Latin1, and two 
which cannot be encoded in ASCII:

py> "aπ©d".encode('iso-8859-7')
b'a\xf0\xa9d'

Most encodings will round-trip successfully:

py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('iso-8859-7') == text
True

(although the ability to round-trip is a property of the encoding itself, 
not of the encoding system).
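The opposite direction, from arbitrary bytes to text and back, depends even more on the encoding chosen. A sketch under the assumption of Python 3, with a hypothetical helper round_trips: latin-1 assigns a character to every one of the 256 byte values, whereas ASCII rejects anything from 0x80 upward.

```python
def round_trips(data, encoding):
    """Return True if data survives decode followed by encode unchanged."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        return False


all_bytes = bytes(range(256))
print(round_trips(all_bytes, "latin-1"))   # True
print(round_trips(all_bytes, "ascii"))     # False
```

This is one reason latin-1 is a convenient choice when a string is only being used as a container for bytes.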

Naturally if you encode with one encoding, and then decode with another, 
you are likely to get different strings:

py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('latin1')
'að©Z!'
py> data.decode('iso-8859-14')
'aŵ©Z!'

Both the encode and decode methods take an optional argument, errors, 
which specifies the error handling scheme. The default is errors='strict', 
which raises an exception when a character cannot be encoded or a byte 
cannot be decoded. Others include 'ignore' and 'replace'.

py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
b'aZ!'
py> 'aŵðπ©Z!'.encode('ascii', 'replace')
b'a????Z!'
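A few more handlers, sketched under the assumption of Python 3: the 'strict' default raising, two escaping handlers for encoding, and 'replace' on the decoding side.

```python
text = "aπd"

try:
    text.encode("ascii")                 # errors='strict' is the default
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

xml = text.encode("ascii", "xmlcharrefreplace")
esc = text.encode("ascii", "backslashreplace")
print(xml)      # b'a&#960;d'
print(esc)      # b'a\\u03c0d'

# when decoding, 'replace' substitutes U+FFFD for undecodable bytes
decoded = b"a\xffd".decode("ascii", "replace")
print(decoded == "a\ufffdd")             # True
```

Unlike 'ignore' and 'replace', the two escaping handlers keep enough information to reconstruct the original character later.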

-- 
Steven


#26738

From	88888 Dihedral <dihedral88888@googlemail.com>
Date	2012-08-07 13:17 -0700
Message-ID	<4ce1aafc-7cf2-4687-ab0a-7aa42d01173b@googlegroups.com>
In reply to	#26673

On Tuesday, August 7, 2012 at 10:01:05 AM UTC+8, Steven D'Aprano wrote:
> On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:
>
> > If I have a string "abcd" then, with 8-bit encoding of each character,
> > there is a corresponding 32-bit binary integer. How could I best obtain
> > that integer and from that integer backwards again obtain the original
> > string? Thanks in advance.
>
> First you have to know the encoding, as that will define the integers you
> get. There are many 8-bit encodings, but of course they can't all encode
> arbitrary 4-character strings. Since there are tens of thousands of
> different characters, and an 8-bit encoding can only code for 256 of
> them, there are many strings that an encoding cannot handle.
>
> For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
>
> Sticking to one-byte encodings: since most of them are compatible with
> ASCII, examples with "abcd" aren't very interesting:
>
> py> 'abcd'.encode('latin1')
> b'abcd'
>
> Even though the bytes object b'abcd' is printed as if it were a string,
> it is actually treated as an array of one-byte ints:
>
> py> b'abcd'[0]
> 97
>
> Here's a more interesting example, using Python 3: it uses at least one
> character (the Greek letter π) which cannot be encoded in Latin1, and two
> which cannot be encoded in ASCII:
>
> py> "aπ©d".encode('iso-8859-7')
> b'a\xf0\xa9d'
>
> Most encodings will round-trip successfully:
>
> py> text = 'aπ©Z!'
> py> data = text.encode('iso-8859-7')
> py> data.decode('iso-8859-7') == text
> True
>
> (although the ability to round-trip is a property of the encoding itself,
> not of the encoding system).
>
> Naturally if you encode with one encoding, and then decode with another,
> you are likely to get different strings:
>
> py> text = 'aπ©Z!'
> py> data = text.encode('iso-8859-7')
> py> data.decode('latin1')
> 'að©Z!'
> py> data.decode('iso-8859-14')
> 'aŵ©Z!'
>
> Both the encode and decode methods take an optional argument, errors,
> which specify the error handling scheme. The default is errors='strict',
> which raises an exception. Others include 'ignore' and 'replace'.
>
> py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
> b'aZ!'
> py> 'aŵðπ©Z!'.encode('ascii', 'replace')
> b'a????Z!'
>
> -- 
> Steven



I think a UTF-8 or UTF-16 codec is necessary; just recall those MS encoding
codecs of Win98 and NT that collected taxes all over the world.

Actually, for each character encoding, please develop a codec to UTF-8 or
UTF-16.

That way, one can convert between any two of the qualified character sets.
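Python's codecs already work this way, with str as the common intermediate form. A minimal sketch, assuming Python 3; the helper name transcode is made up for illustration:

```python
def transcode(data, source, target="utf-8"):
    """Re-encode bytes from one character set to another, going through str."""
    return data.decode(source).encode(target)


greek = "aπ©d".encode("iso-8859-7")          # b'a\xf0\xa9d'
as_utf8 = transcode(greek, "iso-8859-7")
print(as_utf8)                               # b'a\xcf\x80\xc2\xa9d'
```

The conversion fails with a UnicodeEncodeError if the target character set lacks one of the characters, so it is only lossless when the target is UTF-8, UTF-16 or another encoding that covers all of Unicode.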
