From: Peter Daum <gator@cs.tu-berlin.de>
Newsgroups: comp.lang.python
Subject: Re: "convert" string to bytes without changing data (encoding)
Date: Thu, 29 Mar 2012 16:57:19 +0200
Lines: 44
Message-ID: <4F74784F.40804@cs.tu-berlin.de>
References: <9tg21lFmo3U1@mid.dfncis.de>	<mailman.1065.1332925364.3037.python-list@python.org>	<9tg4qoFbfpU1@mid.dfncis.de>	<mailman.1069.1332931371.3037.python-list@python.org>	<9th0u8Fuf2U1@mid.dfncis.de> <mailman.1098.1332970699.3037.python-list@python.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.1.16) Gecko/20101125 Thunderbird/3.0.11
In-Reply-To: <mailman.1098.1332970699.3037.python-list@python.org>
Xref: csiph.com comp.lang.python:22340

On 2012-03-28 23:37, Terry Reedy wrote:
> 2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
> chars. When done, encode back to 'latin-1' and the non-ascii chars will
> be as they originally were.

... actually, at the beginning of my quest, I ran into a decoding
exception when trying to read the data as "latin1" (which was more or
less what I had expected anyway, because I thought byte values between
128 and 160 were not defined there).

Obviously, I must have misinterpreted something there;
I just ran a little test:

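  b = bytes(range(256))            # all 256 possible byte values
  s = b.decode('latin1')           # bytes -> str
  b = s.encode('latin1')           # str -> bytes
  s = b.decode('latin1')           # ... and back to str once more
  for c in s:
      print(hex(ord(c)), end=' ')
      if (ord(c) + 1) % 16 == 0:   # start a new row after every 16 values
          print()
  print()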

... and got all the original bytes back. So it looks like I tried to
solve a problem that did not exist to start with (the problems I ran
into back then were pretty real, though ;-)
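
The same check can also be written as assertions instead of a printout.
This is only a minimal sketch, assuming Python 3; the names "original",
"text" and "first_bad" are made up for illustration. The cp1252
comparison is merely a guess at the kind of codec that could have raised
the exception I saw earlier; I have not confirmed that this is what
actually happened:

```python
original = bytes(range(256))       # all 256 possible byte values

# latin1 maps byte value N to code point N, so decoding cannot fail
text = original.decode('latin1')
assert [ord(c) for c in text] == list(range(256))
assert text.encode('latin1') == original   # lossless round trip

# By contrast (assumption: this is only a possible culprit), Python's
# cp1252 codec leaves a few byte values undefined and raises on them.
first_bad = None
try:
    original.decode('cp1252')
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    first_bad = exc.start          # index of the first undecodable byte
```

If the asserts pass silently, every byte survived the latin1 round trip.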

> 3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
> reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
> (using the surrogate-pair second-half code units). This is probably the
> safest in that invalid operations on the non-chars should raise an
> exception. Re-encoding with the same setting will reproduce the original
> hi-bit chars. The main danger is passing the illegal strings out of your
> local sandbox.

Unfortunately, this is a very well-kept secret unless you already know
that something with that name exists. The options currently mentioned in
the documentation are not really helpful, because the non-decodable
bytes will be lost. With some trying, I got it to work, too (the option
is named "surrogateescape", without the "_", and in Python 3.1 it exists,
but not as a keyword argument: "s=b.decode('utf-8','surrogateescape')" ...)
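
For the archives, here is roughly what the round trip looks like. It is
a minimal sketch, assuming Python 3.1 or later; the sample bytes and the
names "raw", "text", "escaped" and "leaked" are made up for
illustration. The error handler is passed positionally, as in my call
above:

```python
raw = b'abc \xe4\xff\x80 xyz'      # contains bytes that are not valid UTF-8

# Undecodable bytes are not dropped; each byte 0xNN is turned into the
# lone surrogate code point U+DCNN.
text = raw.decode('utf-8', 'surrogateescape')
escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
assert text.encode('utf-8', 'surrogateescape') == raw

# A strict encode refuses the lone surrogates, which is how such a
# string gets noticed when it leaves the local sandbox.
try:
    text.encode('utf-8')
    leaked = True
except UnicodeEncodeError:
    leaked = False
```

The ASCII parts of the string can be processed as usual in between; only
the escaped code points have to be left alone.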

Thank you very much for your constructive advice!

Regards,
                                 Peter