From: Peter Daum
Newsgroups: comp.lang.python
Subject: Re: "convert" string to bytes without changing data (encoding)
Date: Thu, 29 Mar 2012 16:57:19 +0200

On 2012-03-28 23:37, Terry Reedy wrote:
> 2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
> chars. When done, encode back to 'latin-1' and the non-ascii chars will
> be as they originally were.

... actually, at the beginning of my quest, I ran into a decoding
exception trying to read data as "latin1" (which was more or less
what I had expected anyway, because byte values between 128 and 160
are not defined there).

Obviously, I must have misinterpreted something there;
I just ran a little test:

    l=[i for i in range(256)]; b=bytes(l)
    s=b.decode('latin1'); b=s.encode('latin1'); s=b.decode('latin1')
    for c in s:
        print(hex(ord(c)), end=' ')
        if (ord(c)+1) % 16 ==0: print("")
    print()

... and got all the original bytes back. So it looks like I tried to
solve a problem that did not exist to start with (the problems I ran
into then were pretty real, though ;-)

> 3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
> reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
> (using the surrogate-pair second-half code units).
> This is probably the safest in that invalid operations on the
> non-chars should raise an exception. Re-encoding with the same setting
> will reproduce the original hi-bit chars. The main danger is passing
> the illegal strings out of your local sandbox.

Unfortunately, this is a very well-kept secret unless you know that
something with that name exists. The options currently mentioned in
the documentation are not really helpful, because the non-decodable
bytes will be lost. With some trying, I got it to work, too (the
option is named "surrogateescape", without the "_", and in Python 3.1
it exists, but not as a keyword argument:
"s=b.decode('utf-8','surrogateescape')" ...)

Thank you very much for your constructive advice!

Regards,
Peter
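
For reference, a minimal sketch of the two round trips discussed in
this thread, written as one self-contained script. It assumes Python 3.1
or later; the variable names are illustrative and not taken from the
test above. The error handler is passed positionally, in the same way
as the "s=b.decode('utf-8','surrogateescape')" call.

```python
# All 256 possible byte values, as in the little test in this post.
raw = bytes(range(256))

# Option 2: latin-1 maps byte value n to code point n, so every byte
# decodes and the round trip is lossless.
text_latin1 = raw.decode("latin-1")
latin1_ok = text_latin1.encode("latin-1") == raw

# Option 3: with "surrogateescape", each undecodable byte (here: every
# byte >= 0x80, because the codec is ascii) becomes a lone low
# surrogate in the range U+DC80..U+DCFF.
text_escaped = raw.decode("ascii", "surrogateescape")
escaped_ok = text_escaped.encode("ascii", "surrogateescape") == raw
first_escaped = ord(text_escaped[0x80])

# Encoding without the handler refuses the lone surrogates, which is
# what keeps such strings from silently leaving the program.
try:
    text_escaped.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

print(latin1_ok, escaped_ok, hex(first_escaped), strict_rejected)
```

Both variants give the original bytes back. The difference is that the
latin-1 string looks like ordinary text, while the surrogateescape
string raises an error when it is encoded without the handler.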