Re: Why are some unicode error handlers "encode only"?

Date	2012-03-11 17:10 +0100
From	Walter Dörwald <walter@livinglogic.de>
Organization	LivingLogic AG, Bayreuth/Germany
Subject	Re: Why are some unicode error handlers "encode only"?
References	<4f5cb8c2$0$29891$c3e8da3$5496439d@news.astraweb.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.569.1331484185.3037.python-list@python.org>

On 11.03.12 15:37, Steven D'Aprano wrote:

> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this? I don't see why they shouldn't work for decoding as well.

Because xmlcharrefreplace and backslashreplace are *error* handlers:
they only run when the codec hits something it cannot process. The byte
sequence b'&#12345;', however, does *not* contain any bytes that are
undecodable for e.g. the ASCII codec. So there are no errors to handle.
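
As a small sketch of this point (the variable names are only illustrative): encoding a non-ASCII character with xmlcharrefreplace yields plain ASCII bytes, and decoding those bytes again never triggers an error handler, so the reference comes back as literal text.

```python
# U+3039 is code point 12345, which the ASCII codec cannot encode.
text = "\u3039"

# Encoding hits an error, so the handler substitutes a character reference.
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)  # b'&#12345;'

# Every byte in b'&#12345;' is valid ASCII, so strict decoding succeeds
# and no error handler is ever consulted.
decoded = encoded.decode("ascii", "strict")
print(decoded)  # &#12345;
```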

> Consider this example using Python 3.2:
>
>>>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?

The byte sequence b'\xe9!', however, is not something that the
backslashreplace error handler would have produced. It would have
produced b'\\xe9!' (a sequence of 5 bytes), and that would probably
decode without any problems with the cp932 codec.
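
To make the difference concrete, here is a minimal sketch (names are illustrative) comparing the two byte strings: the output of backslashreplace on the encoding side, and the raw two-byte sequence from the quoted example.

```python
# What backslashreplace emits when U+00E9 cannot be encoded to ASCII:
# a literal backslash followed by 'x', 'e', '9', then the '!'.
escaped = "\xe9!".encode("ascii", "backslashreplace")
print(escaped, len(escaped))  # b'\\xe9!' 5

# The sequence from the quoted example is only two bytes: 0xE9 and '!'.
raw = b"\xe9!"
print(raw, len(raw))  # b'\xe9!' 2

# The escaped form consists of single-byte characters only, so cp932
# decodes it without raising.
decoded = escaped.decode("cp932")
print(len(decoded))  # 5
```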

> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> =>  r'aaa--騷--\xe9\x21--bbb'
>
> and similarly for xmlcharrefreplace.

This would require a postprocessing step *after* the bytes have been
decoded. This is IMHO out of scope for Python's codec machinery.
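
A rough sketch of what such a separate step could look like for character references, assuming the standard library's html.unescape is an acceptable way to resolve them (this is an illustration, not part of the codec machinery):

```python
import html

# Step 1: ordinary decoding. The reference is just ASCII text at this point.
decoded = b"aaa &#12345; bbb".decode("ascii")
print(decoded)  # aaa &#12345; bbb

# Step 2: a separate pass over the decoded string resolves the reference.
restored = html.unescape(decoded)
print(restored == "aaa \u3039 bbb")  # True
```

The two steps are independent: the codec only turns bytes into text, and resolving the references is ordinary string processing afterwards.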

Servus,
    Walter