
Why are some unicode error handlers "encode only"?

Started by: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post: 2012-03-11 14:37 +0000
Last post: 2012-03-11 13:10 -0400
Articles: 3 (3 participants)



Contents

  Why are some unicode error handlers "encode only"? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-11 14:37 +0000
    Re: Why are some unicode error handlers "encode only"? Walter Dörwald <walter@livinglogic.de> - 2012-03-11 17:10 +0100
    Re: Why are some unicode error handlers "encode only"? Terry Reedy <tjreedy@udel.edu> - 2012-03-11 13:10 -0400

#21494 — Why are some unicode error handlers "encode only"?

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-03-11 14:37 +0000
Subject: Why are some unicode error handlers "encode only"?
Message-ID: <4f5cb8c2$0$29891$c3e8da3$5496439d@news.astraweb.com>

At least two standard error handlers are documented as working for 
encoding only:

xmlcharrefreplace
backslashreplace

See http://docs.python.org/library/codecs.html#codec-base-classes

and http://docs.python.org/py3k/library/codecs.html

Why is this? I don't see why they shouldn't work for decoding as well. 
Consider this example using Python 3.2:

>>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10: 
illegal multibyte sequence

The two bytes b'\xe9!' form an illegal multibyte sequence for CP-932 (also 
known as MS-KANJI, Microsoft's variant of SHIFT-JIS). Is there some reason 
why decoding with these handlers shouldn't or can't be supported?

# This doesn't actually work.
b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
=> r'aaa--騷--\xe9\x21--bbb'

and similarly for xmlcharrefreplace.
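
For illustration, a decode-side handler along these lines can be registered by hand with codecs.register_error. This is only a minimal sketch, not a standard handler: the handler name "hexescape_decode" and the function hex_escape_on_decode are made up for this example. The span reported in exc.start and exc.end (and therefore whether the '!' is escaped along with the \xe9) may differ between codecs and Python versions, so no exact output is claimed.

```python
import codecs


def hex_escape_on_decode(exc):
    """Hypothetical handler: replace undecodable bytes with \\xNN escapes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only deals with decoding errors
    bad = exc.object[exc.start:exc.end]  # the bytes the codec rejected
    replacement = "".join("\\x%02x" % byte for byte in bad)
    # Return the replacement text and the position at which to resume decoding.
    return replacement, exc.end


# "hexescape_decode" is an invented name, not a built-in handler.
codecs.register_error("hexescape_decode", hex_escape_on_decode)

data = b"aaa--\xe9z--\xe9!--bbb"
result = data.decode("cp932", "hexescape_decode")
print(ascii(result))
```

Later Python releases (3.5 and up, according to the codecs documentation) also accept "backslashreplace" itself when decoding, so a hand-written handler is mainly of interest on older versions or for custom escape formats.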



-- 
Steven



#21498

From: Walter Dörwald <walter@livinglogic.de>
Date: 2012-03-11 17:10 +0100
Message-ID: <mailman.569.1331484185.3037.python-list@python.org>
In reply to: #21494

On 11.03.12 15:37, Steven D'Aprano wrote:

> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this? I don't see why they shouldn't work for decoding as well.

Because xmlcharrefreplace and backslashreplace are *error* handlers. 
However the bytes sequence b'&#12345;' does *not* contain any bytes that 
are not decodable for e.g. the ASCII codec. So there are no errors to 
handle.
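
A small sketch of this point, using a sample string chosen for the example (it is not from the thread): both handlers turn unencodable characters into plain ASCII text, so decoding the result afterwards raises no error and therefore never invokes an error handler.

```python
# Sample text (an assumption for this sketch): U+00E9 and U+3039 are not ASCII.
text = "caf\u00e9 \u3039"

as_xml = text.encode("ascii", "xmlcharrefreplace")
as_backslash = text.encode("ascii", "backslashreplace")
print(as_xml)        # b'caf&#233; &#12345;'
print(as_backslash)  # b'caf\\xe9 \\u3039'

# The encoded bytes are pure ASCII, so decoding succeeds without any handler.
# The character references come back as literal text, not as the characters.
decoded = as_xml.decode("ascii")
print(decoded)       # caf&#233; &#12345;
```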

> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?

The byte sequence b'\xe9!' however is not something that would have been 
produced by the backslashreplace error handler. b'\\xe9!' (a sequence 
containing 5 bytes) would have been (and this probably would decode 
without any problems with the cp932 codec).
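
The difference between the two byte strings can be checked directly. This is a minimal sketch; the variable names are chosen for the example.

```python
raw = b"\xe9!"        # 2 bytes: 0xE9 followed by '!'
escaped = b"\\xe9!"   # 5 bytes: backslash, 'x', 'e', '9', '!'
print(len(raw), len(escaped))  # 2 5

# The escaped form is ordinary ASCII text and decodes cleanly.
print(escaped.decode("cp932"))  # \xe9!

# The raw form is the one that cp932 rejects.
raw_failed = False
try:
    raw.decode("cp932")
except UnicodeDecodeError:
    raw_failed = True
print(raw_failed)  # True
```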

> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> =>  r'aaa--騷--\xe9\x21--bbb'
>
> and similarly for xmlcharrefreplace.

This would require a postprocess step *after* the bytes have been 
decoded. This is IMHO out of scope for Python's codec machinery.

Servus,
    Walter



#21499

From: Terry Reedy <tjreedy@udel.edu>
Date: 2012-03-11 13:10 -0400
Message-ID: <mailman.570.1331485852.3037.python-list@python.org>
In reply to: #21494

On 3/11/2012 10:37 AM, Steven D'Aprano wrote:
> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this?

I presume the purpose of both is to make it possible to send Unicode 
text through a limited byte encoding: characters that do not fit in the 
given encoding are replaced with an ASCII byte sequence that does fit.

> I don't see why they shouldn't work for decoding as well.
> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?
>
> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> =>  r'aaa--騷--\xe9\x21--bbb'

This output does not round-trip and would be a bit of a fib since it 
somewhat misrepresents what the encoded bytes were:

 >>> r'aaa--騷--\xe9\x21--bbb'.encode("cp932")
b'aaa--\xe9z--\\xe9\\x21--bbb'
 >>> b'aaa--\xe9z--\\xe9\\x21--bbb'.decode("cp932")
'aaa--騷--\\xe9\\x21--bbb'

Python 3.1 added the surrogateescape error handler (PEP 383) to solve 
this round-trip problem.
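
A minimal sketch of the round trip with surrogateescape. To keep it simple it uses the UTF-8 codec and a short sample byte string rather than the cp932 data above; both choices are assumptions made for this example.

```python
raw = b"--\xe9!--"  # 0xE9 followed by '!' is not valid UTF-8

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) in the str.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))  # '--\udce9!--'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)  # True

# A backslash-style escape in the text, by contrast, does not round-trip.
lossy = "--\\xe9!--"
print(lossy.encode("utf-8") == raw)  # False
```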

> and similarly for xmlcharrefreplace.

Since xml character references are representations of unicode chars, and 
not bytes, I do not see how that would work. By analogy, perhaps you 
mean to have '&#e9;&#21;' in your output instead of '\xe9\x21', but 
those would not properly be xml numeric character references.
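
To illustrate that a numeric character reference names a Unicode code point rather than a byte value, here is a short sketch. The sample character is chosen for the example, and html.unescape is used only as a convenient way to resolve references.

```python
import html

ch = "\u00e9"  # sample character, code point 233 (0xE9)

# The reference carries the code point in decimal.
ref = ch.encode("ascii", "xmlcharrefreplace")
print(ref)  # b'&#233;'

# The bytes of the character depend on the encoding; the reference does not.
print(ch.encode("latin-1"))  # b'\xe9'
print(ch.encode("utf-8"))    # b'\xc3\xa9'

# Well-formed references are decimal (&#233;) or hexadecimal with an 'x' (&#xe9;).
print(html.unescape("&#233;") == ch)  # True
print(html.unescape("&#xe9;") == ch)  # True
```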

-- 
Terry Jan Reedy
