| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Subject | Re: Why are some unicode error handlers "encode only"? |
| Date | 2012-03-11 13:10 -0400 |
| References | <4f5cb8c2$0$29891$c3e8da3$5496439d@news.astraweb.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.570.1331485852.3037.python-list@python.org> |
On 3/11/2012 10:37 AM, Steven D'Aprano wrote:
> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this?
I presume the purpose of both is to make it possible to send Unicode
text over a byte channel when the target encoding is incomplete:
characters that do not fit in the given encoding are replaced by an
ASCII byte sequence that does fit.
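
As a minimal sketch of that encode-side behaviour (the sample string is
made up for illustration and is not from the thread), both handlers
replace characters that ASCII cannot represent with plain ASCII escapes:

```python
# Hypothetical sample text containing U+00E9 and U+20AC,
# neither of which exists in ASCII.
text = "caf\u00e9 \u20ac5"

# xmlcharrefreplace writes each unencodable character as a decimal
# XML numeric character reference.
as_xml = text.encode("ascii", "xmlcharrefreplace")

# backslashreplace writes each unencodable character as a Python-style
# backslash escape.
as_backslash = text.encode("ascii", "backslashreplace")

print(as_xml)        # b'caf&#233; &#8364;5'
print(as_backslash)  # b'caf\\xe9 \\u20ac5'
```

In both cases the escape names a Unicode code point, not a byte of some
other encoding.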
> I don't see why they shouldn't work for decoding as well.
> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?
>
> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> => r'aaa--騷--\xe9\x21--bbb'
This output does not round-trip and would be a bit of a fib since it
somewhat misrepresents what the encoded bytes were:
>>> r'aaa--騷--\xe9\x21--bbb'.encode("cp932")
b'aaa--\xe9z--\\xe9\\x21--bbb'
>>> b'aaa--\xe9z--\\xe9\\x21--bbb'.decode("cp932")
'aaa--騷--\\xe9\\x21--bbb'
Python 3 added surrogateescape error handling to solve this problem.
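
A rough sketch of how surrogateescape handles the same bytes, assuming a
reasonably recent Python 3 (early 3.x releases may not accept this
handler when encoding with the CJK codecs); the variable names are only
illustrative:

```python
raw = b"aaa--\xe9z--\xe9!--bbb"

# Undecodable bytes are smuggled into the str as lone surrogates in the
# range U+DC80..U+DCFF instead of raising UnicodeDecodeError.
text = raw.decode("cp932", "surrogateescape")
print(ascii(text))

# Encoding with the same handler turns those surrogates back into the
# original bytes, so the round trip is lossless.
restored = text.encode("cp932", "surrogateescape")
print(restored == raw)
```

Unlike the backslashreplace output above, the escaped byte cannot be
confused with literal backslash text that was already in the data.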
> and similarly for xmlcharrefreplace.
Since XML character references represent Unicode characters, not bytes,
I do not see how that would work. By analogy, perhaps you
mean to have '&#e9;' in your output instead of '\xe9\x21', but
those would not properly be XML numeric character references.
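
To illustrate the point, here is a small sketch using html.unescape from
the standard library (Python 3.4 and later). The hexadecimal references
below are a hypothetical rendering of the two bytes, chosen only for
illustration:

```python
import html

# On encoding, a reference names a Unicode code point: U+00E9 is 233.
char_ref = "\u00e9".encode("ascii", "xmlcharrefreplace")
print(char_ref)  # b'&#233;'

# Hypothetical "byte-valued" references for b'\xe9!'. Any XML or HTML
# consumer reads them as the characters U+00E9 and U+0021, not as the
# raw bytes 0xE9 0x21 of an undecodable CP-932 sequence.
hypothetical = "&#xe9;&#x21;"
decoded = html.unescape(hypothetical)
print(ascii(decoded))  # '\xe9!'
```

So such output would silently claim that the text contained U+00E9
followed by an exclamation mark.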
--
Terry Jan Reedy