| Date | 2012-03-11 17:10 +0100 |
|---|---|
| From | Walter Dörwald <walter@livinglogic.de> |
| Organization | LivingLogic AG, Bayreuth/Germany |
| Subject | Re: Why are some unicode error handlers "encode only"? |
| References | <4f5cb8c2$0$29891$c3e8da3$5496439d@news.astraweb.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.569.1331484185.3037.python-list@python.org> |
On 11.03.12 15:37, Steven D'Aprano wrote:
> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this? I don't see why they shouldn't work for decoding as well.
Because xmlcharrefreplace and backslashreplace are *error* handlers.
However, the byte sequence b'&#12345;' does *not* contain any bytes that
are undecodable by, e.g., the ASCII codec. So there are no errors to
handle.
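
A minimal sketch of this point, assuming U+3039 as the sample character (its code point is 12345): encoding hits an unencodable character and calls the handler, while decoding the result never hits an error at all.

```python
# U+3039 has no ASCII representation, so encoding invokes the error handler.
text = "\u3039"
encoded = text.encode("ascii", "xmlcharrefreplace")
print(encoded)  # b'&#12345;'

# Every byte of the result is plain ASCII, so decoding succeeds without
# ever consulting an error handler. The character reference stays literal.
decoded = encoded.decode("ascii")
print(decoded)  # &#12345;
```

The decoded string is the eight-character reference, not the original character, because nothing in the byte stream was an error.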
> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?
The byte sequence b'\xe9!', however, is not something that the
backslashreplace error handler would have produced. It would have
produced b'\\xe9!' (a sequence of 5 bytes), and that would probably
decode without any problems with the cp932 codec.
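
As a rough illustration, assuming the text "é!" is encoded to ASCII (my choice of target codec, not something from the original question), the handler's output is five ordinary ASCII bytes, which cp932 decodes without complaint:

```python
# U+00E9 cannot be encoded as ASCII, so backslashreplace substitutes an escape.
escaped = "\xe9!".encode("ascii", "backslashreplace")
print(escaped)       # b'\\xe9!'
print(len(escaped))  # 5 bytes: backslash, 'x', 'e', '9', '!'

# All five bytes are below 0x80, so cp932 decodes them as-is.
decoded = escaped.decode("cp932")
print(decoded)       # \xe9!
```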
> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> => r'aaa--騷--\xe9\x21--bbb'
>
> and similarly for xmlcharrefreplace.
This would require a postprocess step *after* the bytes have been
decoded. This is IMHO out of scope for Python's codec machinery.
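
A sketch of what such a postprocess step could look like, done by hand outside the codec. The helper name `unescape_hex` is hypothetical and it only handles two-digit `\xNN` escapes:

```python
import re

def unescape_hex(text):
    """Hypothetical helper: turn literal \\xNN escapes back into characters."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

raw = "caf\xe9".encode("ascii", "backslashreplace")  # b'caf\\xe9'
decoded = raw.decode("ascii")      # escape is still literal text here
restored = unescape_hex(decoded)   # second pass over the decoded string
print(restored)  # café
```

Note that this second pass cannot tell an escape written by the error handler from a backslash sequence that was already in the text, so it is not a reliable inverse.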
Servus,
Walter