Path: csiph.com!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder1.enfer-du-nord.net!newsfeed.eweka.nl!eweka.nl!feeder3.eweka.nl!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <f063ebaf-89ee-4558-a762-0241efa39dcc@googlegroups.com>
References: <f063ebaf-89ee-4558-a762-0241efa39dcc@googlegroups.com>
From: Ian Kelly <ian.g.kelly@gmail.com>
Date: Fri, 16 Nov 2012 15:33:59 -0700
Subject: Re: latin1 and cp1252 inconsistent?
To: Python <python-list@python.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3762.1353105272.27098.python-list@python.org>
Lines: 46
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:33454

On Fri, Nov 16, 2012 at 2:44 PM,  <buck@yelp.com> wrote:
> Latin1 has a block of 32 undefined characters.

These characters are not undefined.  0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345
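
A quick way to check this from Python (a small sketch; the variable names are just for illustration) is to decode the whole 0x80-0x9f range as Latin-1 and look at what comes back:

```python
import unicodedata

# Decode every byte in the 0x80-0x9f range with Latin-1.
c1_bytes = bytes(range(0x80, 0xA0))
decoded = c1_bytes.decode('latin-1')

# Each byte maps to the code point with the same number (U+0080-U+009F).
same_code_points = [ord(ch) for ch in decoded] == list(range(0x80, 0xA0))

# Collect the Unicode general categories of the decoded characters.
categories = {unicodedata.category(ch) for ch in decoded}

print(same_code_points, categories)  # True {'Cc'}
```

Every byte decodes without error, and every resulting character is in category Cc (control), which is what you would expect for the C1 block.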

> Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

In CP 1252, these codes are actually undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx
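
You can confirm which bytes Python's cp1252 codec rejects with a short loop. This is only a sketch that probes the codec as shipped with CPython, not a statement about what Windows itself does with these bytes:

```python
# Try each byte in 0x80-0x9f and record the ones cp1252 refuses to decode.
undefined = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode('cp1252')
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])
# ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```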

> Also, the html5 standard says:
>
> When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
>
>
> The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.

You can avoid the error by passing a non-strict error handler, such as
'replace' or 'ignore', as the second argument to decode().

>>> b'hello \x81 world'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>

>>> b'hello \x81 world'.decode('cp1252', 'replace')
'hello \ufffd world'
>>> b'hello \x81 world'.decode('cp1252', 'ignore')
'hello  world'
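
If neither 'replace' nor 'ignore' fits because you want the five undefined bytes to come through the way Latin-1 would decode them, one option is to register your own error handler. The following is only a sketch: the handler name 'latin1_fallback' is made up for this example and is not something the standard library provides.

```python
import codecs

def latin1_fallback(exc):
    """Decode the bytes that cp1252 rejected as if they were Latin-1."""
    if not isinstance(exc, UnicodeDecodeError):
        # This handler only makes sense for decoding; re-raise otherwise.
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return bad.decode('latin-1'), exc.end

# Hypothetical handler name, chosen for this example.
codecs.register_error('latin1_fallback', latin1_fallback)

text = b'hello \x81 world'.decode('cp1252', 'latin1_fallback')
print(ascii(text))  # 'hello \x81 world'
```

Bytes that cp1252 does define are unaffected, so b'\x80' still decodes to the euro sign; only the undefined ones fall through to the C1 control characters.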