Re: latin1 and cp1252 inconsistent?

From Ian Kelly <ian.g.kelly@gmail.com>
Date Fri, 16 Nov 2012 15:33:59 -0700
Subject Re: latin1 and cp1252 inconsistent?
To Python <python-list@python.org>
Newsgroups comp.lang.python
Message-ID <mailman.3762.1353105272.27098.python-list@python.org>



On Fri, Nov 16, 2012 at 2:44 PM,  <buck@yelp.com> wrote:
> Latin1 has a block of 32 undefined characters.

These characters are not undefined.  0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345
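
As a quick check of the point above, here is a minimal sketch that decodes
all 32 bytes as Latin-1 and asks the standard unicodedata module how Unicode
classifies the results. The variable names are only for illustration.

```python
import unicodedata

# Decode every byte in the 0x80-0x9f range as Latin-1.
c1_bytes = bytes(range(0x80, 0xA0))
decoded = c1_bytes.decode('latin-1')

# Latin-1 maps each byte to the code point with the same numeric value.
code_points = [ord(ch) for ch in decoded]

# Unicode general category 'Cc' means "control character".
categories = {unicodedata.category(ch) for ch in decoded}

print(len(decoded), hex(code_points[0]), hex(code_points[-1]), categories)
```

None of the 32 bytes raises an error, and every one lands in the control
category rather than being unassigned.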

> Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

In CP 1252, these codes are actually undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx
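
To see how Python's own cp1252 codec treats these bytes, a small sketch like
the following tries each byte value in turn under the default strict error
handling and collects the ones that fail. The list name is just illustrative.

```python
# Collect every single-byte value that the cp1252 codec refuses to decode.
undefined = []
for value in range(0x100):
    try:
        bytes([value]).decode('cp1252')
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])
```

On the Python versions I am aware of, this reports the same five bytes listed
in the quoted text.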

> Also, the html5 standard says:
>
> When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
>
>
> The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.

You can use a non-strict error handling scheme to prevent the error.

>>> b'hello \x81 world'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>

>>> b'hello \x81 world'.decode('cp1252', 'replace')
'hello \ufffd world'
>>> b'hello \x81 world'.decode('cp1252', 'ignore')
'hello  world'
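
If neither replacing nor dropping the byte is acceptable, another option is to
register a custom error handler with codecs.register_error. The sketch below
is only one possible approach: the handler name 'latin1_fallback' is made up
for this example, and it assumes you want the undefined bytes to come through
as the same C1 control characters that Latin-1 would give you.

```python
import codecs

def latin1_fallback(exc):
    # Hypothetical handler: only decoding errors are handled here.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # Decode just the offending bytes as Latin-1, then resume after them.
    bad = exc.object[exc.start:exc.end]
    return bad.decode('latin-1'), exc.end

# The name is arbitrary; it is what gets passed as the errors argument.
codecs.register_error('latin1_fallback', latin1_fallback)

data = b'hello \x81 world \x93quoted\x94'
text = data.decode('cp1252', 'latin1_fallback')
print(ascii(text))
```

Bytes that cp1252 does define, such as the curly quotes at 0x93 and 0x94,
still decode normally; only the five undefined bytes go through the handler.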
