From: Ian Kelly
Date: Fri, 16 Nov 2012 15:33:59 -0700
Subject: Re: latin1 and cp1252 inconsistent?
Newsgroups: comp.lang.python

On Fri, Nov 16, 2012 at 2:44 PM, wrote:
> Latin1 has a block of 32 undefined characters.

These characters are not undefined. 0x80-0x9f are the C1 control codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and their Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345

> Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

In CP 1252, these codes are actually undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx

> Also, the html5 standard says:
>
> When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
>
>
> The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.

You can use a non-strict error handling scheme to prevent the error.

>>> b'hello \x81 world'.decode('cp1252')
Traceback (most recent call last):
  File "", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>
>>> b'hello \x81 world'.decode('cp1252', 'replace')
'hello \ufffd world'
>>> b'hello \x81 world'.decode('cp1252', 'ignore')
'hello  world'
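
If you want cp1252 to behave like latin1 for just those five bytes, rather than replacing or dropping them, one option is a custom error handler registered with codecs.register_error. What follows is only a rough sketch: the handler name 'c1fallback' and the function c1_fallback are made up for illustration and are not part of the standard library. It assumes the only undecodable bytes are the ones cp1252 leaves undefined.

```python
import codecs

# The five bytes the original poster listed as undefined in cp1252.
UNDEFINED_IN_CP1252 = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])


def c1_fallback(exc):
    """Hypothetical handler: decode the offending bytes as latin1 instead.

    latin1 maps every byte to the code point with the same value, so
    0x81 becomes U+0081 (a C1 control code) rather than raising.
    """
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Return the replacement text and the position to resume decoding at.
        return bad.decode('latin1'), exc.end
    # This sketch only covers decoding; re-raise anything else.
    raise exc


# 'c1fallback' is an arbitrary name chosen for this example.
codecs.register_error('c1fallback', c1_fallback)

data = b'hello \x81 world'
as_latin1 = data.decode('latin1')
as_cp1252 = data.decode('cp1252', 'c1fallback')

print(repr(as_latin1))
print(repr(as_cp1252))

# Bytes that cp1252 does define are unaffected by the handler.
euro = b'\x80'.decode('cp1252', 'c1fallback')
print(repr(euro))
```

With this handler the two decodes of `data` should agree, while a byte such as 0x80 still gets its cp1252 meaning (the euro sign) rather than the latin1 one. Whether passing C1 control codes through is what you actually want depends on what consumes the text afterwards.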