From: Ian Kelly
Date: Sat, 17 Nov 2012 11:08:49 -0700
Subject: Re: latin1 and cp1252 inconsistent?
Newsgroups: comp.lang.python

On Sat, Nov 17, 2012 at 9:56 AM, wrote:
> "should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I see this as a sign that a small refactorization of cp1252 is in order.
> The proposal is to change those "UNDEFINED" entries to "" entries, as is done here:
>
> http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt
>
> and here:
>
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

The README for the "BestFit" document states:

"""
These tables include "best fit" behavior which is not present in the other files. Examples of best fit are converting fullwidth letters to their counterparts when converting to single byte code pages, and mapping the Infinity character to the number 8.
"""

This does not sound like appropriate behavior for a generalized conversion scheme. It is also noted at the following URL that the "BestFit" document is not authoritative:

http://www.iana.org/assignments/charset-reg/windows-1252

> This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
>
>> There are 65 code points set aside in the Unicode Standard for compatibility with the C0
>> and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
>> points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
>> controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
>> respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
>> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
>> equal to its corresponding Unicode code point.
>
> IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the unicode-point of equal value.
>
> This is exactly the section which allows latin1 to decode 0x81 to U+81, even though ISO-8859-1 explicitly does not define semantics for that byte (6.2 ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)

But Latin-1 explicitly defers to the control codes for those characters.
CP-1252 does not; the reason those characters are left undefined is to allow for future expansion, such as when Microsoft added the Euro sign at 0x80.

Since we're talking about conversion from bytes to Unicode, I think the most authoritative source we could possibly reference would be the official ISO 10646 conversion tables for the character sets in question. I understand those are to be found here:

http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

and here:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Note that the ISO-8859-1 mapping defines the C0 and C1 codes, whereas the cp1252 mapping leaves five codes (0x81, 0x8D, 0x8F, 0x90, and 0x9D) undefined. This would seem to indicate that Python is correctly decoding CP-1252 according to the Unicode standard.
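
For anyone who wants to check the difference locally, here is a minimal sketch, assuming Python 3 and the stock "latin-1" and "cp1252" codecs. The helper name undefined_bytes is made up for this illustration and is not part of any library. It simply tries to decode each byte in the 0x80..0x9F range and collects the ones the codec rejects.

```python
def undefined_bytes(encoding):
    """Return the byte values in 0x80..0x9F that `encoding` cannot decode.

    Hypothetical helper for illustration only; assumes Python 3.
    """
    missing = []
    for value in range(0x80, 0xA0):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            # The codec's mapping table has no entry for this byte.
            missing.append(value)
    return missing


# cp1252 rejects the bytes its mapping table leaves undefined.
print([hex(b) for b in undefined_bytes("cp1252")])

# latin-1 decodes every byte in the range to the code point of equal value.
print([hex(b) for b in undefined_bytes("latin-1")])

# 0x80 is defined in cp1252: it is the Euro sign Microsoft added.
print(bytes([0x80]).decode("cp1252") == "\u20ac")
```

If you need the latin1-style fallback for such bytes, decoding with "latin-1" outright or passing an errors argument such as errors="replace" to decode() are options within the existing codec machinery. Which one is appropriate depends on the data, so treat this as a starting point rather than a recommendation.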