Newsgroups: comp.lang.python
Date: Fri, 16 Nov 2012 15:27:54 -0800 (PST)
Subject: Re: latin1 and cp1252 inconsistent?
From: buck@yelp.com

On Friday, November 16, 2012 2:34:32 PM UTC-8, Ian wrote:
> On Fri, Nov 16, 2012 at 2:44 PM, wrote:
>
> > Latin1 has a block of 32 undefined characters.
>
> These characters are not undefined. 0x80-0x9f are the C1 control
> codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
> their Unicode mappings are well defined.

They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

""" The shaded positions in the code table correspond to bit combinations that do not represent graphic characters.
Their use is outside the scope of ISO/IEC 8859; it is specified in other International Standards, for example ISO/IEC 6429.

However, it's reasonable for 0x81 to decode to U+0081 because the Unicode standard says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

""" The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.

> You can use a non-strict error handling scheme to prevent the error.
> >>> b'hello \x81 world'.decode('cp1252', 'replace')
> 'hello \ufffd world'

This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.
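
To make the difference concrete, here is a minimal sketch (Python 3, standard codecs only) of the behaviour discussed above: latin-1 round-trips the byte 0x81, cp1252 rejects it in strict mode, and 'replace' discards it. The last part uses the 'surrogateescape' error handler, which is not mentioned anywhere in this thread; I include it only as one possible reversible alternative, on the assumption that carrying a lone surrogate in the decoded string is acceptable. The variable names are illustrative.

```python
raw = b'hello \x81 world'

# latin-1 maps every byte to the code point with the same value,
# so 0x81 becomes U+0081 and encoding gives the original bytes back.
as_latin1 = raw.decode('latin-1')
latin1_round_trip = as_latin1.encode('latin-1')

# Python's cp1252 codec leaves 0x81 unassigned, so strict decoding fails.
try:
    raw.decode('cp1252')
    cp1252_strict_error = None
except UnicodeDecodeError as exc:
    cp1252_strict_error = exc

# 'replace' substitutes U+FFFD for the bad byte; the original value is gone,
# and encoding back (again with 'replace') cannot restore it.
replaced = raw.decode('cp1252', 'replace')
lossy_round_trip = replaced.encode('cp1252', 'replace')

# Assumption, not from the thread: 'surrogateescape' smuggles the
# undecodable byte through as a lone surrogate and restores it on encode.
escaped = raw.decode('cp1252', 'surrogateescape')
escaped_round_trip = escaped.encode('cp1252', 'surrogateescape')

print(latin1_round_trip == raw)   # latin-1 is reversible
print(lossy_round_trip == raw)    # 'replace' is not
print(escaped_round_trip == raw)  # 'surrogateescape' is reversible
```

Note that the surrogateescape result is not valid text for most purposes (it cannot be encoded as UTF-8 without the same handler), so whether it fits depends on what the application does with the decoded string.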