Re: latin1 and cp1252 inconsistent?

From Ian Kelly <ian.g.kelly@gmail.com>
Date Fri, 16 Nov 2012 17:20:24 -0700


On Fri, Nov 16, 2012 at 4:27 PM,  <buck@yelp.com> wrote:
> They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf
>
> """ The shaded positions in the code table correspond
>     to bit combinations that do not represent graphic
>     characters. Their use is outside the scope of
>     ISO/IEC 8859; it is specified in other International
>     Standards, for example ISO/IEC 6429.

It gets murkier than that.  I don't want to spend time hunting down
the relevant documents, so I'll just quote from Wikipedia:

"""
In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the
extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on
the Internet. This map assigns the C0 and C1 control characters to the
unassigned code values thus provides for 256 characters via every
possible 8-bit value.
"""

http://en.wikipedia.org/wiki/ISO/IEC_8859-1#History
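
The practical effect of that registration can be checked from Python 3. This is only a sketch: the helper name undefined_bytes is made up for illustration, and the results describe the codec tables shipped with CPython, not the standards themselves.

```python
def undefined_bytes(encoding):
    """Return the byte values that the codec rejects in strict mode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing

latin1_missing = undefined_bytes('latin-1')
cp1252_missing = undefined_bytes('cp1252')

print(latin1_missing)                    # []
print([hex(b) for b in cp1252_missing])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

# Because latin-1 covers every byte value, it round-trips arbitrary bytes.
every_byte = bytes(range(256))
round_trip = every_byte.decode('latin-1').encode('latin-1')
```

So Python's latin-1 codec behaves like the IANA character map (all 256 values assigned), while its cp1252 codec leaves five positions undefined.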

>> You can use a non-strict error handling scheme to prevent the error.
>> >>> b'hello \x81 world'.decode('cp1252', 'replace')
>> 'hello \ufffd world'
>
> This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.
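
To make the reversibility concern concrete, here is a small sketch comparing 'replace' with the standard 'surrogateescape' error handler. The thread does not propose 'surrogateescape'; it is shown here as one possible option, assuming Python 3.

```python
raw = b'hello \x81 world'

# 'replace' substitutes U+FFFD on decode; the original byte value is gone.
lossy = raw.decode('cp1252', 'replace')
lossy_back = lossy.encode('cp1252', 'replace')   # U+FFFD becomes b'?'

# 'surrogateescape' smuggles the undecodable byte through as a lone
# surrogate (U+DC81 here) and restores it on encode.
kept = raw.decode('cp1252', 'surrogateescape')
kept_back = kept.encode('cp1252', 'surrogateescape')

print(lossy_back)    # b'hello ? world'
print(ascii(kept))   # 'hello \udc81 world'
print(kept_back == raw)
```

The caveat is that a string holding lone surrogates is not ordinary text: it cannot be encoded to UTF-8 in strict mode, so it is only suitable for carrying the bytes through to a later re-encode.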

Well, what characters would you have these bytes decode to,
considering that they're undefined?  If the string is really CP-1252,
then the presence of undefined bytes in the document does not
signify "data".  They're just junk bytes, possibly indicative of data
corruption.  If on the other hand the string is really Latin-1, and
you *know* that it is Latin-1, then you should probably forget the
aliasing recommendation and just decode it as Latin-1.
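
The two cases above could be sketched like this. The names CP1252_UNDEFINED, find_undefined and decode_document, and the is_latin1 flag, are hypothetical and only illustrate the decision; they are not part of any library.

```python
# The five byte values that Python's cp1252 codec leaves undefined.
CP1252_UNDEFINED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})

def find_undefined(data):
    """Return (offset, byte value) pairs for bytes undefined in CP-1252."""
    return [(i, b) for i, b in enumerate(data) if b in CP1252_UNDEFINED]

def decode_document(data, is_latin1):
    """Decode as Latin-1 when that is known; otherwise treat as CP-1252."""
    if is_latin1:
        # Latin-1 assigns every byte, so this never fails.
        return data.decode('latin-1')
    junk = find_undefined(data)
    if junk:
        # Report the junk bytes instead of silently guessing a meaning.
        raise ValueError('bytes undefined in CP-1252 at offsets %r'
                         % [offset for offset, _ in junk])
    return data.decode('cp1252')

print(find_undefined(b'hello \x81 world'))   # [(6, 129)]
print(decode_document(b'price: \x80 5', is_latin1=False))
```

Raising on the undefined bytes keeps the possible corruption visible; whether to raise, log, or fall back is a policy choice for the application.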

Apparently this aliasing of Latin-1 to CP-1252 is already commonly
performed by modern user agents.  What do IE and Firefox do when
presented with a document declared as Latin-1 that contains bytes
undefined in CP-1252?
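
For reference, my understanding is that the WHATWG Encoding Standard's windows-1252 table, which current browsers follow, maps the five undefined bytes to the C1 control characters with the same values. That behaviour can be approximated in Python with a custom error handler. This is a sketch under that assumption: the handler name 'c1-passthrough' is invented here, and Python has no built-in codec that does this.

```python
import codecs

# The five byte values that Python's cp1252 codec leaves undefined.
UNDEFINED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})

def c1_passthrough(exc):
    """Map bytes undefined in cp1252 to the same code point, and back."""
    bad = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeDecodeError):
        # Assumption: mirror the browser mapping, byte 0x81 -> U+0081, etc.
        return bad.decode('latin-1'), exc.end
    if isinstance(exc, UnicodeEncodeError) and all(ord(ch) in UNDEFINED
                                                   for ch in bad):
        # Reverse direction: only the five pass-through characters qualify.
        return bad.encode('latin-1'), exc.end
    raise exc

codecs.register_error('c1-passthrough', c1_passthrough)

raw = bytes(range(256))
text = raw.decode('cp1252', 'c1-passthrough')
restored = text.encode('cp1252', 'c1-passthrough')
print(restored == raw)   # True
```

Unlike 'replace', this decoding is reversible for every byte value, and the result contains only ordinary code points (C1 controls) instead of lone surrogates.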

