


Re: latin1 and cp1252 inconsistent?

From Ian Kelly <ian.g.kelly@gmail.com>
Date Sat, 17 Nov 2012 11:08:49 -0700
Subject Re: latin1 and cp1252 inconsistent?
To Python <python-list@python.org>
Newsgroups comp.lang.python
Message-ID <mailman.3775.1353175767.27098.python-list@python.org> (permalink)



On Sat, Nov 17, 2012 at 9:56 AM,  <buck@yelp.com> wrote:
> "should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I see this as a sign that a small refactorization of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:
>
> http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt
>
> and here:
>
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

The README for the "BestFit" document states:

"""
These tables include "best fit" behavior which is not present in the
other files. Examples of best fit are converting fullwidth letters to
their counterparts when converting to single byte code pages, and
mapping the Infinity character to the number 8.
"""

This does not sound like appropriate behavior for a generalized
conversion scheme.  The IANA registration for windows-1252 also notes
that the "BestFit" document is not authoritative:

http://www.iana.org/assignments/charset-reg/windows-1252
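To illustrate why "best fit" is a different thing from a strict conversion, here is a minimal sketch, assuming Python 3 and its built-in cp1252 codec. The helper name strict_encode_fails is hypothetical, and the NFKC step is my own stand-in for an explicit, opt-in lossy conversion; it is not the BestFit table itself.

```python
import unicodedata


def strict_encode_fails(text):
    """Return True if the built-in cp1252 codec refuses to encode `text`."""
    try:
        text.encode("cp1252")
        return False
    except UnicodeEncodeError:
        return True


infinity = "\u221e"      # INFINITY, which BestFit would map to the digit 8
fullwidth_a = "\uff21"   # FULLWIDTH LATIN CAPITAL LETTER A

# The strict codec has no entry for either character, so both are rejected
# instead of being silently replaced by a look-alike.
infinity_rejected = strict_encode_fails(infinity)
fullwidth_rejected = strict_encode_fails(fullwidth_a)

# A lossy conversion can still be requested explicitly.  NFKC compatibility
# normalization (an assumption here, not the BestFit mapping) folds the
# fullwidth letter to its ASCII counterpart before encoding.
folded = unicodedata.normalize("NFKC", fullwidth_a).encode("cp1252")

print(infinity_rejected, fullwidth_rejected, folded)
```

The point of the sketch is only that the lossy step is something the caller asks for, rather than something the codec does on its own.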


> This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
>
>> There are 65 code points set aside in the Unicode Standard for compatibility with the C0
>> and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
>> points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
>> controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
>> respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
>> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
>> equal to its corresponding Unicode code point.
>
> IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the unicode-point of equal value.
>
> This is exactly the section which allows latin1 to decode 0x81 to U+81, even though ISO-8859-1 explicitly does not define semantics for that byte (6.2 ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)

But Latin-1 explicitly defers to the control codes for those
characters.  CP-1252 does not; the reason those characters are left
undefined is to allow for future expansion, such as when Microsoft
added the Euro sign at 0x80.
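A minimal sketch of that difference, assuming Python 3 and its built-in latin-1 and cp1252 codecs (the variable names are only illustrative):

```python
# Byte 0x81 is a C1 control code under Latin-1 but is unassigned in CP-1252.
raw = b"\x81"

latin1_result = raw.decode("latin-1")   # decodes to U+0081

try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Byte 0x80 is the position where Microsoft added the Euro sign.
euro = b"\x80".decode("cp1252")

print(repr(latin1_result), cp1252_failed, euro)
```

If the unassigned bytes were redefined as control codes, a later assignment like the Euro sign would change the meaning of data that had already decoded successfully.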

Since we're talking about conversion from bytes to Unicode, I think
the most authoritative source we could possibly reference would be the
official ISO 10646 conversion tables for the character sets in
question.  I understand those are to be found here:

http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

and here:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Note that the ISO-8859-1 mapping defines the C0 and C1 codes, whereas
the cp1252 mapping leaves five codes (0x81, 0x8D, 0x8F, 0x90 and 0x9D)
undefined.  This would seem to indicate that Python is correctly
decoding CP-1252 according to the Unicode standard.
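One way to check this against the codecs Python actually ships is to try every byte value in turn. This is a rough sketch, assuming Python 3; the function name undefined_bytes is hypothetical.

```python
def undefined_bytes(encoding):
    """Return the byte values that `encoding` refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


cp1252_gaps = undefined_bytes("cp1252")
latin1_gaps = undefined_bytes("latin-1")

# Latin-1 maps every byte to the code point of equal numeric value.
latin1_is_identity = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)

print([hex(value) for value in cp1252_gaps])
print(latin1_gaps, latin1_is_identity)
```

The bytes reported for cp1252 should line up with the entries marked UNDEFINED in CP1252.TXT, while Latin-1 reports no gaps at all.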
