MIME-Version: 1.0
In-Reply-To: <51C33918.6060102@mrabarnett.plus.com>
References: <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com> <mailman.2923.1370797972.3114.python-list@python.org> <kp9drh$1o0t$1@news.ntua.gr> <51b83e5a$0$29998$c3e8da3$5496439d@news.astraweb.com> <kp9lo6$9l5$2@news.ntua.gr> <51b90ead$0$29997$c3e8da3$5496439d@news.astraweb.com> <kpbnmg$qvk$2@news.ntua.gr> <51b9708b$0$29872$c3e8da3$5496439d@news.astraweb.com> <77ba6b16-4b1d-47a6-9b9b-5af45335c4fe@googlegroups.com> <51c2a089$0$29973$c3e8da3$5496439d@news.astraweb.com> <mailman.3620.1371728614.3114.python-list@python.org> <114200cf-2d46-46cb-bb5f-7c5f8ab98a66@googlegroups.com> <CAPTjJmq_KNNnsdSbDTJC5GQrdjk8QPEHA-8B6O_drhvKVSQ5Xg@mail.gmail.com> <51C33918.6060102@mrabarnett.plus.com>
Date: Fri, 21 Jun 2013 03:21:39 +1000
Subject: Re: A few questiosn about encoding
From: Chris Angelico <rosuav@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=ISO-8859-1
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3635.1371748902.3114.python-list@python.org>
Lines: 31
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:48813

On Fri, Jun 21, 2013 at 3:17 AM, MRAB <python@mrabarnett.plus.com> wrote:
> On 20/06/2013 17:37, Chris Angelico wrote:
>>
>> On Fri, Jun 21, 2013 at 2:27 AM,  <wxjmfauth@gmail.com> wrote:
>>>
>>> And all these coding schemes have something in common:
>>> they all work with a unique set of code points, more
>>> precisely a unique set of encoded code points (not
>>> the set of implemented code points (bytes)).
>>>
>>> That is just what the flexible string representation is not
>>> doing: it artificially divides Unicode into subsets and tries
>>> to handle each subset differently.
>>>
>>
>>
>> UTF-16 divides Unicode into two subsets: BMP characters (encoded using
>> one 16-bit unit) and astral characters (encoded using two 16-bit units
>> in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
>> builds are guilty of exactly the same crime as the hated 3.3.
>>
> UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
> bytes, and those who previously used ASCII still need only 1 byte per
> codepoint!

Yes, but there's never (AFAIK) been a Python implementation that
represents strings internally in UTF-8. UTF-16 (the "narrow" build)
was one of two build-time options for Python 2.2 through 3.2, the
other being the UCS-4 "wide" build, and it is the one that jmf always
seems to be measuring against.
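
For anyone who wants to see the subsets for themselves, here is a
rough sketch (assuming Python 3; the helper names utf16_units and
utf8_bytes are my own, not anything from the stdlib) that counts how
many code units each encoding needs for a few sample code points:

```python
# Rough sketch: count the code units each encoding needs per code point.
# utf16_units and utf8_bytes are hypothetical helper names.

def utf16_units(ch):
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids the BOM.
    return len(ch.encode("utf-16-le")) // 2

def utf8_bytes(ch):
    # UTF-8 code units are single bytes.
    return len(ch.encode("utf-8"))

# One ASCII, one Latin-1, one other BMP, and one astral code point.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    cp = ord(ch) if len(ch) == 1 else 0x1F600  # narrow builds see a pair
    print("U+%04X  UTF-16 units: %d  UTF-8 bytes: %d"
          % (cp, utf16_units(ch), utf8_bytes(ch)))
```

The BMP samples take one UTF-16 unit each and the astral one takes
two, while UTF-8 steps through 1, 2, 3, and 4 bytes. Both encodings
split Unicode into ranges and handle each range differently.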

ChrisA