Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!ecngs!feeder2.ecngs.de!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Thu, 20 Jun 2013 12:43:28 +0100
From: MRAB <python@mrabarnett.plus.com>
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: A few questiosn about encoding
References: <6dfa3707-80f4-407a-a109-66dbb0130513@googlegroups.com> <mailman.2923.1370797972.3114.python-list@python.org> <kp9drh$1o0t$1@news.ntua.gr> <51b83e5a$0$29998$c3e8da3$5496439d@news.astraweb.com> <kp9lo6$9l5$2@news.ntua.gr> <51b90ead$0$29997$c3e8da3$5496439d@news.astraweb.com> <kpbnmg$qvk$2@news.ntua.gr> <51b9708b$0$29872$c3e8da3$5496439d@news.astraweb.com> <77ba6b16-4b1d-47a6-9b9b-5af45335c4fe@googlegroups.com> <51c2a089$0$29973$c3e8da3$5496439d@news.astraweb.com>
In-Reply-To: <51c2a089$0$29973$c3e8da3$5496439d@news.astraweb.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Reply-To: python-list@python.org
Newsgroups: comp.lang.python
Message-ID: <mailman.3620.1371728614.3114.python-list@python.org>
Lines: 59
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:48785

On 20/06/2013 07:26, Steven D'Aprano wrote:
> On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:
>
>> On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
>>
>>> Gah! That's twice I've screwed that up. Sorry about that!
>>
>> Yeah, and your difficulty explaining the Unicode implementation reminds
>> me of a passage from the Python zen:
>>
>>  "If the implementation is hard to explain, it's a bad idea."
>
> The *implementation* is easy to explain. It's the names of the encodings
> which I get tangled up in.
>
You're off by one below!
>
> ASCII: Supports exactly 127 code points, each of which takes up exactly 7
> bits. Each code point represents a character.
>
128 codepoints.

> Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
> about a gazillion other legacy charsets, all of which are mutually
> incompatible: supports anything from 127 to 65535 different code points,
> usually under 256.
>
128 to 65536 codepoints.

> UCS-2: Supports exactly 65535 code points, each of which takes up exactly
> two bytes. That's fewer than required, so it is obsoleted by:
>
65536 codepoints.

etc.

> UTF-16: Supports all 1114111 code points in the Unicode charset, using a
> variable-width system where the most popular characters use exactly two-
> bytes and the remaining ones use a pair of characters.
>
> UCS-4: Supports exactly 4294967295 code points, each of which takes up
> exactly four bytes. That is more than needed for the Unicode charset, so
> this is obsoleted by:
>
> UTF-32: Supports all 1114111 code points, using exactly four bytes each.
> Code points outside of the range 0 through 1114111 inclusive are an error.
>
> UTF-8: Supports all 1114111 code points, using a variable-width system
> where popular ASCII characters require 1 byte, and others use 2, 3 or 4
> bytes as needed.
>
>
> Ignoring the legacy charsets, only UTF-16 is a terribly complicated
> implementation, due to the surrogate pairs. But even that is not too bad.
> The real complication comes from the interactions between systems which
> use different encodings, and that's nothing to do with Unicode.
>
>