Date: Thu, 20 Jun 2013 12:43:28 +0100
From: MRAB
To: python-list@python.org
Subject: Re: A few questiosn about encoding
Newsgroups: comp.lang.python

On 20/06/2013 07:26, Steven D'Aprano wrote:
> On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:
>
>> On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
>>
>>> Gah! That's twice I've screwed that up. Sorry about that!
>>
>> Yeah, and your difficulty explaining the Unicode implementation reminds
>> me of a passage from the Python zen:
>>
>> "If the implementation is hard to explain, it's a bad idea."
>
> The *implementation* is easy to explain. It's the names of the encodings
> which I get tangled up in.
>
You're off by one below!

> ASCII: Supports exactly 127 code points, each of which takes up exactly 7
> bits. Each code point represents a character.
>
128 codepoints.

> Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and
> about a gazillion other legacy charsets, all of which are mutually
> incompatible: supports anything from 127 to 65535 different code points,
> usually under 256.
>
128 to 65536 codepoints.

> UCS-2: Supports exactly 65535 code points, each of which takes up exactly
> two bytes. That's fewer than required, so it is obsoleted by:
>
65536 codepoints.

etc.

> UTF-16: Supports all 1114111 code points in the Unicode charset, using a
> variable-width system where the most popular characters use exactly two-
> bytes and the remaining ones use a pair of characters.
>
> UCS-4: Supports exactly 4294967295 code points, each of which takes up
> exactly four bytes. That is more than needed for the Unicode charset, so
> this is obsoleted by:
>
> UTF-32: Supports all 1114111 code points, using exactly four bytes each.
> Code points outside of the range 0 through 1114111 inclusive are an error.
>
> UTF-8: Supports all 1114111 code points, using a variable-width system
> where popular ASCII characters require 1 byte, and others use 2, 3 or 4
> bytes as needed.
>
>
> Ignoring the legacy charsets, only UTF-16 is a terribly complicated
> implementation, due to the surrogate pairs. But even that is not too bad.
> The real complication comes from the interactions between systems which
> use different encodings, and that's nothing to do with Unicode.
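
For anyone who wants to check the counts rather than take them on trust, here is a rough sketch. It assumes Python 3.3 or later, and the sample characters are arbitrary picks for illustration, not anything from the thread. The off-by-one comes from the ranges starting at 0, so a range 0..N holds N + 1 code points.

```python
# Ranges start at 0, so the count is always "highest code point + 1".
ascii_count = 2 ** 7              # code points 0..127
ucs2_count = 2 ** 16              # code points 0..65535
unicode_count = 0x10FFFF + 1      # code points 0..1114111

# Every 7-bit value decodes as ASCII; every 8-bit value decodes as Latin-1.
ascii_text = bytes(range(128)).decode("ascii")
latin1_text = bytes(range(256)).decode("latin-1")

# Sample characters (arbitrary choices): ASCII, Latin-1 range,
# rest of the BMP, and one outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Bytes per character in each encoding. The "-le" variants are used
# so that no byte order mark is added to the output.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]
utf32_lengths = [len(ch.encode("utf-32-le")) for ch in samples]

print(ascii_count, ucs2_count, unicode_count)
print(utf8_lengths, utf16_lengths, utf32_lengths)
```

The last sample takes four bytes in UTF-16 because it is stored as a surrogate pair, which is the variable-width part the quoted text describes.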