From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Python 3.2 has some deadly infection
Date: Sat, 7 Jun 2014 01:50:50 +1000

On Sat, Jun 7, 2014 at 1:32 AM, Marko Rauhamaa wrote:
> Michael Torrie:
>
>> On 06/06/2014 08:10 AM, Marko Rauhamaa wrote:
>>> Ethan Furman:
>>>> ASCII is *not* the state of "this string has no encoding" -- that
>>>> would be Unicode; a Unicode string, as a data type, has no encoding.
>>>
>>> Huh?
>>
>> [...]
>>
>> What part of his statement are you saying "Huh?" about?
>
> Unicode, like ASCII, is a code. Representing text in unicode is
> encoding.

Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high bit left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and only the second of them is an encoding.

"Unicode", on the other hand, is only the first part. It maps all the
same characters to the same numbers that ASCII does, and then adds a
few more... a few followed by a few, followed by... okay, quite a lot
more. Unicode specifies that the character OK HAND SIGN, which looks
like 👌 if you have the right font, is number 1F44C in hex (128076
decimal). This is the "Universal Character Set" or UCS.

ASCII could specify a single encoding, because that encoding makes
sense for nearly all purposes. (There are times when you transmit
ASCII text and use the high bit to mean something else, like parity or
"this is the end of a word" or something, but even then, you follow
the same convention of packing a number into the low seven bits of a
byte.) Unicode can't, because there are many different pros and cons
to the different encodings, and so we have UCS Transformation Formats
like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint
to a sequence of bytes.

You can't represent text in "Unicode" in a computer.
Somewhere along the way, you have to figure out how to store those
codepoints as bytes, or something more concrete (you could, for
instance, use a Python list of Python integers; I can't say that it
would be in any way more efficient than alternatives, but it would be
plausible); and that's the encoding.

ChrisA
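
P.S. For anyone who wants to poke at the distinction in the
interpreter, here is a minimal sketch, assuming Python 3 and only the
standard library. The variable names (ok_hand, utf8, utf32,
codepoints, roundtrip) are made up for illustration. It shows the
mapping from characters to numbers first, and only then the step that
turns those numbers into bytes.

```python
import unicodedata

# Part one: the mapping from characters to numbers. No bytes yet.
assert ord("A") == 65
assert ord("!") == 33
assert ord("\\") == 92

ok_hand = "\U0001F44C"
print(unicodedata.name(ok_hand))        # the character's official name
print(hex(ord(ok_hand)), ord(ok_hand))  # code point in hex and decimal

# Part two, ASCII flavour: the number sits in the low seven bits of one byte.
ascii_byte = "A".encode("ascii")[0]
assert ascii_byte == 0x41 and ascii_byte < 0x80  # high bit is clear

# Part two, Unicode flavour: the same code point, two different encodings.
utf8 = ok_hand.encode("utf-8")       # variable width: four bytes for this one
utf32 = ok_hand.encode("utf-32-be")  # fixed width: always four bytes
print(utf8.hex(), utf32.hex())

# The "list of Python integers" idea: concrete storage, but not bytes.
codepoints = [ord(ch) for ch in "A" + ok_hand]
roundtrip = "".join(chr(n) for n in codepoints)
print(codepoints)
```

The two hex strings differ even though the code point is the same,
which is the whole point: the number is Unicode, the bytes are the
encoding.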