From: Chris Angelico
Newsgroups: comp.lang.python
Date: Sat, 7 Dec 2013 10:42:13 +1100
Subject: Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

On Sat, Dec 7, 2013 at 6:00 AM, Steven D'Aprano wrote:
> - character 33 was permitted to be either the exclamation
> mark ! or the logical OR symbol |
>
> - consequently character 124 (vertical bar) was always
> displayed as a broken bar ¦, which explains why even today
> many keyboards show it that way
>
> - character 35 was permitted to be either the number sign # or
> the pound sign £
>
> - character 94 could be either a caret ^ or a logical NOT ¬

Yeah, good fun stuff.

I first met several of these ambiguities in the OS/2 REXX
documentation, which detailed the language's operators by specifying
their byte values as well as their characters - for instance, this
quote from the docs (yeah, I still have it all here):

"""
Note: Depending upon your Personal System keyboard and the code page
you are using, you may not have the solid vertical bar to select. For
this reason, REXX also recognizes the use of the split vertical bar as
a logical OR symbol. Some keyboards may have both characters. If so,
they are not interchangeable; only the character that is equal to the
ASCII value of 124 works as the logical OR. This type of mismatch can
also cause the character on your screen to be different from the
character on your keyboard.
"""

(The front material on the docs says "(C) Copyright IBM Corp. 1987,
1994. All Rights Reserved.")

It says "ASCII value" where on this list we would be more likely to
call it "byte value", and I'd prefer to say "represented by" rather
than "equal to", but nonetheless, this is still clearly distinguishing
characters and bytes. The language spec is written in terms of
characters, but ultimately the interpreter is going to be looking at
bytes, so when there's a problem, it's byte 124 that's the one defined
as logical OR.

Oh, and note the copyright date. The byte/char distinction isn't new.

ChrisA
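
For anyone who wants to poke at the same character/byte distinction in Python 3, here is a minimal sketch. It only shows which byte values represent the solid and broken bars under a few encodings. The helper names (byte_values, is_logical_or_byte) are made up for illustration and have nothing to do with how REXX is actually implemented; the "only byte 124 counts" rule is simply lifted from the quoted note.

```python
# Characters are abstract; an interpreter ultimately sees bytes.
SOLID_BAR = "|"          # U+007C VERTICAL LINE
BROKEN_BAR = "\u00a6"    # U+00A6 BROKEN BAR


def byte_values(char, encoding):
    """Return the list of byte values that represent char in encoding."""
    return list(char.encode(encoding))


def is_logical_or_byte(value):
    """Hypothetical check modelled on the quoted rule: only byte 124 counts."""
    return value == 124


# The solid bar is represented by the single byte 124 in ASCII.
print(byte_values(SOLID_BAR, "ascii"))     # [124]

# The broken bar is a different character with different representations.
print(byte_values(BROKEN_BAR, "latin-1"))  # [166]
print(byte_values(BROKEN_BAR, "utf-8"))    # [194, 166]

# It has no ASCII representation at all.
try:
    byte_values(BROKEN_BAR, "ascii")
except UnicodeEncodeError:
    print("broken bar has no ASCII byte")
```

Note that "represented by" is the accurate phrasing here too: the character is the same in every case, and only the bytes change with the encoding.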