From: Gene Heskett
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
Date: Fri, 6 Dec 2013 14:34:54 -0500
References: <5f370a06-8d2c-4d7d-bc22-b9a489c15c59@googlegroups.com> <52a21ec1$0$30003$c3e8da3$5496439d@news.astraweb.com>
In-Reply-To: <52a21ec1$0$30003$c3e8da3$5496439d@news.astraweb.com>

On Friday 06 December 2013 14:30:06 Steven D'Aprano did opine:

> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
> > Evidently (and completely inadvertently) this exchange has just
> > illustrated one of the inadmissible assumptions:
> >
> > "unicode as a medium is universal in the same way that ASCII used to
> > be"
>
> Ironically, your post was not Unicode.
>
> Seriously. I am 100% serious.
>
> Your post was sent using a legacy encoding, Windows-1252, also known as
> CP-1252, which is most certainly *not* Unicode. Whatever software you
> used to send the message correctly flagged it with a charset header:
>
> Content-Type: text/plain; charset=windows-1252
>
> Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
> encodings correctly (or at all!), it screws up the encoding then sends a
> reply with no charset line at all. This is one bug that cannot be blamed
> on Google Groups -- or on Unicode.
>
> > I wrote a number of ellipsis characters ie codepoint 2026 as in:
> Actually you didn't. You wrote a number of ellipsis characters, hex byte
> \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
> code point U+2026 in Unicode, but the two are as distinct as ASCII and
> EBCDIC.
>
> > Somewhere between my sending and your quoting those ellipses became
> > the replacement character FFFD
>
> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
> encodings and character sets. It doesn't just assume things are ASCII,
> but makes a half-hearted attempt to be charset-aware, but badly. I can
> only imagine that it was written back in the Dark Ages where there were
> a lot of different charsets in use but no conventions for specifying
> which charset was in use. Or perhaps the author was smoking crack while
> coding.
>
> > Leaving aside whose fault this is (very likely buggy google groups),
> > this mojibaking cannot happen if the assumption "All text is ASCII"
> > were to uniformly hold.
>
> This is incorrect. People forget that ASCII has evolved since the first
> version of the standard in 1963.
> There have actually been five versions
> of the ASCII standard, plus one unpublished version. (And that's not
> including the things which are frequently called ASCII but aren't.)
>
> ASCII-1963 didn't even include lowercase letters. It is also missing
> some graphic characters like braces, and included at least two
> characters no longer used, the up-arrow and left-arrow. The control
> characters were also significantly different from today.
>
> ASCII-1965 was unpublished and unused. I don't know the details of what
> it changed.
>
> ASCII-1967 is a lot closer to the ASCII in use today. It made
> considerable changes to the control characters, moving, adding,
> removing, or renaming at least half a dozen control characters. It
> officially added lowercase letters, braces, and some others. It
> replaced the up-arrow character with the caret and the left-arrow with
> the underscore. It was ambiguous, allowing variations and
> substitutions, e.g.:
>
> - character 33 was permitted to be either the exclamation
>   mark ! or the logical OR symbol |
>
> - consequently character 124 (vertical bar) was always
>   displayed as a broken bar ¦, which explains why even today
>   many keyboards show it that way
>
> - character 35 was permitted to be either the number sign # or
>   the pound sign £
>
> - character 94 could be either a caret ^ or a logical NOT ¬
>
> Even the humble comma could be pressed into service as a cedilla.
>
> ASCII-1968 didn't change any characters, but allowed the use of LF on
> its own. Previously, you had to use either LF/CR or CR/LF as newline.
>
> ASCII-1977 removed the ambiguities from the 1967 standard.
>
> The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
> Unfortunately I haven't been able to find out what changes were made --
> I presume they were minor, and didn't affect the character set.
>
> So as you can see, even with actual ASCII, you can have mojibake. It's
> just not normally called that.
> But if you are given an arbitrary ASCII
> file of unknown age, containing code 94, how can you be sure it was
> intended as a caret rather than a logical NOT symbol? You can't.
>
> Then there are at least 30 official variations of ASCII, strictly
> speaking part of ISO-646. These 7-bit codes were commonly called "ASCII"
> by their users, despite the differences, e.g. replacing the dollar sign
> $ with the international currency sign ¤, or replacing the left brace
> { with the letter s with caron š.
>
> One consequence of this is that the MIME type for ASCII text is called
> "US ASCII", despite the redundancy, because many people expect "ASCII"
> alone to mean whatever national variation they are used to.
>
> But it gets worse: there are proprietary variations on ASCII which are
> commonly called "ASCII" but aren't, including dozens of 8-bit so-called
> "extended ASCII" character sets, which is where the problems *really*
> pile up. Invariably back in the 1980s and early 1990s people used to
> call these "ASCII" no matter that they used 8 bits and contained
> anything up to 256 characters.
>
> Just because somebody calls something "ASCII", doesn't make it so; even
> if it is ASCII, doesn't mean you know which version of ASCII; even if
> you know which version, doesn't mean you know how to interpret certain
> codes. It simply is *wrong* to think that "good ol' plain ASCII text"
> is unambiguous and devoid of problems.
>
> > With unicode there are in-memory formats, transportation formats eg
> > UTF-8,
>
> And the same applies to ASCII.
>
> ASCII is a *seven-bit code*. It will work fine on computers where the
> word-size is seven bits. If the word-size is eight bits, or more, you
> have to pad the ASCII code. How do you do that? Pad the most-significant
> end or the least significant end? That's a choice there. How do you pad
> it, with a zero or a one? That's another choice. If your word-size is
> more than eight bits, you might even pad *both* ends.
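
The padding choices described above can be made concrete with a rough
Python sketch. The function name pad_code and its parameters are made up
for illustration; nothing here comes from a standard or a library, it
just enumerates the pad-end and pad-bit choices for one character.

```python
# Illustrative only: ways a 7-bit ASCII code could be padded into a wider cell.
# pad_code and its parameters are hypothetical names chosen for this sketch.

def pad_code(ch, width=8, pad_bit="0", side="msb"):
    """Return the bit string for a 7-bit ASCII character padded to `width` bits."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    bits = format(code, "07b")           # the bare 7-bit code, MSB on the left
    padding = pad_bit * (width - 7)      # extra bits needed to fill the cell
    # Pad either the most significant end (left) or the least significant end.
    return padding + bits if side == "msb" else bits + padding

# Four different 8-bit cells that could all claim to hold ASCII "a".
for side in ("msb", "lsb"):
    for pad_bit in ("0", "1"):
        print(side, pad_bit, pad_code("a", 8, pad_bit, side))
```

Only the first combination (zero padding at the most significant end)
matches what most present-day machines do; the other three are equally
valid readings of "store a 7-bit code in 8 bits".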
>
> In C, a char is defined as the smallest addressable unit of the machine
> that can contain the basic character set, not necessarily eight bits.
> Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits
> as a "byte" and/or char. Your in-memory representation of ASCII "a"
> could easily end up as bits 001100001 or 0000000001100001.
>
> And then there is the question of whether ASCII characters should be Big
> Endian or Little Endian. I'm referring here to bit endianness, rather
> than bytes: should character 'a' be represented as bits 1100001 (most
> significant bit to the left) or 1000011 (least significant bit to the
> left)? This may be relevant with certain networking protocols. Not all
> networking protocols are big-endian, nor are all processors. The Ada
> programming language even supports both bit orders.
>
> When transmitting ASCII characters, the networking protocol could
> include various start and stop bits and parity codes. A single 7-bit
> ASCII character might be anything up to 12 bits in length on the wire.
> It is simply naive to imagine that the transmission of ASCII codes is
> the same as the in-memory or on-disk storage of ASCII.
>
> You're lucky to be active in a time when most common processors have
> standardized on a single bit-order, and when most (but not all) network
> protocols have done the same. But that doesn't mean that these issues
> don't exist for ASCII. If you get a message that purports to be ASCII
> text but looks like this:
>
> "\tS\x1b\x1b{\x01u{'\x1b\x13!"
>
> you should suspect strongly that it is "Hello World!" which has been
> accidentally bit-reversed by some rogue piece of hardware.

You can lay a lot of the ASCII ambiguity on D.E.C. and their vt series
terminals; anything newer than a vt100 made liberal use of the msbit in a
character. Having written an emulator for the vt-220, I can testify that
really getting it right was a right pain in the ass. And then I added
zmodem triggers and detections.
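
For anyone who wants to check the bit-reversed string quoted above, here
is a rough Python sketch. The function names are made up for
illustration. As far as I can tell, the quoted string matches reversing
each code's binary digits without first padding to seven bits; a strict
7-bit reversal differs only for the space and the exclamation mark, the
two codes below 64.

```python
# Illustrative only: reverse the bit order of each character code.
# reverse_bits and bit_reverse_text are hypothetical helper names.

def reverse_bits(code, width=None):
    """Reverse the binary digits of `code`, zero-padding to `width` bits first if given."""
    bits = format(code, "b") if width is None else format(code, "0{}b".format(width))
    return int(bits[::-1], 2)

def bit_reverse_text(text, width=None):
    """Apply reverse_bits to every character in `text`."""
    return "".join(chr(reverse_bits(ord(c), width)) for c in text)

quoted = "\tS\x1b\x1b{\x01u{'\x1b\x13!"            # the string from the quoted post

strict = bit_reverse_text("Hello World!", width=7)  # every code treated as 7 bits
unpadded = bit_reverse_text("Hello World!")         # leading zero bits dropped first

print(repr(strict))
print(repr(unpadded))
print(unpadded == quoted)

# At a fixed width the reversal undoes itself, so the text can be recovered.
print(bit_reverse_text(strict, width=7))
```

Either way the point stands: the bytes are all legal 7-bit codes, and
nothing in the data itself says which bit order was intended.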

Cheers, Gene
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page
Mother Earth is not flat!
A pen in the hand of this president is far more dangerous than 200
million guns in the hands of law-abiding citizens.