Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

From Gene Heskett <gheskett@wdtv.com>
Subject Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
Date 2013-12-06 14:34 -0500
References <5f370a06-8d2c-4d7d-bc22-b9a489c15c59@googlegroups.com> <ae4b6a4d-fbd4-4d10-a860-9589e6045d16@googlegroups.com> <52a21ec1$0$30003$c3e8da3$5496439d@news.astraweb.com>
Newsgroups comp.lang.python
Message-ID <mailman.3663.1386358504.18130.python-list@python.org>


On Friday 06 December 2013 14:30:06 Steven D'Aprano did opine:

> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
> > Evidently (and completely inadvertently) this exchange has just
> > illustrated one of the inadmissible assumptions:
> > 
> > "unicode as a medium is universal in the same way that ASCII used to
> > be"
> 
> Ironically, your post was not Unicode.
> 
> Seriously. I am 100% serious.
> 
> Your post was sent using a legacy encoding, Windows-1252, also known as
> CP-1252, which is most certainly *not* Unicode. Whatever software you
> used to send the message correctly flagged it with a charset header:
> 
> Content-Type: text/plain; charset=windows-1252
> 
> Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
> encodings correctly (or at all!), it screws up the encoding then sends a
> reply with no charset line at all. This is one bug that cannot be blamed
> on Google Groups -- or on Unicode.
> 
> > I wrote a number of ellipsis characters ie codepoint 2026 as in:
> Actually you didn't. You wrote a number of ellipsis characters, hex byte
> \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
> code point U+2026 in Unicode, but the two are as distinct as ASCII and
> EBCDIC.
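
A minimal Python sketch of the distinction drawn above, using only the standard codecs. Under CP-1252 the single byte 0x85 stands for an ellipsis, while U+2026 is a code point that UTF-8 serializes as three different bytes. The last step shows one way a stray 0x85 can end up as U+FFFD; that is an illustrative assumption, not a diagnosis of what MT-NewsWatcher actually did.

```python
raw = b"\x85"  # the byte actually sent in the CP-1252 message

# Decoding with the declared charset yields the ellipsis, code point U+2026.
text = raw.decode("cp1252")
print(hex(ord(text)))        # 0x2026

# The same character serialized as UTF-8 is three different bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)            # b'\xe2\x80\xa6'

# Treating the lone byte as UTF-8 is invalid; with errors="replace"
# the decoder substitutes the replacement character U+FFFD.
garbled = raw.decode("utf-8", errors="replace")
print(hex(ord(garbled)))     # 0xfffd
```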
> 
> > Somewhere between my sending and your quoting those ellipses became
> > the replacement character FFFD
> 
> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
> encodings and character sets. It doesn't just assume things are ASCII;
> it makes a half-hearted attempt to be charset-aware, and does it badly. I can
> only imagine that it was written back in the Dark Ages where there were
> a lot of different charsets in use but no conventions for specifying
> which charset was in use. Or perhaps the author was smoking crack while
> coding.
> 
> > Leaving aside whose fault this is (very likely buggy google groups),
> > this mojibaking cannot happen if the assumption "All text is ASCII"
> > were to uniformly hold.
> 
> This is incorrect. People forget that ASCII has evolved since the first
> version of the standard in 1963. There have actually been five versions
> of the ASCII standard, plus one unpublished version. (And that's not
> including the things which are frequently called ASCII but aren't.)
> 
> ASCII-1963 didn't even include lowercase letters. It is also missing
> some graphic characters like braces, and included at least two
> characters no longer used, the up-arrow and left-arrow. The control
> characters were also significantly different from today.
> 
> ASCII-1965 was unpublished and unused. I don't know the details of what
> it changed.
> 
> ASCII-1967 is a lot closer to the ASCII in use today. It made
> considerable changes to the control characters, moving, adding,
> removing, or renaming at least half a dozen control characters. It
> officially added lowercase letters, braces, and some others. It
> replaced the up-arrow character with the caret and the left-arrow with
> the underscore. It was ambiguous, allowing variations and
> substitutions, e.g.:
> 
>     - character 33 was permitted to be either the exclamation
>       mark ! or the logical OR symbol |
> 
>     - consequently character 124 (vertical bar) was always
>       displayed as a broken bar ¦, which explains why even today
>       many keyboards show it that way
> 
>     - character 35 was permitted to be either the number sign # or
>       the pound sign £
> 
>     - character 94 could be either a caret ^ or a logical NOT ¬
> 
> Even the humble comma could be pressed into service as a cedilla.
> 
> ASCII-1968 didn't change any characters, but allowed the use of LF on
> its own. Previously, you had to use either LF/CR or CR/LF as newline.
> 
> ASCII-1977 removed the ambiguities from the 1967 standard.
> 
> The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
> Unfortunately I haven't been able to find out what changes were made --
> I presume they were minor, and didn't affect the character set.
> 
> So as you can see, even with actual ASCII, you can have mojibake. It's
> just not normally called that. But if you are given an arbitrary ASCII
> file of unknown age, containing code 94, how can you be sure it was
> intended as a caret rather than a logical NOT symbol? You can't.
> 
> Then there are at least 30 official variations of ASCII, strictly
> speaking part of ISO-646. These 7-bit codes were commonly called "ASCII"
> by their users, despite the differences, e.g. replacing the dollar sign
> $ with the international currency sign ¤, or replacing the left brace
> { with the letter s with caron š.
> 
> One consequence of this is that the MIME type for ASCII text is called
> "US ASCII", despite the redundancy, because many people expect "ASCII"
> alone to mean whatever national variation they are used to.
> 
> But it gets worse: there are proprietary variations on ASCII which are
> commonly called "ASCII" but aren't, including dozens of 8-bit so-called
> "extended ASCII" character sets, which is where the problems *really*
> pile up. Back in the 1980s and early 1990s people invariably called
> these "ASCII" even though they used 8 bits and contained anything up
> to 256 characters.
> 
> Just because somebody calls something "ASCII", doesn't make it so; even
> if it is ASCII, doesn't mean you know which version of ASCII; even if
> you know which version, doesn't mean you know how to interpret certain
> codes. It simply is *wrong* to think that "good ol' plain ASCII text"
> is unambiguous and devoid of problems.
> 
> > With unicode there are in-memory formats, transportation formats eg
> > UTF-8,
> 
> And the same applies to ASCII.
> 
> ASCII is a *seven-bit code*. It will work fine on computers where the
> word-size is seven bits. If the word-size is eight bits, or more, you
> have to pad the ASCII code. How do you do that? Pad the most-significant
> end or the least significant end? That's a choice there. How do you pad
> it, with a zero or a one? That's another choice. If your word-size is
> more than eight bits, you might even pad *both* ends.
> 
> In C, a char is defined as the smallest addressable unit of the machine
> that can contain the basic character set, not necessarily eight bits.
> Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits
> as a "byte" and/or char. Your in-memory representation of ASCII "a"
> could easily end up as bits 001100001 or 0000000001100001.
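
The padding choices described above can be sketched in Python. This is only an illustration using string formatting of bit patterns; it makes no claim about how any particular machine or C implementation lays out a char.

```python
code = ord("a")                  # 97, which fits in seven bits
assert code < 128

seven = format(code, "07b")      # '1100001'
nine = format(code, "09b")       # '001100001', zero-padded at the most significant end
sixteen = format(code, "016b")   # '0000000001100001'
print(seven, nine, sixteen)

# Padding at the least significant end instead gives a different value.
low_padded = seven + "0"         # '11000010'
print(int(low_padded, 2))        # 194
```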
> 
> And then there is the question of whether ASCII characters should be Big
> Endian or Little Endian. I'm referring here to bit endianness, rather
> than bytes: should character 'a' be represented as bits 1100001 (most
> significant bit to the left) or 1000011 (least significant bit to the
> left)? This may be relevant with certain networking protocols. Not all
> networking protocols are big-endian, nor are all processors. The Ada
> programming language even supports both bit orders.
> 
> When transmitting ASCII characters, the networking protocol could
> include various start and stop bits and parity codes. A single 7-bit
> ASCII character might be anything up to 12 bits in length on the wire.
> It is simply naive to imagine that the transmission of ASCII codes is
> the same as the in-memory or on-disk storage of ASCII.
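
As a rough sketch of why the wire format differs from the stored code, here is one possible framing, assuming a UART-style convention (one start bit, seven data bits sent least significant bit first, an even-parity bit, then stop bits). The function name `frame_char` and the choice of framing are illustrative assumptions, not something specified in the post.

```python
def frame_char(ch, stop_bits=1):
    """Return the list of bits sent for one 7-bit character (assumed framing)."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    data = [(code >> i) & 1 for i in range(7)]   # data bits, LSB first
    parity = sum(data) % 2                       # even parity: total count of 1s is even
    return [0] + data + [parity] + [1] * stop_bits

bits = frame_char("a")
print(bits, len(bits))               # 10 bits on the wire for a 7-bit code
print(len(frame_char("a", stop_bits=2)))
```

With one stop bit the seven-bit code occupies ten bit times; a second stop bit makes it eleven.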
> 
> You're lucky to be active in a time when most common processors have
> standardized on a single bit-order, and when most (but not all) network
> protocols have done the same. But that doesn't mean that these issues
> don't exist for ASCII. If you get a message that purports to be ASCII
> text but looks like this:
> 
> "\tS\x1b\x1b{\x02u{'\x1b\x13B"
> 
> you should suspect strongly that it is "Hello World!" which has been
> accidentally bit-reversed by some rogue piece of hardware.
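
The bit reversal described above can be sketched in Python, assuming each character is treated as a fixed-width seven-bit code. The helper names `reverse7` and `reverse_text` are made up for this illustration.

```python
def reverse7(code):
    """Reverse the bit order of a 7-bit code, keeping leading zeros."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit code")
    return int(format(code, "07b")[::-1], 2)

def reverse_text(s):
    """Apply the 7-bit reversal to every character of a string."""
    return "".join(chr(reverse7(ord(c))) for c in s)

scrambled = reverse_text("Hello")
print(repr(scrambled))            # '\tS\x1b\x1b{'
print(reverse_text(scrambled))    # Hello
```

Reversal is its own inverse only when the width is fixed: codes below 64 must keep their leading zero, so space (0100000) maps to \x02 and "!" (0100001) maps to "B".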

You can lay a lot of the ASCII ambiguity on D.E.C. and their vt series
terminals; anything newer than a vt100 made liberal use of the msbit in a
character.  Having written an emulator for the vt-220, I can testify that
really getting it right was a right pain in the ass.  And then I added
zmodem triggers and detections.

Cheers, Gene
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

Mother Earth is not flat!
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
         law-abiding citizens.
