Re: ASCII and Unicode [was Re: Managing Google Groups headaches]

Newsgroups comp.lang.python
Date 2013-12-06 18:33 -0800
References (3 earlier) <roy-9C2ADB.19585404122013@news.panix.com> <51007240-6bc9-4f0b-9937-4883bcc0ceb6@googlegroups.com> <roy-1384C7.02363006122013@news.panix.com> <ae4b6a4d-fbd4-4d10-a860-9589e6045d16@googlegroups.com> <52a21ec1$0$30003$c3e8da3$5496439d@news.astraweb.com>
Message-ID <41e2ac37-d751-43b4-a720-e8da58429420@googlegroups.com>
Subject Re: ASCII and Unicode [was Re: Managing Google Groups headaches]
From rusi <rustompmody@gmail.com>



On Saturday, December 7, 2013 12:30:18 AM UTC+5:30, Steven D'Aprano wrote:
> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

> > Evidently (and completely inadvertently) this exchange has just
> > illustrated one of the inadmissible assumptions:
> > "unicode as a medium is universal in the same way that ASCII used to be"

> Ironically, your post was not Unicode.

> Seriously. I am 100% serious.

> Your post was sent using a legacy encoding, Windows-1252, also known as 
> CP-1252, which is most certainly *not* Unicode. Whatever software you 
> used to send the message correctly flagged it with a charset header:

> Content-Type: text/plain; charset=windows-1252

> Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle 
> encodings correctly (or at all!), it screws up the encoding then sends a 
> reply with no charset line at all. This is one bug that cannot be blamed 
> on Google Groups -- or on Unicode.
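The charset flag mentioned above is an ordinary header parameter, so it is easy to check what a given message declares. A minimal sketch using Python's standard email package; the two sample messages and the helper name declared_charset are made up for illustration and are not taken from the actual posts:

```python
from email import message_from_string


def declared_charset(raw_message):
    """Return the charset named in the Content-Type header, or None."""
    return message_from_string(raw_message).get_content_charset()


# Hypothetical messages: one flagged like the post described above,
# one with no charset line at all.
flagged = "Content-Type: text/plain; charset=windows-1252\n\nSome text\n"
unflagged = "Subject: no charset here\n\nSome text\n"

print(declared_charset(flagged))    # windows-1252
print(declared_charset(unflagged))  # None
```

A reader that gets None back has to guess the encoding, which is where the trouble usually starts.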

> > I wrote a number of ellipsis characters ie codepoint 2026 as in:

> Actually you didn't. You wrote a number of ellipsis characters, hex byte 
> \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to 
> code point U+2026 in Unicode, but the two are as distinct as ASCII and 
> EBCDIC.

> > Somewhere between my sending and your quoting those ellipses became the
> > replacement character FFFD

> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
> encodings and character sets. It doesn't just assume things are ASCII, 
> but makes a half-hearted attempt to be charset-aware, but badly. I can 
> only imagine that it was written back in the Dark Ages where there were a 
> lot of different charsets in use but no conventions for specifying which 
> charset was in use. Or perhaps the author was smoking crack while coding.
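The byte-versus-code-point distinction in this exchange can be checked directly. A small sketch, assuming only Python's built-in codecs. It shows one plausible way an ellipsis turns into U+FFFD (decoding the CP1252 byte with the wrong codec); it is not a diagnosis of what MT-NewsWatcher actually does:

```python
raw = b"\x85"  # the single byte that CP1252 uses for the ellipsis

# Decoded with the declared charset, the byte maps to U+2026.
as_cp1252 = raw.decode("cp1252")

# Decoded as UTF-8 or ASCII, the same byte is invalid, and a lenient
# decoder substitutes the replacement character U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")
as_ascii = raw.decode("ascii", errors="replace")

# In UTF-8 the same character is three bytes, not one.
ellipsis_utf8 = "\u2026".encode("utf-8")

print(hex(ord(as_cp1252)))  # 0x2026
print(hex(ord(as_utf8)))    # 0xfffd
print(ellipsis_utf8)        # b'\xe2\x80\xa6'
```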

> > Leaving aside whose fault this is (very likely buggy google groups),
> > this mojibaking cannot happen if the assumption "All text is ASCII" were
> > to uniformly hold.

> This is incorrect. People forget that ASCII has evolved since the first 
> version of the standard in 1963. There have actually been five versions 
> of the ASCII standard, plus one unpublished version. (And that's not 
> including the things which are frequently called ASCII but aren't.)

> ASCII-1963 didn't even include lowercase letters. It is also missing some 
> graphic characters like braces, and included at least two characters no 
> longer used, the up-arrow and left-arrow. The control characters were 
> also significantly different from today.

> ASCII-1965 was unpublished and unused. I don't know the details of what 
> it changed.

> ASCII-1967 is a lot closer to the ASCII in use today. It made 
> considerable changes to the control characters, moving, adding, removing, 
> or renaming at least half a dozen control characters. It officially added 
> lowercase letters, braces, and some others. It replaced the up-arrow 
> character with the caret and the left-arrow with the underscore. It was 
> ambiguous, allowing variations and substitutions, e.g.:

>     - character 33 was permitted to be either the exclamation 
>       mark ! or the logical OR symbol |

>     - consequently character 124 (vertical bar) was always 
>       displayed as a broken bar ¦, which explains why even today
>       many keyboards show it that way

>     - character 35 was permitted to be either the number sign # or 
>       the pound sign £

>     - character 94 could be either a caret ^ or a logical NOT ¬

> Even the humble comma could be pressed into service as a cedilla.

> ASCII-1968 didn't change any characters, but allowed the use of LF on its 
> own. Previously, you had to use either LF/CR or CR/LF as newline.

> ASCII-1977 removed the ambiguities from the 1967 standard.

> The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). 
> Unfortunately I haven't been able to find out what changes were made -- I 
> presume they were minor, and didn't affect the character set.

> So as you can see, even with actual ASCII, you can have mojibake. It's 
> just not normally called that. But if you are given an arbitrary ASCII 
> file of unknown age, containing code 94, how can you be sure it was 
> intended as a caret rather than a logical NOT symbol? You can't.

> Then there are at least 30 official variations of ASCII, strictly 
> speaking part of ISO-646. These 7-bit codes were commonly called "ASCII" 
> by their users, despite the differences, e.g. replacing the dollar sign $ 
> with the international currency sign ¤, or replacing the left brace 
> { with the letter s with caron š.

> One consequence of this is that the MIME type for ASCII text is called 
> "US ASCII", despite the redundancy, because many people expect "ASCII" 
> alone to mean whatever national variation they are used to.

> But it gets worse: there are proprietary variations on ASCII which are 
> commonly called "ASCII" but aren't, including dozens of 8-bit so-called 
> "extended ASCII" character sets, which is where the problems *really* 
> pile up. Invariably back in the 1980s and early 1990s people used to call 
> these "ASCII" no matter that they used 8 bits and contained anything up 
> to 256 characters.

> Just because somebody calls something "ASCII", doesn't make it so; even 
> if it is ASCII, doesn't mean you know which version of ASCII; even if you 
> know which version, doesn't mean you know how to interpret certain codes. 
> It simply is *wrong* to think that "good ol' plain ASCII text" is 
> unambiguous and devoid of problems.

> > With unicode there are in-memory formats, transportation formats eg
> > UTF-8, 

> And the same applies to ASCII. 

> ASCII is a *seven-bit code*. It will work fine on computers where the 
> word-size is seven bits. If the word-size is eight bits, or more, you 
> have to pad the ASCII code. How do you do that? Pad the most-significant 
> end or the least significant end? That's a choice there. How do you pad 
> it, with a zero or a one? That's another choice. If your word-size is 
> more than eight bits, you might even pad *both* ends.

> In C, a char is defined as the smallest addressable unit of the machine 
> that can contain the basic character set, not necessarily eight bits. 
> Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits 
> as a "byte" and/or char. Your in-memory representation of ASCII "a" could 
> easily end up as bits 001100001 or 0000000001100001.
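The padding choices described above are easy to see by printing the same code at different widths. A rough sketch using Python's format(); the variable names are arbitrary, and padding at the least-significant end is shown only to illustrate that the choice exists, not because any particular machine does it:

```python
code = ord("a")  # 97

seven = format(code, "07b")      # the bare 7-bit ASCII code
nine = format(code, "09b")       # zero-padded at the top for a 9-bit char
sixteen = format(code, "016b")   # zero-padded at the top for a 16-bit char

# The other choice: pad the least-significant end instead.
# The same seven bits now read as a different number.
padded_low_end = seven + "0"

print(seven)                   # 1100001
print(nine)                    # 001100001
print(sixteen)                 # 0000000001100001
print(int(padded_low_end, 2))  # 194
```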

> And then there is the question of whether ASCII characters should be Big 
> Endian or Little Endian. I'm referring here to bit endianness, rather 
> than bytes: should character 'a' be represented as bits 1100001 (most 
> significant bit to the left) or 1000011 (least significant bit to the 
> left)? This may be relevant with certain networking protocols. Not all 
> networking protocols are big-endian, nor are all processors. The Ada 
> programming language even supports both bit orders.

> When transmitting ASCII characters, the networking protocol could include 
> various start and stop bits and parity codes. A single 7-bit ASCII 
> character might be anything up to 12 bits in length on the wire. It is 
> simply naive to imagine that the transmission of ASCII codes is the same 
> as the in-memory or on-disk storage of ASCII.
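To make the on-the-wire point concrete, here is a sketch of one common asynchronous serial framing, often written 7E1: one start bit, seven data bits sent least significant bit first, one even-parity bit, and one stop bit. The choice of 7E1 and the function name frame_7e1 are assumptions made for illustration; the post above does not name a specific protocol.

```python
def frame_7e1(ch):
    """Return the list of bits sent on the wire for one ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    # Seven data bits, least significant bit first.
    data = [(code >> i) & 1 for i in range(7)]
    # Even parity: the parity bit makes the total count of 1 bits even.
    parity = sum(data) % 2
    # Start bit is 0, stop bit is 1.
    return [0] + data + [parity] + [1]


frame = frame_7e1("a")
print(frame)       # [0, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(len(frame))  # 10
```

Seven bits of character become ten bits of line traffic here; other settings (two stop bits, for example) make the frame longer still.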

> You're lucky to be active in a time when most common processors have 
> standardized on a single bit-order, and when most (but not all) network 
> protocols have done the same. But that doesn't mean that these issues 
> don't exist for ASCII. If you get a message that purports to be ASCII 
> text but looks like this:

> "\tS\x1b\x1b{\x01u{'\x1b\x13!"

> you should suspect strongly that it is "Hello World!" which has been 
> accidentally bit-reversed by some rogue piece of hardware.
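The bit-reversal claim can be checked with a few lines of Python. This is a sketch with made-up function names, not the code used to produce the quoted string. One detail worth noting: the quoted string matches a reversal of each code's binary digits without zero-padding to seven bits. A strict 7-bit reversal differs in two places, the space (\x02 rather than \x01) and the exclamation mark ('B' rather than '!').

```python
def reverse_bits_7(ch):
    """Reverse a character's code within a fixed 7-bit field."""
    return chr(int(format(ord(ch), "07b")[::-1], 2))


def reverse_bits_unpadded(ch):
    """Reverse only the significant binary digits, with no padding."""
    return chr(int(bin(ord(ch))[2:][::-1], 2))


text = "Hello World!"
strict = "".join(reverse_bits_7(c) for c in text)
unpadded = "".join(reverse_bits_unpadded(c) for c in text)

print(repr(unpadded))  # matches the string quoted above
print(repr(strict))    # differs at the space and the '!'

# A strict 7-bit reversal is its own inverse, so applying it twice
# recovers the original text.
recovered = "".join(reverse_bits_7(c) for c in strict)
print(recovered)       # Hello World!
```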

Oof! That's a lot of data to digest! Thanks anyway.

There's one thing I want to get into:

> Your post was sent using a legacy encoding, Windows-1252, also known as 
> CP-1252, which is most certainly *not* Unicode. Whatever software you 
> used to send the message correctly flagged it with a charset header:

What the hell! I am using Firefox 25.0 on Debian testing and posting via GG.

$ locale
shows me:
LANG=en_US.UTF-8

and a bunch of other variables, all set to en_US.UTF-8.

For the most part, when I point FF at any site and go to View ->
Character Encoding, it says Unicode (UTF-8).

However, when I go to anything in the Python archives:
https://mail.python.org/pipermail/python-list/2013-December/

FF shows it as Western (Windows-1252).

That seems to suggest that something is not right with the Python
mailing list config. No?
