From: Keith Thompson
Newsgroups: comp.lang.c
Subject: Re: unicode is a fail
Date: Wed, 09 Dec 2015 12:04:49 -0800

supercat@casperkitty.com writes:
> On Wednesday, December 9, 2015 at 11:35:32 AM UTC-6, Keith Thompson wrote:
>> But the ASCII control characters 0..31 and 127 are *very* useful
>> and necessary.  Neither vi nor emacs would work without them.
>
> Codes 127/255 are an interesting case.  The purpose of 127/255 was not to
> perform an action, but rather to be a nop alternative to 0.  A blank row
> of punch-tape reads as zero; an all-holes-punched row reads as FF.  If the
> operator of an ASR-33 was typing a story and made a mistake, the procedure
> for making a correction was to push the back-one-row button on the punch
> (which mechanically moved the paper back one row without sending any sort
> of code) and then punch the "rub-out" button which sent code 127/255.  The
> existence of the rub-out character on the tape would increase transmission
> time by a tenth of a second, but not have any other adverse consequences.

Sure -- but code 127 (in ASCII, Latin-1, and Unicode) is DEL, which is a
control character used in interactive input.  It commonly denotes
deleting a character, but only because of the mnemonic name, not because
it has 7 bits set to 1.  And 255 is LATIN SMALL LETTER Y WITH DIAERESIS.
The history of the all-rows-punched semantics is interesting, but it
doesn't directly affect modern usage.

> As for codes 0x80-0x9F, those were set aside I think because some terminals
> regard 0x80-0xFF as synonymous with 0x00-0x7F on reception, which meant that
> if a terminal was being used for display-only purposes there was no need to
> worry about parity settings.  If one sent a document with 8-bit character
> data to a terminal configured for 7 bits ignore parity, characters beyond
> 0xA0 would show up as alternative characters, but everything else would
> appear as it should.  If the document used characters 0x80-0x9F as printable
> characters, they could cause the appearance of other characters to be
> garbled.

I don't know why they were *originally* set aside, but certainly Latin-N
and Unicode don't treat them as equivalent to the 0..31 control
characters.  For example, U+0006 is ACKNOWLEDGE or ACK, and U+0086 is
START OF SELECTED AREA.

And Windows-1252 has printable characters in (most of) the range
128..160; as far as I know that hasn't caused any problems other than
incompatibility with non-Windows character sets.  (Windows-1252
apparently was originally intended to be an ANSI standard, but ISO 8859
went in a different direction for some reason.)

--
Keith Thompson (The_Other_Keith) kst-u@mib.org
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"