Path: csiph.com!weretis.net!feeder9.news.weretis.net!panix!.POSTED.panix5.panix.com!qz!not-for-mail
From: Eli the Bearded <*@eli.users.panix.com>
Newsgroups: comp.os.linux.misc,alt.folklore.computers
Subject: Re: ISO 8859-1 ("Latin 1") (was: Recent history of vi)
Date: Thu, 20 Nov 2025 02:09:45 -0000 (UTC)
Organization: Some absurd concept
Message-ID:
References: <10fig07$8oe$1@news.misty.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Date: Thu, 20 Nov 2025 02:09:45 -0000 (UTC)
Injection-Info: reader2.panix.com; posting-host="panix5.panix.com:166.84.1.5"; logging-data="17539"; mail-complaints-to="abuse@panix.com"
User-Agent: Vectrex rn 2.1 (beta)
X-Liz: It's actually happened, the entire Internet is a massive game of Redcode
X-Motto: "Erosion of rights never seems to reverse itself." -- kenny@panix
X-US-Congress: Moronic Fucks.
X-Attribution: EtB
XFrom: is a real address
Encrypted: double rot-13
Xref: csiph.com comp.os.linux.misc:77749 alt.folklore.computers:232243

In comp.os.linux.misc, Michael Bäuerle wrote:
> ISO 8859-1 ("Latin 1") is a special case. No mapping table is required
> for conversion to Unicode, because all ISO 8859-1 codepoints have 1:1
> mappings to Unicode codepoints. This means any UTF can be directly
> applied to ISO 8859-1 codepoints.
...
> The MIME declaration "ISO-8859-1" includes C0 and C1 control characters.

Be technical. The MIME charset ISO-8859-1 includes the C0 and C1 control
characters and has all of its characters at the same codepoints as
Unicode, but the character encoding is different from all Unicode
character encodings.

"charset" is a very specific term from MIME and it conflates character
set with character encoding. In a world where all characters fit in
eight bits, that's a very easy mistake to make, but since the MIME
designers were aware of (and specifically working to accommodate) worlds
where 8-bit encodings might not be used, that was a poor choice.

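To make the set-versus-encoding distinction concrete, here is a minimal
Python sketch (not part of the original exchange; the sample character
is an arbitrary pick). It first checks that every ISO-8859-1 octet
decodes to the Unicode codepoint with the same number, then shows that
one shared codepoint still comes out as different octets depending on
the encoding.

```python
# Illustrative sketch; the sample character is an arbitrary choice.

# Every ISO-8859-1 octet decodes to the Unicode codepoint with the same
# number, C0 and C1 controls included, so no mapping table is needed.
all_octets = bytes(range(256))
decoded = all_octets.decode("iso-8859-1")
assert [ord(c) for c in decoded] == list(range(256))

# The codepoint is shared, but the octets differ per encoding.
ch = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, codepoint 228
latin1_octets = ch.encode("iso-8859-1")
utf8_octets = ch.encode("utf-8")
utf16_octets = ch.encode("utf-16-be")

print(ord(ch), latin1_octets, utf8_octets, utf16_octets)
# prints: 228 b'\xe4' b'\xc3\xa4' b'\x00\xe4'
```

Only the ISO-8859-1 octet happens to equal the codepoint number; the
two Unicode encodings shown spell the same codepoint differently.
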
charset="utf-8" is an encoding using variable lengths for all of the
codepoints in the Unicode character set. In UTF-8, codepoints that are
under 128 are encoded in a single octet with the highbit unset. All
codepoints over 127 are encoded in multiple octets, all with the highbit
set.

charset="utf-7" is an encoding using variable lengths for the codepoints
in the Unicode character set. In UTF-7 some characters are left as is,
some characters (those above codepoint 65535) have to be split into
UTF-16 surrogate pairs first, and many characters are multibyte
sequences. But critically, none of the bytes have the highbit set.

charset="utf-ebcdic" is an encoding using variable lengths for all of
the codepoints in the Unicode character set. In UTF-EBCDIC, an encoding
very similar to UTF-8 encodes Unicode codepoints five bits at a time
into EBCDIC. Codepoints that are under 160 are encoded in a single octet
and codepoints above 159 are encoded in multiple octets, all with the
highbit set. Only the C1 control characters are native highbit-set
EBCDIC.

Elijah
------
here is the map to the map you want
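
The highbit claims for UTF-8 and UTF-7 can be checked with a short
Python sketch. This is an illustration only: the helper name high_bits
and the sample characters are made up for the example, and UTF-EBCDIC is
left out because the Python standard library ships no codec for it.

```python
def high_bits(data: bytes) -> list:
    """Return, for each octet, whether its high bit (0x80) is set."""
    return [bool(octet & 0x80) for octet in data]


# Arbitrary sample: 'a' (codepoint 97), a-diaeresis (228), euro sign (8364).
sample = "a\u00e4\u20ac"

# UTF-8: a codepoint under 128 is one octet with the high bit unset.
assert high_bits("a".encode("utf-8")) == [False]

# UTF-8: codepoints over 127 are multiple octets, all with the high bit set.
for ch in sample[1:]:
    octets = ch.encode("utf-8")
    assert len(octets) > 1 and all(high_bits(octets))

# UTF-7: no octet has the high bit set, even for non-ASCII text.
utf7_octets = sample.encode("utf-7")
assert not any(high_bits(utf7_octets))

# The UTF-7 form still round-trips to the same codepoints.
assert utf7_octets.decode("utf-7") == sample
```
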