From: Keith Thompson
Newsgroups: comp.lang.c
Subject: Re: Rationale for aligning data on even bytes in a Unix shell file?
Date: Fri, 09 May 2025 12:20:36 -0700
Message-ID: <87selda8vv.fsf@nosuchdomain.example.com>

BGB writes:
> On 5/9/2025 12:52 PM, Bonita Montero wrote:
>> On 07.05.2025 at 12:08, BGB wrote:
>>
>>> If you know one side is UTF-8 and the other is UTF-16, then
>>> conversion does not need to know or care which locale is in effect.
>> Unicode hasn't locales, i.e. alternative meanings for the same
>> code-point. Even the characters from 128 to 255 are fixed to Latin-1.
>>
>
> A locale is not an encoding; nor is it a codepage.
>
> A locale is a set of formatting and language-specific rules to apply.
>
> Which, in some past contexts, may have been associated with the usage
> of specific code pages, but codepages are N/A with Unicode. Even as
> such, various language specific rules may still exist.
>
> For things like case-folding, you may still need to care about which
> language (AKA, locale) is in effect, as some conversions may apply to
> some languages but not others.
>
> Some letters case-map differently depending on the language, ligatures
> may be in effect (which may compose/decompose or map to other
> ligatures), etc.
>
>
> Or, one just throws a lot of this out and uses a simplified set of
> "mostly language neutral" rules.
>
> Say, case conversion maps:
>   Upper: 0061..007A -> 0041..005A
>   Lower: 0041..005A -> 0061..007A
>   Upper: 00E0..00FE -> 00C0..00DE
>   Lower: 00C0..00DE -> 00E0..00FE
>   ... (Add a few more, for Greek / Cyrillic / etc)
>
> And, maybe a few special cases, say (*):
>   009A <-> 008A
>   009C <-> 008C
>   009E <-> 008E
>   00FF <-> 009F
> *: Assuming the "1252 mappings in Unicode Space replacing C1 controls" wonk.
>
> Probably ignore most everything else, it passes through as-is.

There are a number of existing filesystems that are case-insensitive
(and mostly case-preserving): FAT, NTFS, ext4 with certain options, etc.
Presumably all of these already have established rules for case mapping,
determining whether two given characters like 'a' and 'A' are to be
treated as the "same".  (I don't happen to know what those rules are or
how they differ from one filesystem type to another.)

Why are you trying to invent yet another set of rules?

https://xkcd.com/927/

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
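
For concreteness, the simplified range-based mapping quoted above might be
sketched in C roughly as follows. This is only an illustration of the
quoted ranges, not the rule set of FAT, NTFS, ext4, or any other real
filesystem. The names simple_upper, simple_lower, and simple_same are
hypothetical, and skipping U+00D7 and U+00F7 (the multiplication and
division signs, which sit inside the quoted Latin-1 ranges but are not
letters) is an assumption added here. The quoted CP-1252 special cases
are left out, so those code points pass through unchanged.

```c
#include <stdint.h>

/* Hypothetical helper: uppercase one code point using only the quoted
 * ASCII and Latin-1 ranges.  Everything else passes through as-is. */
uint32_t simple_upper(uint32_t cp)
{
    if (cp >= 0x0061 && cp <= 0x007A)   /* 0061..007A -> 0041..005A */
        return cp - 0x20;
    /* 00E0..00FE -> 00C0..00DE, except U+00F7 (division sign).
     * Skipping it is an assumption, not part of the quoted table. */
    if (cp >= 0x00E0 && cp <= 0x00FE && cp != 0x00F7)
        return cp - 0x20;
    return cp;
}

/* Hypothetical helper: the inverse mapping for the same ranges. */
uint32_t simple_lower(uint32_t cp)
{
    if (cp >= 0x0041 && cp <= 0x005A)   /* 0041..005A -> 0061..007A */
        return cp + 0x20;
    /* 00C0..00DE -> 00E0..00FE, except U+00D7 (multiplication sign). */
    if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7)
        return cp + 0x20;
    return cp;
}

/* Two code points count as the "same" if they lowercase identically. */
int simple_same(uint32_t a, uint32_t b)
{
    return simple_lower(a) == simple_lower(b);
}
```

With these definitions, simple_same('a', 'A') is nonzero, while U+00FF is
returned unchanged by simple_upper because its uppercase form lies outside
the Latin-1 block.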