From: Steve Thompson
Newsgroups: comp.lang.c
Subject: Re: unicode is a fail
Date: Sun, 06 Dec 2015 07:34:15 +0000
Organization: Friends of the Galactic Collective
References: <2qyvC0.96Q.SQT8q@gmail.com>

On Sat, Dec 05, 2015 at 11:47:45AM +0000, BartC wrote:
> On 05/12/2015 01:04, Steve Thompson wrote:
> >On Fri, Dec 04, 2015 at 11:46:52PM +0000, BartC wrote:
>
> >>Fine, then we move to 16 bits, which had long been anticipated anyway,
> >>and gives us plenty of room for special symbols. But not if we have to
> >>throw in every single alphabet and writing system that anybody has ever
> >>heard of (and apparently plenty that no one has heard of!).
> >
> >I rather suspect the Anthropologists will scream bloody murder if
> >Egyptian hieroglyphics, Linear B, and all the rest are excluded.
>
> They probably wouldn't notice. Whatever software they use to enter and
> display the characters would still work if a different encoding scheme
> was used.
>
> Or many might prefer just using mark-up to describe it:
> {snake}{bird}{water}.

It seems to me that the code positions for those two scripts are
already assigned.

> >>(And then you have vast, sprawling 'alphabets' like Chinese which are
> >>words rather than the letters used to build the words.)
> >
> >So go tell the Chinese (and Japanese, and Thais, and ...) that they
> >should man-up and use a Western alphabet.  Such schemes exist, after
> >all.
>
> No, they can use the same alphabets, but they don't put them all into
> one giant melting pot with every other.
>
> Now, I can no longer write what had been trivial string handling
> routines such as capitalise, toupper, reverse, compare, left, leftn,
> etc etc. All are very well defined in ASCII, but would no longer be
> guaranteed to work with Unicode because most of the alphabets are so weird.

I'm not sure what to say.  As others have pointed out (or suggested),
the complexity of language conventions is a product of undirected
evolution throughout history.  It may be a mess, but nevertheless it
has to be dealt with.

Sorting in particular is a problem if one requires case insensitivity.
I suppose the only solution is a good set of per-language tables which
can be put in arrays for quick access.

The combining characters are another problem.

From the "unicode" man-page on my system:

   Implementation Levels
       As not all systems are expected to support advanced mechanisms
       like combining characters, ISO 10646-1 specifies the following
       three implementation levels of UCS:

       Level 1  Combining characters and Hangul Jamo (a variant
                encoding of the Korean script, where a Hangul syllable
                glyph is coded as a triplet or pair of vowel/consonant
                codes) are not supported.

       Level 2  In addition to level 1, combining characters are now
                allowed for some languages where they are essential
                (e.g., Thai, Lao, Hebrew, Arabic, Devanagari,
                Malayalam).

       Level 3  All UCS characters are supported.

       The Unicode 3.0 Standard published by the Unicode Consortium
       contains exactly the UCS Basic Multilingual Plane at
       implementation level 3, as described in ISO 10646-1:2000.
       Unicode 3.1 added the supplemental planes of ISO 10646-2.  The
       Unicode standard and technical reports published by the Unicode
       Consortium provide much additional information on the semantics
       and recommended usages of various characters.
       They provide guidelines and algorithms for editing, sorting,
       comparing, normalizing, converting and displaying Unicode
       strings.

I wonder what their algorithm hints are.  Unfortunately, that is
something I just don't have time to treat in depth at the moment.


Regards,

Steve Thompson

-- 
"If I had a nickel for every time some idiot called me about a computer
problem that turned out to be user error, I would be able to retire and
spend the rest of my days cultivating clues in my backyard hillside
garden."
  -- MysteryDog in 24hoursupport.helpdesk.
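
To make the combining-character problem mentioned above a little more
concrete, here is a minimal sketch in C.  It is an illustration only,
not anything taken from the Unicode reports: the helper names
(utf8_decode, is_combining, count_units) are made up for this example,
the input is assumed to be well-formed UTF-8, and only the Combining
Diacritical Marks block (U+0300 to U+036F) is treated as combining.
Real code would need the full property tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into *cp and return the number of
   bytes consumed.  ASSUMPTION: the input is well-formed UTF-8; no
   validation is done, which is a simplification for this sketch. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18)
        | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6)
        | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* ASSUMPTION: only U+0300..U+036F counts as combining here.  Many
   other combining characters exist elsewhere in the code space. */
static int is_combining(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count code points, and "clusters" formed by a base character plus
   any combining marks that immediately follow it. */
static void count_units(const char *str, size_t *codepoints,
                        size_t *clusters)
{
    const unsigned char *s = (const unsigned char *)str;

    assert(str != NULL);
    *codepoints = 0;
    *clusters = 0;
    while (*s != 0) {
        uint32_t cp;
        s += utf8_decode(s, &cp);
        ++*codepoints;
        if (!is_combining(cp))
            ++*clusters;    /* a combining mark joins the previous base */
    }
}
```

Calling count_units on "e\xCC\x81" (the letter e followed by U+0301,
COMBINING ACUTE ACCENT) reports two code points but one cluster, and
the string is three bytes long.  A reverse or leftn routine that works
byte by byte, or even code point by code point, can therefore separate
the accent from its letter.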