
unicode is a fail

Started by: fir <profesor.fir@gmail.com>
First post: 2015-12-02 08:01 -0800
Last post: 2015-12-06 13:45 +0000
Articles: 20 on this page of 158, 25 participants



Contents

  unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 08:01 -0800
    Re: unicode is a fail me <self@example.org> - 2015-12-02 16:12 +0000
      Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 09:09 -0800
    Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 08:18 -0800
      Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 09:07 -0800
        Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 11:21 -0600
          Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 09:40 -0800
          Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 11:22 -0800
            Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 15:59 -0600
              Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 16:25 -0800
                Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 19:47 -0600
            Re: unicode is a fail supercat@casperkitty.com - 2015-12-02 14:38 -0800
              Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 16:26 -0800
                Re: unicode is a fail Tim Rentsch <txr@alumni.caltech.edu> - 2015-12-09 11:33 -0800
                  Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-09 12:21 -0800
          Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 11:28 +0100
            Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 08:50 -0600
              Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 16:38 +0100
                Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 10:01 -0600
              Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-03 09:46 -0800
              Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-04 12:39 +0000
            Re: unicode is a fail supercat@casperkitty.com - 2015-12-03 08:26 -0800
              Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-03 18:42 +0000
                Re: unicode is a fail supercat@casperkitty.com - 2015-12-03 17:14 -0800
                  Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 19:02 -0800
                  Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-04 06:35 +0000
                    Re: unicode is a fail David Thompson <dave.thompson2@verizon.net> - 2015-12-28 05:11 -0500
                  Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 10:24 -0600
              Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-03 22:37 +0000
                Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-04 11:32 +0100
      Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 11:10 -0600
        Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 09:24 -0800
          Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 13:10 -0600
            Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-02 19:45 +0000
              Re: unicode is a fail Ian Collins <ian-news@hotmail.com> - 2015-12-03 09:08 +1300
              Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 14:10 -0600
        Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 11:27 -0800
          Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 15:21 -0600
            Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 15:18 -0800
              Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-04 12:45 +0000
      Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 09:43 -0800
        Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 11:40 -0800
          Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 12:19 -0800
        Re: unicode is a fail Nobody <nobody@nowhere.invalid> - 2015-12-02 21:23 +0000
      Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 10:12 +0100
        Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 02:13 -0800
          Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 14:11 +0100
            Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 05:17 -0800
              Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 15:33 +0100
                Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 07:05 -0800
                  Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 16:42 +0100
                    Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 07:58 -0800
        Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-03 10:38 +0000
          Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 14:17 +0100
        Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-04 12:54 +0000
          Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-04 14:25 +0100
            Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-04 13:46 +0000
    Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-02 23:24 +0000
      Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-03 00:45 +0000
        Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 20:59 -0600
        Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 19:13 -0800
        Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-03 07:00 +0000
          Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-04 04:45 -0800
            Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-04 18:04 +0000
          Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-04 13:22 +0000
            Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-04 07:35 -0800
            Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-04 19:17 +0000
              Re: unicode is a fail supercat@casperkitty.com - 2015-12-04 11:49 -0800
                Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 15:39 -0600
                  Re: unicode is a fail supercat@casperkitty.com - 2015-12-04 14:19 -0800
                    Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-06 12:57 -0600
                      Re: unicode is a fail supercat@casperkitty.com - 2015-12-06 15:47 -0800
                Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-05 01:13 +0000
                  Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-05 01:59 +0000
                    Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-05 17:17 +0100
                    Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-06 06:28 +0000
              Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-04 23:46 +0000
                Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-05 01:04 +0000
                  Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 03:21 -0800
                    Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-05 13:03 -0600
                  Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-05 11:47 +0000
                    Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 04:40 -0800
                      Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-05 13:26 +0000
                        Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-05 13:35 -0600
                          Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-06 02:23 +0000
                            Re: unicode is a fail Udyant Wig <udyantw@gmail.com> - 2015-12-06 16:09 +0530
                      Re: unicode is a fail Xavier <zaz.colmant@free.fr> - 2015-12-05 15:45 +0100
                        Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 07:42 -0800
                    Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-05 16:32 -0800
                      Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 18:11 -0800
                      Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-06 02:19 +0000
                        Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-06 13:09 +0000
                          Re: unicode is a fail Martin Shobe <martin.shobe@yahoo.com> - 2015-12-06 18:38 -0600
                            Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-07 01:55 +0000
                              Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-06 19:14 -0800
                                Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-07 13:53 +0000
                                  Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-07 06:31 -0800
                                    Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-07 21:22 +0000
                                    Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-07 15:34 -0600
                                      Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-07 16:36 -0800
                                      Re: unicode is a fail Lowell Gilbert <lgusenet@be-well.ilk.org> - 2015-12-08 11:40 -0500
                                        Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-08 17:18 +0000
                                          Re: unicode is a fail "Osmium" <r124c4u102@comcast.net> - 2015-12-09 08:36 -0600
                                            Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-09 10:06 -0600
                                            Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-09 09:35 -0800
                                              Re: unicode is a fail supercat@casperkitty.com - 2015-12-09 10:07 -0800
                                                Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-09 12:04 -0800
                                                  Re: unicode is a fail supercat@casperkitty.com - 2015-12-09 12:35 -0800
                                                    Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-09 23:46 +0000
                                                      Re: unicode is a fail supercat@casperkitty.com - 2015-12-09 16:15 -0800
                                                        Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-10 03:49 +0000
                                                  Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-09 18:12 -0600
                                              Re: unicode is a fail James Kuyper <jameskuyper@verizon.net> - 2015-12-09 13:12 -0500
                                                Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-09 12:12 -0800
                                              Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-10 20:48 +0000
                                            Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-09 23:44 +0000
                                              Re: unicode is a fail Robert Wessel <robertwessel2@yahoo.com> - 2015-12-10 01:13 -0600
                                                Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-10 10:39 +0000
                                                  Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-10 03:33 -0800
                                                  Re: unicode is a fail supercat@casperkitty.com - 2015-12-10 06:07 -0800
                                                  Re: unicode is a fail "Osmium" <r124c4u102@comcast.net> - 2015-12-10 08:21 -0600
                                            Re: unicode is a fail Robert Wessel <robertwessel2@yahoo.com> - 2015-12-10 00:59 -0600
                                Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-07 14:33 +0000
                              Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-06 22:45 -0600
                                Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-07 12:38 +0000
                                  Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-07 13:55 -0600
                                    Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-07 21:14 +0000
                                      Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-07 16:50 -0600
                              Re: unicode is a fail Robert Wessel <robertwessel2@yahoo.com> - 2015-12-07 02:38 -0600
                    Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-06 07:34 +0000
                      Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-06 00:24 -0800
                Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 19:49 -0600
              Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-05 21:32 +0000
                Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 13:50 -0800
                  Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-05 22:15 +0000
                    Re: unicode is a fail James Kuyper <jameskuyper@verizon.net> - 2015-12-05 17:27 -0500
                      Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-05 23:06 +0000
                        Re: unicode is a fail James Kuyper <jameskuyper@verizon.net> - 2015-12-05 18:29 -0500
                          Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-05 23:50 +0000
                    Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-06 06:38 +0000
                      Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-06 13:33 +0000
                Re: unicode is a fail James Kuyper <jameskuyper@verizon.net> - 2015-12-05 16:51 -0500
                Re: unicode is a fail Ian Collins <ian-news@hotmail.com> - 2015-12-06 10:59 +1300
                  Re: unicode is a fail Ian Collins <ian-news@hotmail.com> - 2015-12-06 11:00 +1300
                Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-06 06:31 +0000
      Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 17:48 -0800
        Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-03 01:20 -0800
          Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-03 02:02 -0800
      Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 09:43 -0600
      Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-04 12:55 +0000
        Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-04 18:29 +0000
          Re: unicode is a fail Jorgen Grahn <grahn+nntp@snipabacken.se> - 2015-12-05 16:42 +0000
      Re: unicode is a fail Jorgen Grahn <grahn+nntp@snipabacken.se> - 2015-12-05 10:06 +0000
        OT: Usenet (Was: unicode is a fail) Steve Thompson <stevet810@gmail.com> - 2015-12-05 20:41 +0000
          Re: OT: Usenet (Was: unicode is a fail) Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 13:18 -0800
        Re: unicode is a fail Udyant Wig <udyantw@gmail.com> - 2015-12-06 10:21 +0530
          OT: Facebook (was Re: unicode is a fail) Jorgen Grahn <grahn+nntp@snipabacken.se> - 2015-12-06 08:51 +0000
            Re: OT: Facebook (was Re: unicode is a fail) raltbos@xs4all.nl (Richard Bos) - 2015-12-06 13:45 +0000



#77879

From: BartC <bc@freeuk.com>
Date: 2015-12-05 11:47 +0000
Message-ID: <n3uiop$p98$1@dont-email.me>
In reply to: #77867

On 05/12/2015 01:04, Steve Thompson wrote:
> On Fri, Dec 04, 2015 at 11:46:52PM +0000, BartC wrote:

>> Fine, then we move to 16 bits, which had long been anticipated anyway,
>> and gives us plenty of room for special symbols. But not if we have to
>> throw in every single alphabet and writing system that anybody has ever
>> heard of (and apparently plenty that no one has heard of!).
>
> I rather suspect the Anthropologists will scream bloody murder if
> Egyptian hieroglyphics, Linear B, and all the rest are excluded.

They probably wouldn't notice. Whatever software they use to enter and 
display the characters would still work if a different encoding scheme 
was used.

Or many might prefer just using mark-up to describe it: 
{snake}{bird}{water}.

>> (And then you have vast, sprawling 'alphabets' like Chinese which are
>> words rather than the letters used to build the words.)
>
> So go tell the Chinese (and Japanese, and Thais, and ...) that they
> should man-up and use a Western alphabet.  Such schemes exist, after
> all.

No, they can use the same alphabets, but they don't put them all into 
one giant melting pot with every other.

Now, I can no longer write what had been trivial string handling 
routines such as capitalise, toupper, reverse, compare, left, leftn, 
etc etc. All are very well defined in ASCII, but would no longer be 
guaranteed to work with Unicode because most of the alphabets are so weird.
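
To make the ASCII side of this concrete, here is a minimal C sketch of two of the routines named above. It is not from the original post: the names ascii_toupper and ascii_reverse are made up for illustration, and the code assumes an ASCII execution character set, where 'a'..'z' are contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Upper-case a string in place. Assumes ASCII, where 'a'..'z' are
   contiguous; every other byte is left untouched. */
static void ascii_toupper(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s >= 'a' && *s <= 'z')
            *s = (char)(*s - 'a' + 'A');
    }
}

/* Reverse a string in place, one byte at a time. This is only correct
   when every character occupies exactly one byte. */
static void ascii_reverse(char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);
    for (size_t i = 0; i < n / 2; i++) {
        char tmp = s[i];
        s[i] = s[n - 1 - i];
        s[n - 1 - i] = tmp;
    }
}
```

Both helpers rely on one byte being one character, which is the property that stops holding once a character can span several bytes or several code points.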

-- 
Bartc



#77881

From: Malcolm McLean <malcolm.mclean5@btinternet.com>
Date: 2015-12-05 04:40 -0800
Message-ID: <b88ee903-17ed-4f0d-8ebc-308a22fd4de8@googlegroups.com>
In reply to: #77879

On Saturday, December 5, 2015 at 11:48:12 AM UTC, Bart wrote:
> 
> Now, I can no longer write what had been trivial string handling 
> routines such as capitalise, toupper, reverse, compare, left, leftn, 
> etc etc. All are very well defined in ASCII, but would no longer be 
> guaranteed to work with Unicode because most of the alphabets are so weird.
> 
The concept of capitals is also pretty weird. We accept it as normal
because we grew up with it.
It's just reality. Some operations either won't make sense or
will be problematic if you don't use English or a closely related
language. E.g. a French capital E cannot take an acute accent.
So if we capitalise touché (a favourite French word) we get
TOUCHE, and if we put it back we get touche, touch, which doesn't
mean the same thing - the reversibility rule which holds in 
English is broken. But that's a characteristic of French, not
a problem created by Unicode.



#77882

From: BartC <bc@freeuk.com>
Date: 2015-12-05 13:26 +0000
Message-ID: <n3uoh3$e71$1@dont-email.me>
In reply to: #77881

On 05/12/2015 12:40, Malcolm McLean wrote:
> On Saturday, December 5, 2015 at 11:48:12 AM UTC, Bart wrote:
>>
>> Now, I can no longer write what had been trivial string handling
>> routines such as capitalise, toupper, reverse, compare, left, leftn,
>> etc etc. All are very well defined in ASCII, but would no longer be
>> guaranteed to work with Unicode because most of the alphabets are so weird.
>>
> The concept of capitals is also pretty weird. We accept it as normal
> because we grew up with it.
> It's just reality. Some operations either won't make sense or
> will be problematic if you don't use English or a closely related
> language. E.g. a French capital E cannot take an acute accent.
> So if we capitalise touché (a favourite French word) we get
> TOUCHE, and if we put it back we get touche, touch, which doesn't
> mean the same thing - the reversibility rule which holds in
> English is broken. But that's a characteristic of French, not
> a problem created by Unicode.
>

But an accented E exists (É). Would TOUCHÉ be meaningless to a French 
speaker?

I'm mainly familiar with Italian where accents are used to indicate 
stress if it deviates from the rules, but it appears to be optional.

Stress can also be significant in English (PROject, proJECT), but we 
seem to manage without marking the difference, and the two versions of 
'project' will always match.

I just want to be able to write code like this without worrying about 
all the murky areas of Unicode (scripting code not C):

forall w in words do
     if reverse(w)=w then
         println w,"is a palindrome"

     fi
od

Here the set of words is a list of about 100,000 English language words, 
stored in lower case.

How would this be changed to accommodate Unicode? Well, reverse() would 
be useless for a start because of the many special cases (characters 
changing depending on their position in a word for example).
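
As a rough illustration of where the difficulty starts, here is a C sketch (not from the original post; utf8_reverse and utf8_is_palindrome are hypothetical names) that reverses a string by code point rather than by byte. It assumes the input is valid UTF-8 and makes no attempt to handle combining marks or position-dependent letter forms, so it covers only the simplest layer of the problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Copy src into dst with the order of its code points reversed.
   Assumes src is valid UTF-8 and dst has room for strlen(src) + 1 bytes.
   Combining marks are NOT kept attached to their base character. */
static void utf8_reverse(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t out = 0;
    size_t end = strlen(src);   /* one past the last byte of the current code point */

    while (end > 0) {
        size_t start = end - 1;
        /* Walk back over continuation bytes to reach the lead byte. */
        while (start > 0 && is_continuation((unsigned char)src[start]))
            start--;
        memcpy(dst + out, src + start, end - start);
        out += end - start;
        end = start;
    }
    dst[out] = '\0';
}

/* A word counts as a palindrome here if it equals its code-point reversal. */
static bool utf8_is_palindrome(const char *word)
{
    char buf[256];
    if (strlen(word) >= sizeof buf)
        return false;           /* too long for this sketch */
    utf8_reverse(buf, word);
    return strcmp(buf, word) == 0;
}
```

With this, a word such as "été" (stored as precomposed UTF-8) is reported as a palindrome, whereas a plain byte-wise reverse would split each two-byte é and produce invalid UTF-8. A word typed with a separate combining accent would still be mishandled, which is closer to the special cases described above.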

Then, the dictionary of words would surely have to be upgraded to 
include all the words of every language in the world. Why not? We 
already have a giant character set of all the world's alphabets. Unicode 
makes it harder to impose boundaries.

Further, the concept of a palindrome itself is probably meaningless with 
most languages.

So, this little program either has to get unfeasibly complicated to meet 
expectations, or it's argued out of existence altogether!

-- 
Bartc



#77910

From: Stephen Sprunk <stephen@sprunk.org>
Date: 2015-12-05 13:35 -0600
Message-ID: <n3ve4o$gi7$1@dont-email.me>
In reply to: #77882

On 05-Dec-15 07:26, BartC wrote:
> On 05/12/2015 12:40, Malcolm McLean wrote:
>> The concept of capitals is also pretty weird. We accept it as normal 
>> because we grew up with it. It's just reality. Some operations
>> either won't make sense or will be problematic if you don't use
>> English or a closely related language. E.g. a French capital E
>> cannot take an acute accent. So if we capitalise touché (a
>> favourite French word) we get TOUCHE, and if we put it back we get
>> touche, touch, which doesn't mean the same thing - the
>> reversibility rule which holds in English is broken. But that's a
>> characteristic of French, not a problem created by Unicode.
> 
> But an accented E exists (É). Would TOUCHÉ be meaningless to a
> French speaker?

Not using accents on upper case letters became _tolerated_ during the
typewriter era because the dead-key trick for lower case letters would
put an accent _inside_ a capital letter, but that was never "correct",
and it's less tolerated now that computers can easily get it right.

> I'm mainly familiar with Italian where accents are used to indicate 
> stress if it deviates from the rules, but it appears to be optional.

Spanish does the same, but they're not optional; "esta" and "está" are
two very different words, for instance.

French doesn't have stressed syllables; instead, accents are used to
change the _sound_ of vowels.  It's rare for words to differ only in
accents, but when they do, a fluent speaker can easily figure out the
right word from the context, so missing accents aren't fatal.

> Further, the concept of a palindrome itself is probably meaningless
> with most languages.

Well, it probably makes sense for alphabets, abjads and abugidas, but
probably not for syllabaries or logographies.

> So, this little program either has to get unfeasibly complicated to
> meet expectations, or it's argued out of existence altogether!

Or you just restrict the problem domain that you're addressing.

Most calculators only work with Arabic numerals, not Roman ones, and I
don't see anyone complaining about that.  The latter is, at most, an
interesting exercise for Programming 101 courses.

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking



#77937

From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date: 2015-12-06 02:23 +0000
Message-ID: <n4067j$r68$1@speranza.aioe.org>
In reply to: #77910

Stephen Sprunk <stephen@sprunk.org> wrote:

(snip)

> French doesn't have stressed syllables; instead, accents are used to
> change the _sound_ of vowels.  It's rare for words to differ only in
> accents, but when they do, a fluent speaker can easily figure out the
> right word from the context, so missing accents aren't fatal.

Are you sure that there are no fatal mispronunciations?

Some years ago, the New York Times did a correction of mistaking
poisonous snack for poisonous snake (or the other way around).
At the same time, I knew someone who spoke English such that
I couldn't tell which one she was saying.  It might not be
good to get those wrong.

-- glen



#77957

From: Udyant Wig <udyantw@gmail.com>
Date: 2015-12-06 16:09 +0530
Message-ID: <87610bx3na.fsf@rudiments.goosenet.in>
In reply to: #77937

glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:
> Some years ago, the New York Times did a correction of mistaking
> poisonous snack for poisonous snake (or the other way around).  At the
> same time, I knew someone who spoke English such that I couldn't tell
> which one she was saying.  It might not be good to get those wrong.

  When this happens, most times it is hilarious.  At other times, not so
  much.
  
> -- glen

-- 
Udyant Wig



#77885

From: Xavier <zaz.colmant@free.fr>
Date: 2015-12-05 15:45 +0100
Message-ID: <alpine.LNX.2.20.1512051532030.22558@cruxy2.freebox.fr>
In reply to: #77881

On Sat, 5 Dec 2015, Malcolm McLean wrote:

> The concept of capitals is also pretty weird. We accept it as normal
> because we grew up with it.
> It's just reality. Some operations either won't make sense or
> will be problematic if you don't use English or a closely related
> language. E.g. a French capital E cannot take an acute accent.
> So if we capitalise touché (a favourite French word) we get
> TOUCHE, and if we put it back we get touche, touch, which doesn't
> mean the same thing - the reversibility rule which holds in
> English is broken. But that's a characteristic of French, not
> a problem created by Unicode.
>
In French, you have to put accents on capital letters, according to
the Académie française, « Le bon usage » by Grevisse, etc.

Many French people are confused about this issue. It's probably
due to the fact that typewriters couldn't do it properly.

À bientôt,

Xavier



#77890

From: Malcolm McLean <malcolm.mclean5@btinternet.com>
Date: 2015-12-05 07:42 -0800
Message-ID: <bd9f56b1-a8fc-4faa-9ebe-36d0fb8b2213@googlegroups.com>
In reply to: #77885

On Saturday, December 5, 2015 at 2:45:49 PM UTC, Xavier wrote:
> 
> In French, you have to put accents on capital letters, according to
> the Académie française, « Le bon usage » by Grevisse, etc.
> 
> Many French people are confused about this issue. It's probably
> due to the fact that typewriters couldn't do it properly.
> 
I was taught that a capital may not take an accent, in French.

Of course that was only a schoolmaster's view. The academy may
hold differently, so I stand corrected on that.



#77929

From: Keith Thompson <kst-u@mib.org>
Date: 2015-12-05 16:32 -0800
Message-ID: <lnwpsspgbt.fsf@kst-u.example.com>
In reply to: #77879

BartC <bc@freeuk.com> writes:
> On 05/12/2015 01:04, Steve Thompson wrote:
>> On Fri, Dec 04, 2015 at 11:46:52PM +0000, BartC wrote:
[...]
>>> (And then you have vast, sprawling 'alphabets' like Chinese which are
>>> words rather than the letters used to build the words.)
>>
>> So go tell the Chinese (and Japanese, and Thais, and ...) that they
>> should man-up and use a Western alphabet.  Such schemes exist, after
>> all.
>
> No, they can use the same alphabets, but they don't put them all into 
> one giant melting pot with every other.

So you want users of Asian writing systems to use their own separate
character set encodings, incompatible with the encodings used in
Western countries.

Because that way it's more convenient for you.

Sorry, but the decision has already been made.  Unicode combines
most of the world's character sets into a single standard, and that's
not going to change.  Complain all you like (preferably elsewhere);
it's not going to make any difference.

No doubt you have some ideas for how HTML web pages can include
both ASCII-encoded tag names and Chinese characters.  Which means
there has to be a way to combine Latin and Chinese characters in
a single document anyway.

> Now, I can no longer write what had been trivial string handling 
> routines such as capitalise, toupper, reverse, compare, left, leftn, 
> etc etc. All are very well defined in ASCII, but would no longer be 
> guaranteed to work with Unicode because most of the alphabets are so weird.

Too bad.  The "giant melting pot" you worry about already exists, and is
used for most text transmitted over the Internet.

If you want to write software that only deals with ASCII, you're
absolutely free to do so, and you can do as much trivial string
handling as you like.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#77934

From: Malcolm McLean <malcolm.mclean5@btinternet.com>
Date: 2015-12-05 18:11 -0800
Message-ID: <09a2c396-9ccf-4385-8712-89e80cef6cef@googlegroups.com>
In reply to: #77929

On Sunday, December 6, 2015 at 12:32:39 AM UTC, Keith Thompson wrote:
> 
> No doubt you have some ideas for how HTML web pages can include
> both ASCII-encoded tag names and Chinese characters.  Which means
> there has to be a way to combine Latin and Chinese characters in
> a single document anyway.
> 
In fact mixed text is only going to get more common. There's a
massive move to learn English in places like China and Eastern
Europe, and you see things like Manchester United football shirts
with Chinese adverts on them.
 



#77935

From: BartC <bc@freeuk.com>
Date: 2015-12-06 02:19 +0000
Message-ID: <n405rh$8nu$1@dont-email.me>
In reply to: #77929

On 06/12/2015 00:32, Keith Thompson wrote:
> BartC <bc@freeuk.com> writes:

>> No, they can use the same alphabets, but they don't put them all into
>> one giant melting pot with every other.
>
> So you want users of Asian writing systems to use their own separate
> character set encodings,

Well, they'd have the advantage of starting from code-point 0!

Imagine what we'd think about ASCII (an offset version) starting at 
code-point 0x27F80 in some supplementary plane.
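
To put a number on that: under UTF-8 (one of Unicode's encoding forms; the post does not say which encoding is meant), ASCII code points take one byte each, while a code point such as U+27F80 takes four. The C sketch below is illustrative only and not from the original post; utf8_encode is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[]. Returns the number of bytes
   written (1..4), or 0 if cp is a surrogate or lies above U+10FFFF. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                    /* rest of the BMP: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* supplementary planes: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Here utf8_encode(0x41, buf) returns 1 and utf8_encode(0x27F80, buf) returns 4, so text written in a script placed at such an offset would cost four times as many bytes as the same text in ASCII.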

> Because that way it's more convenient for you.

Maybe to others too. (What's next for the Unicode architects, to combine 
all the programming languages of the world into one giant syntax? What 
could possibly go wrong?!)

> No doubt you have some ideas for how HTML web pages can include
> both ASCII-encoded tag names and Chinese characters.  Which means
> there has to be a way to combine Latin and Chinese characters in
> a single document anyway.

HTML pages can include all sorts of junk, of which the character 
encoding scheme, when you can locate any actual text, might be a small part.

>> Now, I can no longer write what had been trivial string handling
>> routines such as capitalise, toupper, reverse, compare, left, leftn,
>> etc etc. All are very well defined in ASCII, but would no longer be
>> guaranteed to work with Unicode because most of the alphabets are so weird.
>
> Too bad.  The "giant melting pot" you worry about already exists, and is
> used for most text transmitted over the Internet.
>
> If you want to write software that only deals with ASCII, you're
> absolutely free to do so, and you can do as much trivial string
> handling as you like.

Yes, I can, and I can have my own scheme for dealing with the extra 
characters I need. Then it will need conversions for interacting with 
anything else. (But most likely I will end up using a 16-bit scheme that 
can represent the BMP.)
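
The "trivial string handling" point can be made concrete with a small sketch. Both function names are illustrative only, and the validity check is deliberately simplified: it looks at the lead/continuation byte pattern and nothing else. A byte-by-byte reverse is fine for ASCII but leaves a UTF-8 string malformed as soon as it contains a multi-byte character.

```c
#include <stddef.h>
#include <string.h>

/* Reverse a string in place, byte by byte. Correct for ASCII only. */
void reverse_bytes(char *s)
{
    size_t n = strlen(s);
    for (size_t i = 0; i < n / 2; i++) {
        char tmp = s[i];
        s[i] = s[n - 1 - i];
        s[n - 1 - i] = tmp;
    }
}

/* Return 1 if s has a structurally valid UTF-8 byte pattern.
   Simplification: overlong forms and surrogates are not rejected. */
int utf8_well_formed(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        int extra;                      /* continuation bytes expected */
        if (*p < 0x80)                extra = 0;
        else if ((*p & 0xE0) == 0xC0) extra = 1;
        else if ((*p & 0xF0) == 0xE0) extra = 2;
        else if ((*p & 0xF8) == 0xF0) extra = 3;
        else return 0;                  /* stray continuation or bad lead */
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80) return 0;
            p++;
        }
    }
    return 1;
}
```

Reversing "abc" gives "cba" as expected. Reversing the UTF-8 bytes of "café" (the é being C3 A9) puts the continuation byte A9 first, so utf8_well_formed rejects the result.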

I'm just saying it might have been a better idea if those large 
open-ended alphabets had not simply been merged into, and allowed to 
overwhelm, the set of compact alphabets.

-- 
Bartc



#77977

From: BartC <bc@freeuk.com>
Date: 2015-12-06 13:09 +0000
Message-ID: <n41btc$nbu$1@dont-email.me>
In reply to: #77935
On 06/12/2015 02:19, BartC wrote:
> On 06/12/2015 00:32, Keith Thompson wrote:

>> If you want to write software that only deals with ASCII, you're
>> absolutely free to do so, and you can do as much trivial string
>> handling as you like.
>
> Yes, I can, and I can have my own scheme for dealing with the extra
> characters I need. Then it will need conversions for interacting with
> anything else. (But most likely I will end up using a 16-bit scheme that
> can represent the BMP.)

I spent 5 minutes thinking about an alternative to Unicode, and 10 
minutes writing up a first draft, and 10 more minutes for a second draft 
(I won't bore you with the details).

30 minutes to invent a new Unicode; it wasn't hard! (Of course, it might 
need tweaking in actual use...)

In 32-bit form, the two schemes (Unicode, and mine), aren't that 
different in that each character is allocated a dedicated code-point. 
But in mine, the large alphabets are tidily partitioned out of the way. 
A similar concept to code-pages, but 32K characters each and that can 
co-exist in the same text.

-- 
Bartc



#78018

From: Martin Shobe <martin.shobe@yahoo.com>
Date: 2015-12-06 18:38 -0600
Message-ID: <n42k9h$jp2$1@dont-email.me>
In reply to: #77977
On 12/6/2015 7:09 AM, BartC wrote:
> On 06/12/2015 02:19, BartC wrote:
>> On 06/12/2015 00:32, Keith Thompson wrote:
>
>>> If you want to write software that only deals with ASCII, you're
>>> absolutely free to do so, and you can do as much trivial string
>>> handling as you like.
>>
>> Yes, I can, and I can have my own scheme for dealing with the extra
>> characters I need. Then it will need conversions for interacting with
>> anything else. (But most likely I will end up using a 16-bit scheme that
>> can represent the BMP.)
>
> I spent 5 minutes thinking about an alternative to Unicode, and 10
> minutes writing up a first draft, and 10 more minutes for a second draft
> (I won't bore you with the details).
>
> 30 minutes to invent a new Unicode; it wasn't hard! (Of course, it might
> need tweaking in actual use...)
>
> In 32-bit form, the two schemes (Unicode, and mine), aren't that
> different in that each character is allocated a dedicated code-point.
> But in mine, the large alphabets are tidily partitioned out of the way.
> A similar concept to code-pages, but 32K characters each and that can
> co-exist in the same text.
>

Can you give a link to it?

Martin Shobe



#78023

From: BartC <bc@freeuk.com>
Date: 2015-12-07 01:55 +0000
Message-ID: <n42or0$op$1@dont-email.me>
In reply to: #78018
On 07/12/2015 00:38, Martin Shobe wrote:
> On 12/6/2015 7:09 AM, BartC wrote:

>> I spent 5 minutes thinking about an alternative to Unicode, and 10
>> minutes writing up a first draft, and 10 more minutes for a second draft
>> (I won't bore you with the details).

>> In 32-bit form, the two schemes (Unicode, and mine), aren't that
>> different in that each character is allocated a dedicated code-point.
>> But in mine, the large alphabets are tidily partitioned out of the way.
>> A similar concept to code-pages, but 32K characters each and that can
>> co-exist in the same text.

> Can you give a link to it?

It was only a dozen or so lines of text!

Anyway I thought about it for another ten or twenty minutes and I have a 
revised scheme (the previous one included non-character escape codes 
within a string which I didn't like). Here's version 3:

* In-memory representation, 32-bit version

* All large alphabets are organised into sets of 64K characters, each of 
which is given an alphabet code (similar to a code-page, but bigger)

* ASCII, small alphabets and symbols fit into a single special alphabet 
of 64K characters, which itself has an alphabet code of zero

* Local character encodings for each alphabet run from 0 to 65535, and 
form the lsw (least significant 16-bit word) of the 32-bit code.

* The msw (most significant word) of the 32-bit code is the alphabet 
code. The complete code forms a unique identifier for the character 
(ignoring the possibility of duplicates). The set of all character codes 
is sparse (not all alphabets will occupy 64K slots)

* Where only one alphabet is known to be in use (alphabet 0 also counts 
as just one), then a 16-bit in-memory encoding can be used. (With a 
similar trick for 8-bit encoding when all character codes are 0 to 255.)

* (This can also be done on a per-string basis, with the alphabet in use 
being an attribute associated with the string.)

* (Possibly, the first 256 codes of alphabet 0, which are really general 
purpose characters, could be repeated at the start of all alphabets. But 
this creates the problem of multiple encodings of these characters.)
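
The bullets above could be sketched in C roughly as follows. This is only an illustration of the layout as described: the type name bchar and all the function names are hypothetical, and nothing here is an actual implementation of the scheme.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t bchar;   /* hypothetical name: one 32-bit character code */

/* Build a code: alphabet code in the msw, local code in the lsw. */
bchar bchar_make(uint16_t alphabet, uint16_t local)
{
    return ((bchar)alphabet << 16) | local;
}

/* Split a code back into its two 16-bit halves. */
uint16_t bchar_alphabet(bchar c) { return (uint16_t)(c >> 16); }
uint16_t bchar_local(bchar c)    { return (uint16_t)(c & 0xFFFF); }

/* Check whether all n characters of s share one alphabet, in which case
   the string could be stored as 16-bit local codes plus one alphabet
   attribute. Writes the first character's alphabet (0 for an empty
   string) to *alphabet and returns 1 if all match, 0 otherwise. */
int bchar_single_alphabet(const bchar *s, size_t n, uint16_t *alphabet)
{
    *alphabet = n ? bchar_alphabet(s[0]) : 0;
    for (size_t i = 1; i < n; i++)
        if (bchar_alphabet(s[i]) != *alphabet)
            return 0;
    return 1;
}
```

Under this layout bchar_make(0, 'A') is just 'A', so alphabet 0 text keeps its ASCII values, while bchar_make(3, 0x1234) gives 0x00031234.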

-- 
Bartc




#78028

From: Malcolm McLean <malcolm.mclean5@btinternet.com>
Date: 2015-12-06 19:14 -0800
Message-ID: <77d7b808-27fc-48aa-b24f-53f9636a6634@googlegroups.com>
In reply to: #78023
On Monday, December 7, 2015 at 1:56:13 AM UTC, Bart wrote:
> On 07/12/2015 00:38, Martin Shobe wrote:
> > On 12/6/2015 7:09 AM, BartC wrote:
> 
> >> I spent 5 minutes thinking about an alternative to Unicode, and 10
> >> minutes writing up a first draft, and 10 more minutes for a second draft
> >> (I won't bore you with the details).
> 
> >> In 32-bit form, the two schemes (Unicode, and mine), aren't that
> >> different in that each character is allocated a dedicated code-point.
> >> But in mine, the large alphabets are tidily partitioned out of the way.
> >> A similar concept to code-pages, but 32K characters each and that can
> >> co-exist in the same text.
> 
> > Can you give a link to it?
> 
> It was only a dozen or so lines of text!
> 
> Anyway I thought about it for another ten or twenty minutes and I have a 
> revised scheme (the previous one included non-character escape codes 
> within a string which I didn't like). Here's version 3:
> 
You've got to consider the users.
For simple English text you need ascii. The rest of Western Europe 
uses extended Latin, and annoyingly it won't quite fit into 8 bits.
Eastern Europe uses Greek characters. Complex English text includes
ascii, extended Latin, and Greek, and a few special symbols not
included in ascii. At that point, we start to have the issue of
what is markup and what is content. Is 1/2 the same content as 
a half symbol? 
You don't usually see Hebrew or Arabic in English texts, unless
they are specifically dealing with Hebrew or Arabic as their
subject, and it's not expected that the general reader will
recognise the symbols. They're also right to left, and Arabic is
cursive. However they have small alphabets. They also have markup
systems for the vowels.
Then you've got minority scripts with small alphabets, and the 
Far Eastern languages with massive character sets, and the Indian
languages. Again, virtually all of the symbols are meaningless
to the average English reader, but it's not usually true the
other way round - Far Eastern and Indian readers are likely to
know the English characters and embed English text in their
documents.
Finally you've got marginal scripts which are very much special
purpose, like Linear B or Klingon. The former is serious but
not used for communication, only for representing a tiny corpus
of archaeologically recovered literature, and the latter is really
just for demonstrating the universality of the encoding system.

Those are your levels of support, from an Anglo-centric 
perspective.



#78077

From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Date: 2015-12-07 13:53 +0000
Message-ID: <87d1ui1i2i.fsf@bsb.me.uk>
In reply to: #78028
Malcolm McLean <malcolm.mclean5@btinternet.com> writes:
<snip>
> You've got to consider the users.
> For simple English text you need ascii. The rest of Western Europe 
> uses extended Latin, and annoyingly it won't quite fit into 8 bits.
> Eastern Europe uses Greek characters. Complex English text includes
> ascii, extended Latin, and Greek, and a few special symbols not
> included in ascii.

You say "You've got to consider the users" but you are not considering
them.  You are classifying texts by language, not by what texts users
want to read or write.  Users in Western Europe often want to use
non-Latin scripts.

<snip>
-- 
Ben.



#78087

From: Malcolm McLean <malcolm.mclean5@btinternet.com>
Date: 2015-12-07 06:31 -0800
Message-ID: <a540c923-a038-4c5a-815d-58004e3e2551@googlegroups.com>
In reply to: #78077
On Monday, December 7, 2015 at 1:53:23 PM UTC, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.mclean5@btinternet.com> writes:
>
> You say "You've got to consider the users" but you are not considering
> them.  You are classifying texts by language, not by what texts users
> want to read or write.  Users in Western Europe often want to use
> non-Latin scripts.
> 
Only Greek, and in the special case where the non-Latin script language
or text is itself the subject of the material.
We don't embed Hebrew or Arabic in normal English texts, as the general
reader is not considered to be sufficiently familiar with the symbols.
The sole exception is the mathematical aleph for infinity. Greek has also
died out, except for mathematical use, but in older writing aimed at
an educated readership, you do often see Greek words. It was assumed
that a gentleman would have been taught Greek at school and it was
insulting to translate. (That's actually still the style guide for Oxford
examination scripts: candidates are told to keep foreign language
quotations in the original.)



#78137

From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Date: 2015-12-07 21:22 +0000
Message-ID: <87a8pm9cnk.fsf@bsb.me.uk>
In reply to: #78087
Malcolm McLean <malcolm.mclean5@btinternet.com> writes:

> On Monday, December 7, 2015 at 1:53:23 PM UTC, Ben Bacarisse wrote:
>> Malcolm McLean <malcolm.mclean5@btinternet.com> writes:
>>
>> You say "You've got to consider the users" but you are not considering
>> them.  You are classifying texts by language, not by what texts users
>> want to read or write.  Users in Western Europe often want to use
>> non-Latin scripts.
>> 
> Only Greek, and in the special case where the non-Latin script language
> or text is itself the subject of the material.

hwɒt ˈplanɪt ˈɑːreɪ juː ɒn?  My previous local authority frequently sent
me notices where at least one of the scripts was unknown to me, and the
university I worked at produced promotional materials in many scripts.
At least ¾ of them were marked ♼, and many—but not all—used special
punctuation (see pages 22–24).

Users in Western Europe do not always use only Latin scripts.

<snip>
-- 
Ben.



#78138

From: Stephen Sprunk <stephen@sprunk.org>
Date: 2015-12-07 15:34 -0600
Message-ID: <n44trl$lgd$1@dont-email.me>
In reply to: #78087
On 07-Dec-15 08:31, Malcolm McLean wrote:
> Ben Bacarisse wrote:
>> You say "You've got to consider the users" but you are not
>> considering them.  You are classifying texts by language, not by
>> what texts users want to read or write.  Users in Western Europe
>> often want to use non-Latin scripts.
> 
> Only Greek, and in the special case where the non-Latin script 
> language or text is itself the subject of the material.

Western Europeans haven't discovered emojis yet?  They don't use
mathematical or scientific symbols?  There are no translators,
immigrants or diplomats who know a non-Latin language?  There are no
schools that teach non-Latin languages?

Western Europe sounds like a backward place compared to the Americas;
maybe the old joke about monolingual people needs to be revised.

> We don't embed Hebrew or Arabic in normal English texts, as the
> general reader is not considered to be sufficiently familiar with the
> symbols.

Are there no people in Western Europe who know Hebrew or Arabic?  I know
they slaughtered several million Jews and shipped millions more to the
US and Israel, so maybe there are none left, but I'm pretty sure I've
read complaints about Muslim immigrants in Western Europe, many of whom
presumably know Arabic, if only for reading the Quran.

> The sole exception is the mathematical aleph for infinity.

Different glyph, different code point, different bidi behavior.
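
For reference, a small sketch of the two alephs. The macro and function names are illustrative. The code points and bidi classes in the comments are taken from the Unicode Character Database; the bidi class itself cannot be computed and would need the UCD tables, so it appears only as a comment.

```c
#include <stddef.h>
#include <stdint.h>

#define HEBREW_LETTER_ALEF 0x05D0u  /* Hebrew block, Bidi_Class R (right-to-left) */
#define ALEF_SYMBOL        0x2135u  /* Letterlike Symbols, Bidi_Class L (left-to-right) */

/* Encode cp as UTF-8 into out (room for 4 bytes required).
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The Hebrew letter encodes as the two bytes D7 90, the mathematical symbol as the three bytes E2 84 B5, so the two never compare equal at the byte level either.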

> Greek has also died out, 

Outside of Greece, yes, aside from immigrants and such--and aside from
Greek script being on the Euro.

> except for mathematical use,

Same glyphs, different code points, different bidi behavior.

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking



#78155

From: Malcolm McLean <malcolm.mclean5@btinternet.com>
Date: 2015-12-07 16:36 -0800
Message-ID: <1649f717-fbd3-4043-beb5-9a23e38f1ab3@googlegroups.com>
In reply to: #78138
On Monday, December 7, 2015 at 9:34:17 PM UTC, Stephen Sprunk wrote:
> On 07-Dec-15 08:31, Malcolm McLean wrote:
> > Ben Bacarisse wrote:
> >> You say "You've got to consider the users" but you are not
> >> considering them.  You are classifying texts by language, not by
> >> what texts users want to read or write.  Users in Western Europe
> >> often want to use non-Latin scripts.
> > 
> > Only Greek, and in the special case where the non-Latin script 
> > language or text is itself the subject of the material.
> 
> Western Europeans haven't discovered emojis yet?  They don't use
> mathematical or scientific symbols?  There are no translators,
> immigrants or diplomats who know a non-Latin language?  There are no
> schools that teach non-Latin languages?
> 
Obviously if you are publishing a primer on Mandarin for English
schools, you will need Chinese text. But that's the exceptional case -
the script or text itself is the subject of the material.
An emoji is a marginal case.
>
> Are there no people in Western Europe who know Hebrew or Arabic?  I know
> they slaughtered several million Jews and shipped millions more to the
> US and Israel, so maybe there are none left, but I'm pretty sure I've
> read complaints about Muslim immigrants in Western Europe, many of whom
> presumably know Arabic, if only for reading the Quran.
>
Hebrew words like "kosher" have entered English, together with much
older words like "allelujah" or "David". But they never appear in
Hebrew script. Similarly Arabic words like "fatwah" or "mujahideen".
That's not true of Greek words. Modern authors will tend to
transliterate, (e.g. hubris) but Victorian authors often kept them 
in Greek script.

The exception, again, is when a Biblical text is itself the
subject of the English work. So a scholarly or Jewish text in
English, on the subject of the Hebrew Bible, may contain embedded
Hebrew script.

 


