

Groups > comp.lang.c > #77629 > unrolled thread

unicode is a fail

Started by: fir <profesor.fir@gmail.com>
First post: 2015-12-02 08:01 -0800
Last post: 2015-12-06 13:45 +0000
Articles 20 on this page of 158 — 25 participants



Contents

  unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 08:01 -0800
    Re: unicode is a fail me <self@example.org> - 2015-12-02 16:12 +0000
      Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 09:09 -0800
    Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 08:18 -0800
      Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 09:07 -0800
        Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 11:21 -0600
          Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 09:40 -0800
          Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 11:22 -0800
            Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 15:59 -0600
              Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 16:25 -0800
                Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 19:47 -0600
            Re: unicode is a fail supercat@casperkitty.com - 2015-12-02 14:38 -0800
              Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 16:26 -0800
                Re: unicode is a fail Tim Rentsch <txr@alumni.caltech.edu> - 2015-12-09 11:33 -0800
                  Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-09 12:21 -0800
          Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 11:28 +0100
            Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 08:50 -0600
              Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 16:38 +0100
                Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 10:01 -0600
              Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-03 09:46 -0800
              Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-04 12:39 +0000
            Re: unicode is a fail supercat@casperkitty.com - 2015-12-03 08:26 -0800
              Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-03 18:42 +0000
                Re: unicode is a fail supercat@casperkitty.com - 2015-12-03 17:14 -0800
                  Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 19:02 -0800
                  Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-04 06:35 +0000
                    Re: unicode is a fail David Thompson <dave.thompson2@verizon.net> - 2015-12-28 05:11 -0500
                  Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 10:24 -0600
              Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-03 22:37 +0000
                Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-04 11:32 +0100
      Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 11:10 -0600
        Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 09:24 -0800
          Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 13:10 -0600
            Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-02 19:45 +0000
              Re: unicode is a fail Ian Collins <ian-news@hotmail.com> - 2015-12-03 09:08 +1300
              Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 14:10 -0600
        Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 11:27 -0800
          Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 15:21 -0600
            Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 15:18 -0800
              Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-04 12:45 +0000
      Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 09:43 -0800
        Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 11:40 -0800
          Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-02 12:19 -0800
        Re: unicode is a fail Nobody <nobody@nowhere.invalid> - 2015-12-02 21:23 +0000
      Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 10:12 +0100
        Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 02:13 -0800
          Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 14:11 +0100
            Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 05:17 -0800
              Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 15:33 +0100
                Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 07:05 -0800
                  Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 16:42 +0100
                    Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 07:58 -0800
        Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-03 10:38 +0000
          Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-03 14:17 +0100
        Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-04 12:54 +0000
          Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-04 14:25 +0100
            Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-04 13:46 +0000
    Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-02 23:24 +0000
      Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-03 00:45 +0000
        Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 20:59 -0600
        Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 19:13 -0800
        Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-03 07:00 +0000
          Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-04 04:45 -0800
            Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-04 18:04 +0000
          Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-04 13:22 +0000
            Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-04 07:35 -0800
            Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-04 19:17 +0000
              Re: unicode is a fail supercat@casperkitty.com - 2015-12-04 11:49 -0800
                Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 15:39 -0600
                  Re: unicode is a fail supercat@casperkitty.com - 2015-12-04 14:19 -0800
                    Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-06 12:57 -0600
                      Re: unicode is a fail supercat@casperkitty.com - 2015-12-06 15:47 -0800
                Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-05 01:13 +0000
                  Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-05 01:59 +0000
                    Re: unicode is a fail David Brown <david.brown@hesbynett.no> - 2015-12-05 17:17 +0100
                    Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-06 06:28 +0000
              Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-04 23:46 +0000
                Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-05 01:04 +0000
                  Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 03:21 -0800
                    Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-05 13:03 -0600
                  Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-05 11:47 +0000
                    Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 04:40 -0800
                      Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-05 13:26 +0000
                        Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-05 13:35 -0600
                          Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-06 02:23 +0000
                            Re: unicode is a fail Udyant Wig <udyantw@gmail.com> - 2015-12-06 16:09 +0530
                      Re: unicode is a fail Xavier <zaz.colmant@free.fr> - 2015-12-05 15:45 +0100
                        Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 07:42 -0800
                    Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-05 16:32 -0800
                      Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 18:11 -0800
                      Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-06 02:19 +0000
                        Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-06 13:09 +0000
                          Re: unicode is a fail Martin Shobe <martin.shobe@yahoo.com> - 2015-12-06 18:38 -0600
                            Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-07 01:55 +0000
                              Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-06 19:14 -0800
                                Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-07 13:53 +0000
                                  Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-07 06:31 -0800
                                    Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-07 21:22 +0000
                                    Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-07 15:34 -0600
                                      Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-07 16:36 -0800
                                      Re: unicode is a fail Lowell Gilbert <lgusenet@be-well.ilk.org> - 2015-12-08 11:40 -0500
                                        Re: unicode is a fail Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-08 17:18 +0000
                                          Re: unicode is a fail "Osmium" <r124c4u102@comcast.net> - 2015-12-09 08:36 -0600
                                            Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-09 10:06 -0600
                                            Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-09 09:35 -0800
                                              Re: unicode is a fail supercat@casperkitty.com - 2015-12-09 10:07 -0800
                                                Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-09 12:04 -0800
                                                  Re: unicode is a fail supercat@casperkitty.com - 2015-12-09 12:35 -0800
                                                    Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-09 23:46 +0000
                                                      Re: unicode is a fail supercat@casperkitty.com - 2015-12-09 16:15 -0800
                                                        Re: unicode is a fail glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-10 03:49 +0000
                                                  Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-09 18:12 -0600
                                              Re: unicode is a fail James Kuyper <jameskuyper@verizon.net> - 2015-12-09 13:12 -0500
                                                Re: unicode is a fail Keith Thompson <kst-u@mib.org> - 2015-12-09 12:12 -0800
                                              Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-10 20:48 +0000
                                            Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-09 23:44 +0000
                                              Re: unicode is a fail Robert Wessel <robertwessel2@yahoo.com> - 2015-12-10 01:13 -0600
                                                Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-10 10:39 +0000
                                                  Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-10 03:33 -0800
                                                  Re: unicode is a fail supercat@casperkitty.com - 2015-12-10 06:07 -0800
                                                  Re: unicode is a fail "Osmium" <r124c4u102@comcast.net> - 2015-12-10 08:21 -0600
                                            Re: unicode is a fail Robert Wessel <robertwessel2@yahoo.com> - 2015-12-10 00:59 -0600
                                Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-07 14:33 +0000
                              Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-06 22:45 -0600
                                Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-07 12:38 +0000
                                  Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-07 13:55 -0600
                                    Re: unicode is a fail BartC <bc@freeuk.com> - 2015-12-07 21:14 +0000
                                      Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-07 16:50 -0600
                              Re: unicode is a fail Robert Wessel <robertwessel2@yahoo.com> - 2015-12-07 02:38 -0600
                    Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-06 07:34 +0000
                      Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-06 00:24 -0800
                Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 19:49 -0600
              Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-05 21:32 +0000
                Re: unicode is a fail Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 13:50 -0800
                  Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-05 22:15 +0000
                    Re: unicode is a fail James Kuyper <jameskuyper@verizon.net> - 2015-12-05 17:27 -0500
                      Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-05 23:06 +0000
                        Re: unicode is a fail James Kuyper <jameskuyper@verizon.net> - 2015-12-05 18:29 -0500
                          Re: unicode is a fail Richard Heathfield <rjh@cpax.org.uk> - 2015-12-05 23:50 +0000
                    Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-06 06:38 +0000
                      Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-06 13:33 +0000
                Re: unicode is a fail James Kuyper <jameskuyper@verizon.net> - 2015-12-05 16:51 -0500
                Re: unicode is a fail Ian Collins <ian-news@hotmail.com> - 2015-12-06 10:59 +1300
                  Re: unicode is a fail Ian Collins <ian-news@hotmail.com> - 2015-12-06 11:00 +1300
                Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-06 06:31 +0000
      Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-02 17:48 -0800
        Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-03 01:20 -0800
          Re: unicode is a fail fir <profesor.fir@gmail.com> - 2015-12-03 02:02 -0800
      Re: unicode is a fail Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 09:43 -0600
      Re: unicode is a fail raltbos@xs4all.nl (Richard Bos) - 2015-12-04 12:55 +0000
        Re: unicode is a fail Steve Thompson <stevet810@gmail.com> - 2015-12-04 18:29 +0000
          Re: unicode is a fail Jorgen Grahn <grahn+nntp@snipabacken.se> - 2015-12-05 16:42 +0000
      Re: unicode is a fail Jorgen Grahn <grahn+nntp@snipabacken.se> - 2015-12-05 10:06 +0000
        OT: Usenet (Was: unicode is a fail) Steve Thompson <stevet810@gmail.com> - 2015-12-05 20:41 +0000
          Re: OT: Usenet (Was: unicode is a fail) Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 13:18 -0800
        Re: unicode is a fail Udyant Wig <udyantw@gmail.com> - 2015-12-06 10:21 +0530
          OT: Facebook (was Re: unicode is a fail) Jorgen Grahn <grahn+nntp@snipabacken.se> - 2015-12-06 08:51 +0000
            Re: OT: Facebook (was Re: unicode is a fail) raltbos@xs4all.nl (Richard Bos) - 2015-12-06 13:45 +0000



#77629 — unicode is a fail

From: fir <profesor.fir@gmail.com>
Date: 2015-12-02 08:01 -0800
Subject: unicode is a fail
Message-ID: <fbcae10f-7fc6-4a1e-90d7-ea4925016e47@googlegroups.com>

I'm personally still using ASCII in all my private apps, and I shiver (a bit) at the idea of using Unicode, because from time to time I read texts saying that Unicode is a pain (at least in some situations).

This leads me to think that Unicode is, in general, a fail. Unicode could have become something maybe even simpler than ASCII, but it went a bit in the wrong direction and created a lot of additional mess.

I think, then, that maybe one possible recovery scenario is to use damn UTF-32 only, everywhere you can, and try to forget and deprecate the rest of the mess.

What do you think?



#77630

From: me <self@example.org>
Date: 2015-12-02 16:12 +0000
Message-ID: <n3n5ab$6g3$1@speranza.aioe.org>
In reply to: #77629

On 2015-12-02, fir <profesor.fir@gmail.com> wrote:
> Im personally still using asci in all my private apps and i shiver (a
> bit) to use unicode as i read from time to time text that says unicode
> is a pain (at least in some situations) 

…good for you, pal.

> what do ya think? 

Nice trolling attempt.



#77637

From: fir <profesor.fir@gmail.com>
Date: 2015-12-02 09:09 -0800
Message-ID: <bdc0f137-e299-4eb6-96e7-4c9e5c8b9051@googlegroups.com>
In reply to: #77630

On Wednesday, 2 December 2015 at 17:13:18 UTC+1, me wrote:
> On 2015-12-02, fir <profesor.fir@gmail.com> wrote:
> > Im personally still using asci in all my private apps and i shiver (a
> > bit) to use unicode as i read from time to time text that says unicode
> > is a pain (at least in some situations) 
> 
> …good for you, pal.
> 
> > what do ya think? 
> 
> Nice trolling attempt.

Get wiser, fella. (Don't tell me you want to reach the level of intelligence of the well-known non-trolls, which is how I assume the troll-insulters try to present themselves here, in the general aura of their troll-insulting stupidity.)



#77631

From: Malcolm McLean <malcolm.mclean5@btinternet.com>
Date: 2015-12-02 08:18 -0800
Message-ID: <22fccd29-addc-4070-8d1d-c3f876f5f12e@googlegroups.com>
In reply to: #77629

On Wednesday, December 2, 2015 at 4:02:14 PM UTC, fir wrote:
> Im personally still using asci in all my private apps and i shiver (a bit) to use unicode as i read 
> from time to time text that says unicode is a pain (at least in some situations) 
> 
> This directs me to think that unicode is in general a fail.. Unicode could go the way and 
> become something maybe even simpler than ascii but gone a bit in a wrong way of making
>  a lot additional mess 
> 
> I thing then that maybe one posible recovery scenerio is to use damn utf-32 only, everywhere
>  you coud and try to forget and deprecate the other part of the mess 
> 
> what do ya think?
>
If ascii had never achieved any traction outside of North America, then I think there would
be a strong case for UTF-32. Reality is that there are masses and masses of ascii interfaces
around, and it would be a nightmare job to track them all down and either rip them out
or write little adapter functions to make them talk to the rest of the world in UTF-32.

UTF-8 is the best compromise. But there are some problems that are very hard to avoid,
like supporting archaic ash and thorn in English (mediaeval, ye olde coffee shoppe),
when half the population think the latter is a y as in yellow. 



#77636

From: fir <profesor.fir@gmail.com>
Date: 2015-12-02 09:07 -0800
Message-ID: <9d6e662f-e8eb-4f76-bc92-6d04d7b3eba0@googlegroups.com>
In reply to: #77631

On Wednesday, 2 December 2015 at 17:18:54 UTC+1, Malcolm McLean wrote:
> On Wednesday, December 2, 2015 at 4:02:14 PM UTC, fir wrote:
> > Im personally still using asci in all my private apps and i shiver (a bit) to use unicode as i read 
> > from time to time text that says unicode is a pain (at least in some situations) 
> > 
> > This directs me to think that unicode is in general a fail.. Unicode could go the way and 
> > become something maybe even simpler than ascii but gone a bit in a wrong way of making
> >  a lot additional mess 
> > 
> > I thing then that maybe one posible recovery scenerio is to use damn utf-32 only, everywhere
> >  you coud and try to forget and deprecate the other part of the mess 
> > 
> > what do ya think?
> >
> If ascii had never achieved any traction outside of North America, then I think there would
> be a strong case for UTF-32. Reality is that there are masses and masses of ascii interfaces
> around, and it would be a nightmare job to track them all down and either rip them out
> or write little adapter functions to make them talk to the rest of the world in UTF-32.
> 
> UTF-8 is the best compromise. But there are some problem that are very hard to avoid.,
> like supporting archaic ash and thorn in English (mediaeval, ye olde coffee shoppe),
> when half the population think the latter is a y as in yellow.

I'm not sure that UTF-8 is a good compromise overall. ASCII is simple and UTF-32 is simple (I hope; I don't know the deep details), so maybe that interfacing wouldn't be so hard. It should be trivial at the binary level, and that is a big value: the old-school value that is lost when you use UTF-8 and need to rely on external libraries rather than writing your own routines by hand when needed.
(Still, I'm not sure; it depends on whether UTF-32 has no weird glitches and whether it really is an easy binary format.)
(There is still the question of LE or BE. I tend to say it should be native in RAM, with probably both formats allowed in files, though with some tendency to favor big endian as the international standard.)



#77639

From: Stephen Sprunk <stephen@sprunk.org>
Date: 2015-12-02 11:21 -0600
Message-ID: <n3n969$k70$1@dont-email.me>
In reply to: #77636

On 02-Dec-15 11:07, fir wrote:
> Malcolm McLean wrote:
>> If ascii had never achieved any traction outside of North America,
>> then I think there would be a strong case for UTF-32. Reality is
>> that there are masses and masses of ascii interfaces around, and it
>> would be a nightmare job to track them all down and either rip them
>> out or write little adapter functions to make them talk to the rest
>> of the world in UTF-32.
>> 
>> UTF-8 is the best compromise. ...
> 
> Im not sure of overal utf-8 is the good compromise, ascii is simple
> utf32 is simple (i hope, dont know deep details) so maybe those
> interfacing wouldnt be so hard (should be binary trivial and thats a
> big value, those oldschool value that is lost when you use utf-8 (and
> need to rely on external libraries rather than writing own routines
> in own hand if need)) (still im not sure depends if utf-32 has no 
> weird glitches and if it is really binary easy format) (there is
> still a wuestion if LE of BE, i tend to say that it should be native
> in ram and probably both format allowed in files, though with some
> tendency to favorize big endian as international standard)

UTF-32's simplicity comes at the cost of embedded NUL characters, so
it's inherently incompatible with all existing C string-handling code.
UTF-8 isn't perfect, but at least it is _usually_ compatible, and it has
the side benefits of being endian-neutral and generally smaller.  UTF-16
takes the worst of both and the best of neither.
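
A minimal sketch of the embedded-NUL problem, assuming UTF-32 text is serialized as big-endian octets and then handed to byte-oriented code such as strlen(). The helper names utf32be_encode and byte_strlen are made up for illustration and do not come from any library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write n code points as big-endian UTF-32 octets,
   followed by one all-zero 32-bit terminator. out needs 4 * (n + 1) bytes.
   Returns the number of payload octets (terminator not counted). */
size_t utf32be_encode(const uint32_t *cps, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[4 * i + 0] = (unsigned char)((cps[i] >> 24) & 0xFF);
        out[4 * i + 1] = (unsigned char)((cps[i] >> 16) & 0xFF);
        out[4 * i + 2] = (unsigned char)((cps[i] >> 8) & 0xFF);
        out[4 * i + 3] = (unsigned char)(cps[i] & 0xFF);
    }
    memset(out + 4 * n, 0, 4);
    return 4 * n;
}

/* Hypothetical helper: what byte-oriented string code sees in that buffer. */
size_t byte_strlen(const unsigned char *bytes)
{
    return strlen((const char *)bytes);
}
```

For the two code points of "hi" (0x68, 0x69) the encoder writes 8 payload octets, yet byte_strlen() reports 0, because the very first octet of the big-endian form is zero.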

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking



#77642

From: fir <profesor.fir@gmail.com>
Date: 2015-12-02 09:40 -0800
Message-ID: <ae1212b2-dd3f-46e3-8745-7ba23971f641@googlegroups.com>
In reply to: #77639

On Wednesday, 2 December 2015 at 18:21:46 UTC+1, Stephen Sprunk wrote:
> On 02-Dec-15 11:07, fir wrote:
> > Malcolm McLean wrote:
> >> If ascii had never achieved any traction outside of North America,
> >> then I think there would be a strong case for UTF-32. Reality is
> >> that there are masses and masses of ascii interfaces around, and it
> >> would be a nightmare job to track them all down and either rip them
> >> out or write little adapter functions to make them talk to the rest
> >> of the world in UTF-32.
> >> 
> >> UTF-8 is the best compromise. ...
> > 
> > Im not sure of overal utf-8 is the good compromise, ascii is simple
> > utf32 is simple (i hope, dont know deep details) so maybe those
> > interfacing wouldnt be so hard (should be binary trivial and thats a
> > big value, those oldschool value that is lost when you use utf-8 (and
> > need to rely on external libraries rather than writing own routines
> > in own hand if need)) (still im not sure depends if utf-32 has no 
> > weird glitches and if it is really binary easy format) (there is
> > still a wuestion if LE of BE, i tend to say that it should be native
> > in ram and probably both format allowed in files, though with some
> > tendency to favorize big endian as international standard)
> 
> UTF-32's simplicity comes at the cost of embedded NUL characters, so
> it's inherently incompatible with all existing C string-handling code.
> UTF-8 isn't perfect, but at least it is _usually_ compatible, and it has
> the side benefits of being endian-neutral and generally smaller.  UTF-16
> takes the worst of both and the best of neither.
> 
On UTF-16 I wouldn't like to speak at all.

On UTF-8: I know, but those advantages and disadvantages have to be weighed against UTF-32's advantages and disadvantages, and my point is that you get UTF-8's advantages at the cost of a general (and thus very heavy) mess.

(And now that UTF-8 and UTF-16 are common, the world is really polluted by this Unicode mess. It is a mess sort of like the various HTML versions with their varying support, and it is probably not worth it. People will get used to it, as people get used to anything, but that doesn't mean a UTF-32 world would not be far better.)

Still, I'm not sure whether on Windows I have an easy way to just unify all my Unicode as UTF-32 (probably not, as they enforce a variant of UTF-16, AFAIK), but I just mean that UTF-32 is the most logical option to me. (blah)



#77648

From: Keith Thompson <kst-u@mib.org>
Date: 2015-12-02 11:22 -0800
Message-ID: <lnoae8sljm.fsf@kst-u.example.com>
In reply to: #77639

Stephen Sprunk <stephen@sprunk.org> writes:
[...]
> UTF-32's simplicity comes at the cost of embedded NUL characters, so
> it's inherently incompatible with all existing C string-handling code.
> UTF-8 isn't perfect, but at least it is _usually_ compatible, and it has
> the side benefits of being endian-neutral and generally smaller.  UTF-16
> takes the worst of both and the best of neither.

UTF-32 has that cost if it's encoded as a sequence of 4 octets per
character.

If wchar_t is 32 bits, then the standard library functions that handle
arrays of wchar_t (wcslen() et al) don't have that problem; only a
32-bit zero is treated as a (wide) null character.
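
As a rough illustration of that point, here is a sketch that assumes an implementation where wchar_t is wider than one byte and CHAR_BIT is 8. The names wide_length and zero_bytes_inside are hypothetical; wide_length mirrors what wcslen() does by comparing whole elements rather than bytes.

```c
#include <stddef.h>
#include <wchar.h>

/* Count wide characters up to the first zero-valued wchar_t.
   Whole elements are compared, so zero bytes inside an element
   do not end the string. */
size_t wide_length(const wchar_t *s)
{
    size_t n = 0;
    while (s[n] != L'\0')
        n++;
    return n;
}

/* Count zero-valued bytes in the object representation of the
   first n elements of s. */
size_t zero_bytes_inside(const wchar_t *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i, zeros = 0;
    for (i = 0; i < n * sizeof *s; i++)
        if (p[i] == 0)
            zeros++;
    return zeros;
}
```

For L"hello", both wide_length() and wcslen() give 5, even though every element contains sizeof(wchar_t) - 1 zero bytes on such an implementation.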

On the other hand, MS Windows has 16-bit wchar_t.

On the other other hand, C11 adds char16_t and char32_t.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#77671

From: Stephen Sprunk <stephen@sprunk.org>
Date: 2015-12-02 15:59 -0600
Message-ID: <n3npg5$os5$1@dont-email.me>
In reply to: #77648

On 02-Dec-15 13:22, Keith Thompson wrote:
> Stephen Sprunk <stephen@sprunk.org> writes: [...]
>> UTF-32's simplicity comes at the cost of embedded NUL characters,
>> so it's inherently incompatible with all existing C string-handling
>> code. UTF-8 isn't perfect, but at least it is _usually_ compatible,
>> and it has the side benefits of being endian-neutral and generally
>> smaller.  UTF-16 takes the worst of both and the best of neither.
> 
> UTF-32 has that cost if it's encoded as a sequence of 4 octets per 
> character.

UTF-32 is, by definition, an encoding of exactly one 32-bit code unit
per code point.

Depending on what you mean by "character", though, it may require a
variable number of code points, and there are also code points for
non-characters.  That's the _real_ problem, so UTF-32's alleged fixed
width is misleading.
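
A small sketch of why one UTF-32 unit is not always one "character". It uses a deliberate simplification: only U+0300..U+036F (the Combining Diacritical Marks block) is treated as combining, whereas real segmentation follows the Unicode grapheme cluster rules. The name approx_char_count is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplification (assumption): only U+0300..U+036F counts as combining.
   Real text segmentation needs the full Unicode grapheme cluster rules. */
static int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Hypothetical helper: count the code points that start a new
   user-visible character under the simplification above. */
size_t approx_char_count(const uint32_t *cps, size_t n)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++)
        if (!is_combining_mark(cps[i]))
            count++;
    return count;
}
```

The sequence U+0065 U+0301 ("e" followed by a combining acute accent) occupies two UTF-32 code units but counts as one character here, the same as the precomposed U+00E9.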

> If wchar_t is 32 bits, then the standard library functions that
> handle arrays of wchar_t (wcslen() et al) don't have that problem;
> only a 32-bit zero is treated as a (wide) null character.

That solves the NUL-terminated string issue, but it also means every
function with a string argument or return must be replaced or, worse,
duplicated.  Ouch.

Worse, all that pain doesn't even solve the _real_ problems!

> On the other hand, MS Windows has 16-bit wchar_t.

That violates the C Standard's requirements, but it's also a popular
enough platform that to ignore its problems may be unwise.

> On the other other hand, C11 adds char16_t and char32_t.

That seems like a sop to Microsoft.

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking



#77683

From: Keith Thompson <kst-u@mib.org>
Date: 2015-12-02 16:25 -0800
Message-ID: <lny4dcqsxu.fsf@kst-u.example.com>
In reply to: #77671

Stephen Sprunk <stephen@sprunk.org> writes:
> On 02-Dec-15 13:22, Keith Thompson wrote:
>> Stephen Sprunk <stephen@sprunk.org> writes: [...]
>>> UTF-32's simplicity comes at the cost of embedded NUL characters,
>>> so it's inherently incompatible with all existing C string-handling
>>> code. UTF-8 isn't perfect, but at least it is _usually_ compatible,
>>> and it has the side benefits of being endian-neutral and generally
>>> smaller.  UTF-16 takes the worst of both and the best of neither.
>> 
>> UTF-32 has that cost if it's encoded as a sequence of 4 octets per 
>> character.
>
> UTF-32 is, by definition, an encoding of exactly one 32-bit code unit
> per code point.

Surely that 32-bit code unit can be represented by a sequence of 4
octets.  For example, if I type

    echo hello | iconv -f utf-8 -t utf-32 > hello.utf32

I get a file that represents each character as 4 bytes (and that starts
with a 4-byte BOM).
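
A reader of such a file has to work out the byte order from that BOM. The following is only a sketch, assuming the input is already known to be UTF-32 (the octets FF FE 00 00 could otherwise also be a UTF-16LE BOM followed by U+0000). The enum and function names are made up for illustration.

```c
#include <stddef.h>

enum utf32_order { UTF32_UNKNOWN, UTF32_BE, UTF32_LE };

/* Hypothetical helper: inspect the first four octets of a buffer that is
   assumed to hold UTF-32 and report the byte order its BOM (U+FEFF) implies. */
enum utf32_order utf32_detect_bom(const unsigned char *buf, size_t len)
{
    if (len < 4)
        return UTF32_UNKNOWN;
    if (buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
        return UTF32_BE;   /* 00 00 FE FF */
    if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
        return UTF32_LE;   /* FF FE 00 00 */
    return UTF32_UNKNOWN;  /* no BOM: the order must come from elsewhere */
}
```

By the description above, the file holds the BOM plus six characters ("hello" and a newline), so it should be 28 octets long.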

> Depending on what you mean by "character", though, it may require a
> variable number of code points, and there are also code points for
> non-characters.  That's the _real_ problem, so UTF-32's alleged fixed
> width is misleading.

Yes, I was glossing over that issue.

>> If wchar_t is 32 bits, then the standard library functions that
>> handle arrays of wchar_t (wcslen() et al) don't have that problem;
>> only a 32-bit zero is treated as a (wide) null character.
>
> That solves the NUL-terminated string issue, but it also means every
> function with a string argument or return must be replaced or, worse,
> duplicated.  Ouch.

But that's pretty much already done; wcslen() et al are part of the
standard library.

> Worse, all that pain doesn't even solve the _real_ problems!
>
>> On the other hand, MS Windows has 16-bit wchar_t.
>
> That violates the C Standard's requirements, but it's also a popular
> enough platform that to ignore its problems may be unwise.

Does it?  wchar_t is supposed to be "an integer type whose range of
values can represent distinct codes for all members of the largest
extended character set specified among the supported locales".  A
conforming implementation whose largest extended character set has no
more than 65536 characters could legally use 16-bit wchar_t.  I don't
know whether that applies to Microsoft's implementation.

>> On the other other hand, C11 adds char16_t and char32_t.
>
> That seems like a sop to Microsoft.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#77694

From: Stephen Sprunk <stephen@sprunk.org>
Date: 2015-12-02 19:47 -0600
Message-ID: <n3o6rq$9mt$1@dont-email.me>
In reply to: #77683
On 02-Dec-15 18:25, Keith Thompson wrote:
> Stephen Sprunk <stephen@sprunk.org> writes:
>> On 02-Dec-15 13:22, Keith Thompson wrote:
>>> Stephen Sprunk <stephen@sprunk.org> writes: [...]
>>>> UTF-32's simplicity comes at the cost of embedded NUL characters,
>>>> so it's inherently incompatible with all existing C string-handling
>>>> code. UTF-8 isn't perfect, but at least it is _usually_ compatible,
>>>> and it has the side benefits of being endian-neutral and generally
>>>> smaller.  UTF-16 takes the worst of both and the best of neither.
>>>
>>> UTF-32 has that cost if it's encoded as a sequence of 4 octets per 
>>> character.
>>
>> UTF-32 is, by definition, an encoding of exactly one 32-bit code unit
>> per code point.
> 
> Surely that 32-bit code unit can be represented by a sequence of 4
> octets.  For example, if I type
> 
>     echo hello | iconv -f utf-8 -t utf-32 > hello.utf32
> 
> I get a file that represents each character as 4 bytes (and that starts
> with a 4-byte BOM).

Of course; with UTF-32LE, the 32-bit code unit is represented with one
set of 4 bytes, and with UTF-32BE, the 32-bit code unit is represented
with a _different_ set of 4 bytes, which is why you need the BOM to
distinguish them.

OTOH, a UTF-32LE BOM looks exactly like a UTF-16LE BOM followed by a
NUL, so there's no way to reliably determine which was used to encode
some files.  Oops.  And neither can be reliably distinguished from a
non-UTF-16/32 file that just happens to start with 0xFE 0xFF, a valid
byte sequence in many other encodings (but notably _not_ UTF-8).
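
The ambiguity can be sketched in C as a BOM sniffer. This is a heuristic under stated assumptions, not a reliable detector: the enum and function names are invented here, and a match only means the leading bytes are consistent with that encoding.

```c
#include <stddef.h>

enum bom_guess {
    BOM_NONE,
    BOM_UTF8,                /* EF BB BF */
    BOM_UTF16BE,             /* FE FF */
    BOM_UTF16LE,             /* FF FE, next unit not zero */
    BOM_UTF32BE,             /* 00 00 FE FF */
    BOM_UTF16LE_OR_UTF32LE   /* FF FE 00 00: cannot tell which */
};

/* Guess an encoding from the first n bytes of a buffer.  Longer patterns
   are tested first so the 4-byte case is not swallowed by the 2-byte one. */
enum bom_guess sniff_bom(const unsigned char *p, size_t n)
{
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return BOM_UTF16LE_OR_UTF32LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return BOM_UTF32BE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_UTF16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_UTF16BE;
    return BOM_NONE;
}
```

Even the unambiguous-looking results are only guesses, since a file in some other encoding may begin with the same bytes.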

>>> If wchar_t is 32 bits, then the standard library functions that
>>> handle arrays of wchar_t (wcslen() et al) don't have that problem;
>>> only a 32-bit zero is treated as a (wide) null character.
>>
>> That solves the NUL-terminated string issue, but it also means every
>> function with a string argument or return must be replaced or, worse,
>> duplicated.  Ouch.
> 
> But that's pretty much already done; wcslen() et al are part of the
> standard library.

It's not just the Standard Library; it's every function in every program
or library ever written that takes or returns a string.

For a real-world example, look at the Windows API.

>> Worse, all that pain doesn't even solve the _real_ problems!
>> 
>>> On the other hand, MS Windows has 16-bit wchar_t.
>> 
>> That violates the C Standard's requirements, but it's also a
>> popular enough platform that to ignore its problems may be unwise.
> 
> Does it?  wchar_t is supposed to be "an integer type whose range of 
> values can represent distinct codes for all members of the largest 
> extended character set specified among the supported locales".  A 
> conforming implementation whose largest extended character set has
> no more than 65536 characters could legally use 16-bit wchar_t.  I
> don't know whether that applies to Microsoft's implementation.

Supporting non-BMP characters (i.e. >65536 total) was what drove their
switch from UCS-2 (which complied) to UTF-16 (which doesn't).
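
To make the UCS-2 versus UTF-16 difference concrete, here is a small C sketch of UTF-16 encoding. The function name `utf16_encode` is invented for this example. A BMP code point fits in one 16-bit unit, while anything above U+FFFF needs a surrogate pair, so a single 16-bit wchar_t cannot hold it.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Returns the number of 16-bit code units written to out (1 or 2). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Must be a valid scalar value: in range and not itself a surrogate. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                       /* BMP: one unit */
        return 1;
    }

    cp -= 0x10000;                                   /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));        /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));      /* low surrogate */
    return 2;
}
```

For example, U+00E9 encodes as the single unit 0x00E9, while U+1F600 encodes as the pair 0xD83D 0xDE00.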

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking



#77676

From: supercat@casperkitty.com
Date: 2015-12-02 14:38 -0800
Message-ID: <20019f4f-2d82-4b0c-9144-ce1513139b52@googlegroups.com>
In reply to: #77648
On Wednesday, December 2, 2015 at 1:22:34 PM UTC-6, Keith Thompson wrote:
> On the other other hand, C11 adds char16_t and char32_t.

Neither of which, interestingly enough, counts as a "character" type despite the name.



#77684

From: Keith Thompson <kst-u@mib.org>
Date: 2015-12-02 16:26 -0800
Message-ID: <lntwo0qsvk.fsf@kst-u.example.com>
In reply to: #77676
supercat@casperkitty.com writes:
> On Wednesday, December 2, 2015 at 1:22:34 PM UTC-6, Keith Thompson wrote:
>> On the other other hand, C11 adds char16_t and char32_t.
>
> Neither of which, interestingly enough, counts as a "character" type
> despite the name.

Neither is wchar_t.  char, unsigned char, and signed char are the only
"character types".

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#78278

From: Tim Rentsch <txr@alumni.caltech.edu>
Date: 2015-12-09 11:33 -0800
Message-ID: <kfnsi3bv2lc.fsf@x-alumni2.alumni.caltech.edu>
In reply to: #77684
Keith Thompson <kst-u@mib.org> writes:

> supercat@casperkitty.com writes:
>> On Wednesday, December 2, 2015 at 1:22:34 PM UTC-6, Keith Thompson wrote:
>>> On the other other hand, C11 adds char16_t and char32_t.
>>
>> Neither of which, interestingly enough, counts as a "character" type
>> despite the name.
>
> Neither is wchar_t.  char, unsigned char, and signed char are the only
> "character types".

IIANM, any or all of wchar_t, char16_t, and char32_t can be character
types, depending on the implementation.



#78284

From: Keith Thompson <kst-u@mib.org>
Date: 2015-12-09 12:21 -0800
Message-ID: <lnk2onl6ei.fsf@kst-u.example.com>
In reply to: #78278
Tim Rentsch <txr@alumni.caltech.edu> writes:
> Keith Thompson <kst-u@mib.org> writes:
>> supercat@casperkitty.com writes:
>>> On Wednesday, December 2, 2015 at 1:22:34 PM UTC-6, Keith Thompson wrote:
>>>> On the other other hand, C11 adds char16_t and char32_t.
>>>
>>> Neither of which, interestingly enough, counts as a "character" type
>>> despite the name.
>>
>> Neither is wchar_t.  char, unsigned char, and signed char are the only
>> "character types".
>
> IIANM, any or all of wchar_t, char16_t, and char32_t can be character
> types, depending on the implementation.

You're right, of course.  Any of them is a character type if and
only if it's a typedef for char, unsigned char, or signed char.
(On most implementations, they aren't.)
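
One way to check this on a given implementation is a C11 _Generic sketch like the one below. The macro and function names are invented here. To avoid depending on <uchar.h>, it tests uint_least16_t and uint_least32_t, which C11 defines as the same types as char16_t and char32_t.

```c
#include <stddef.h>   /* wchar_t */
#include <stdint.h>   /* uint_least16_t, uint_least32_t */

/* 1 if the expression has one of the three character types, else 0.
   A typedef that names char, signed char, or unsigned char matches too,
   because _Generic selects on the underlying type. */
#define IS_CHARACTER_TYPE(x) \
    _Generic((x), char: 1, signed char: 1, unsigned char: 1, default: 0)

int wchar_is_character_type(void)
{
    return IS_CHARACTER_TYPE((wchar_t)0);
}

/* char16_t is the same type as uint_least16_t in C11. */
int char16_is_character_type(void)
{
    return IS_CHARACTER_TYPE((uint_least16_t)0);
}

/* char32_t is the same type as uint_least32_t in C11. */
int char32_is_character_type(void)
{
    return IS_CHARACTER_TYPE((uint_least32_t)0);
}
```

Each function is expected to return 0 on typical desktop implementations, but the answer is implementation-defined, which is the point of the exchange above.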

One might reach the conclusion that the standard's definition of the
phrase "character type" is confusing and misleading.  Presumably the
term was defined before wchar_t and friends were added to the language,
and not updated.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



#77721

From: David Brown <david.brown@hesbynett.no>
Date: 2015-12-03 11:28 +0100
Message-ID: <n3p5bv$q3j$1@dont-email.me>
In reply to: #77639
On 02/12/15 18:21, Stephen Sprunk wrote:
> On 02-Dec-15 11:07, fir wrote:
>> Malcolm McLean wrote:
>>> If ascii had never achieved any traction outside of North America,
>>> then I think there would be a strong case for UTF-32. Reality is
>>> that there are masses and masses of ascii interfaces around, and it
>>> would be a nightmare job to track them all down and either rip them
>>> out or write little adapter functions to make them talk to the rest
>>> of the world in UTF-32.
>>>
>>> UTF-8 is the best compromise. ...
>>
>> I'm not sure that UTF-8 is, overall, the good compromise. ASCII is
>> simple; UTF-32 is simple (I hope; I don't know the deep details), so
>> maybe that interfacing wouldn't be so hard. It should be binary
>> trivial, and that's a big value, an old-school value that is lost
>> when you use UTF-8 (and need to rely on external libraries rather
>> than writing your own routines by hand if need be). Still, I'm not
>> sure; it depends on whether UTF-32 has no weird glitches and whether
>> it really is a binary-easy format. There is still the question of LE
>> or BE; I tend to say that it should be native in RAM and probably
>> both formats allowed in files, though with some tendency to favor
>> big endian as the international standard.
> 
> UTF-32's simplicity comes at the cost of embedded NUL characters, so
> it's inherently incompatible with all existing C string-handling code.
> UTF-8 isn't perfect, but at least it is _usually_ compatible, and it has
> the side benefits of being endian-neutral and generally smaller.  UTF-16
> takes the worst of both and the best of neither.
> 

UTF-32 also has endian issues, and while it has one code unit (i.e.,
32-bit number) per code point (i.e., Unicode character), you still don't
have a one-to-one correspondence between code points and glyphs.  So
UTF-32 does not make unicode as easy as ASCII - nor does it make it
fixed length.  For example, é can be made from a single character
U+00E9, or from two characters: e and the combining acute accent U+0301,
which combine to look like é.
You cannot therefore assume that one 32-bit code unit is one character.
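
A small C sketch of that example, with invented names. The counting function is a deliberately crude stand-in that only skips the Combining Diacritical Marks block (U+0300 to U+036F); real grapheme cluster segmentation follows the Unicode text segmentation rules and is far more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Two UTF-32 sequences that both display as "é". */
const uint32_t e_acute_precomposed[] = { 0x00E9 };          /* 1 code unit  */
const uint32_t e_acute_decomposed[]  = { 0x0065, 0x0301 };  /* 2 code units */

/* Count code points outside U+0300..U+036F.  This approximates the number
   of user-perceived characters for simple Latin text only. */
size_t count_non_combining(const uint32_t *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (s[i] < 0x0300 || s[i] > 0x036F)
            count++;
    }
    return count;
}
```

Both arrays yield a count of 1, although one holds a single 32-bit unit and the other holds two, so array length in UTF-32 is not the number of characters a reader sees.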

So using UTF-32 simplifies some aspects of unicode, while keeping some
complications that are inherent in unicode and introducing some of its own.

Thus by far the most common choice for data transfer (files, protocols,
etc.) is UTF-8, while UTF-32 is a common choice for an internal format
within a program (where endian issues are not relevant).  UTF-16 is the
worst of both worlds, and (especially on Windows) is often mixed with
UCS-2 which has limited range and fails with anything outside the BMP.
(Noting, however, that non-BMP characters are rare except for CJK - and
often these scripts use different encodings anyway.)





#77743

From: Stephen Sprunk <stephen@sprunk.org>
Date: 2015-12-03 08:50 -0600
Message-ID: <n3pkmu$kll$1@dont-email.me>
In reply to: #77721
On 03-Dec-15 04:28, David Brown wrote:
> On 02/12/15 18:21, Stephen Sprunk wrote:
>> UTF-32's simplicity comes at the cost of embedded NUL characters,
>> so it's inherently incompatible with all existing C string-handling
>> code. UTF-8 isn't perfect, but at least it is _usually_ compatible,
>> and it has the side benefits of being endian-neutral and generally
>> smaller.  UTF-16 takes the worst of both and the best of neither.
> 
> UTF-32 also has endian issues, and while it has one code unit (i.e., 
> 32-bit number) per code point (i.e., Unicode character), you still
> don't have a one-to-one correspondence between code points and
> glyphs.

ITYM "grapheme clusters" for the latter.

A "glyph" is the visual rendering of a "grapheme" in a certain font, and
a "grapheme cluster" may require multiple glyphs.

For example, "A" in Times and "A" in Helvetica are the same grapheme
(and code point) but different glyphs.  OTOH, Latin "A" and Greek "Α"
are different graphemes (and code points) but typically map to the same
set of glyphs.

> (Noting, however, that non-BMP characters are rare except for CJK

That depends; all of the new emoji are non-BMP, for instance, and many
of us encounter those on a daily basis. I wouldn't call that "rare".

> - and often these scripts use different encodings anyway.)

ShiftJIS still has measurable usage in Japan but is steadily losing
ground to UTF-8.  Despite the PRC govt's mandate that everyone use
GB18030/GB2312, UTF-8 clearly dominates there, same as in ROC and ROK.
It's unclear what DPRK uses--or if an answer is even meaningful.

> ...

The rest of your post seems to be a restatement of what I've already
said in other posts.  Were you trying to collect it all in one place for
the convenience of other readers?

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking



#77752

From: David Brown <david.brown@hesbynett.no>
Date: 2015-12-03 16:38 +0100
Message-ID: <n3pnhe$kf$1@dont-email.me>
In reply to: #77743
On 03/12/15 15:50, Stephen Sprunk wrote:
> On 03-Dec-15 04:28, David Brown wrote:
>> On 02/12/15 18:21, Stephen Sprunk wrote:
>>> UTF-32's simplicity comes at the cost of embedded NUL characters,
>>> so it's inherently incompatible with all existing C string-handling
>>> code. UTF-8 isn't perfect, but at least it is _usually_ compatible,
>>> and it has the side benefits of being endian-neutral and generally
>>> smaller.  UTF-16 takes the worst of both and the best of neither.
>>
>> UTF-32 also has endian issues, and while it has one code unit (i.e., 
>> 32-bit number) per code point (i.e., Unicode character), you still
>> don't have a one-to-one correspondence between code points and
>> glyphs.
> 
> ITYM "grapheme clusters" for the latter.
> 
> A "glyph" is the visual rendering of a "grapheme" in a certain font, and
> a "grapheme cluster" may require multiple glyphs.
> 
> For example, "A" in Times and "A" in Helvetica are the same grapheme
> (and code point) but different glyphs.  OTOH, Latin "A" and Greek "Α"
> are different graphemes (and code points) but typically map to the same
> set of glyphs.

I believe you are correct.  The terminology is complicated, and easy to
get wrong - thanks for the clear explanation.

> 
>> (Noting, however, that non-BMP characters are rare except for CJK
> 
> That depends; all of the new emoji are non-BMP, for instance, and many
> of us encounter those on a daily basis. I wouldn't call that "rare".
> 

I didn't know these had their own unicode points - I have always thought
of them as being combinations of ASCII characters like colon, hyphen,
parenthesis :-)

That's me learned two things in one post - probably a record!

>> - and often these scripts use different encodings anyway.)
> 
> ShiftJIS still has measurable usage in Japan but is steadily losing
> ground to UTF-8.  Despite the PRC govt's mandate that everyone use
> GB18030/GB2312, UTF-8 clearly dominates there, same as in ROC and ROK.
> It's unclear what DPRK uses--or if an answer is even meaningful.
> 
>> ...
> 
> The rest of your post seems to be a restatement of what I've already
> said in other posts.  Were you trying to collect it all in one place for
> the convenience of other readers?
> 

There have been a great many posts in a couple of threads about unicode
just recently - some repetition is inevitable, and I might make a new
post before having read all the other posts.  But I was not directing my
post to you specifically.



#77758

From: Stephen Sprunk <stephen@sprunk.org>
Date: 2015-12-03 10:01 -0600
Message-ID: <n3poro$5v1$1@dont-email.me>
In reply to: #77752
On 03-Dec-15 09:38, David Brown wrote:
> On 03/12/15 15:50, Stephen Sprunk wrote:
>> On 03-Dec-15 04:28, David Brown wrote:
>>> UTF-32 also has endian issues, and while it has one code unit (i.e., 
>>> 32-bit number) per code point (i.e., Unicode character), you still
>>> don't have a one-to-one correspondence between code points and
>>> glyphs.
>>
>> ITYM "grapheme clusters" for the latter.
>>
>> A "glyph" is the visual rendering of a "grapheme" in a certain
>> font, and a "grapheme cluster" may require multiple glyphs.
>> 
>> For example, "A" in Times and "A" in Helvetica are the same
>> grapheme (and code point) but different glyphs.  OTOH, Latin "A"
>> and Greek "Α" are different graphemes (and code points) but
>> typically map to the same set of glyphs.
> 
> I believe you are correct.  The terminology is complicated, and easy
> to get wrong - thanks for the clear explanation.

IMHO, it's not all that complicated; it's just a field that most of us
haven't encountered before.  You're probably familiar with phonemes due
to IPA, and graphemes are the same idea applied to writing.  Sememes are
the same idea again applied to meaning, which is important in Unicode's
Han Unification of CJK scripts.

>>> (Noting, however, that non-BMP characters are rare except for
>>> CJK
>> 
>> That depends; all of the new emoji are non-BMP, for instance, and
>> many of us encounter those on a daily basis. I wouldn't call that
>> "rare".
> 
> I didn't know these had their own unicode points - I have always
> thought of them as being combinations of ASCII characters like colon,
> hyphen, parenthesis :-)

AFAICT, that's the difference between emoticons, e.g. ":)", and emoji,
e.g. "☺️".  The latter are mostly in U+26xx, U+27xx and U+1Fxxx, but
they can be found scattered throughout various other blocks too.

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking



#77768

From: Keith Thompson <kst-u@mib.org>
Date: 2015-12-03 09:46 -0800
Message-ID: <lna8prqvc3.fsf@kst-u.example.com>
In reply to: #77743
Stephen Sprunk <stephen@sprunk.org> writes:
[...]
> A "glyph" is the visual rendering of a "grapheme" in a certain font, and
> a "grapheme cluster" may require multiple glyphs.

And if you filter the text through rot13 before printing it, you can
render unto Caesar.

[...]

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"


