Working efficiently with 32-bit Unicode output streams, locale etc.

Started by	"Morten W. Petersen" <morphex@gmail.com>
First post	2015-11-29 01:06 +0100
Last post	2015-12-02 09:58 -0800
Articles	20 on this page of 210 — 25 participants

  Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-29 01:06 +0100
    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Nobody <nobody@nowhere.invalid> - 2015-11-29 02:01 +0000
      Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-29 03:31 +0100
        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-29 00:09 -0600
        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Robert Wessel <robertwessel2@yahoo.com> - 2015-11-29 00:22 -0600
        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Damon <Richard@Damon-Family.org> - 2015-11-29 14:31 -0500
        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Nobody <nobody@nowhere.invalid> - 2015-11-29 23:51 +0000
          Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 01:21 +0100
            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-11-30 00:41 -0800
            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-30 03:16 -0600
      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Jorgen Grahn <grahn+nntp@snipabacken.se> - 2015-11-29 08:28 +0000
      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-29 02:54 -0600
    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-11-29 16:30 +1300
      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-11-28 23:53 -0800
        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-29 02:23 -0600
          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-11-29 00:30 -0800
            Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 01:33 +0100
              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-11-30 13:54 +1300
                Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 02:03 +0100
                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-11-30 14:15 +1300
                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 02:34 +0100
                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-11-30 14:42 +1300
                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 04:16 +0100
                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-29 20:20 -0600
                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 04:34 +0100
                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-11-30 17:09 +1300
                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 06:17 +0100
                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-11-30 19:44 +1300
                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-29 23:36 -0600
                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 07:39 +0100
                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-30 13:56 -0600
                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-12-01 09:17 +0100
                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 13:40 -0600
                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-12-04 00:34 +0100
                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 16:03 -0800
                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-11-29 23:07 -0800
                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 08:20 +0100
                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-11-29 23:40 -0800
                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 08:48 +0100
                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-11-30 20:52 +1300
                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-11-30 21:04 +1300
                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-11-30 00:34 -0800
                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-30 03:50 -0600
                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-11-30 12:16 +0000
                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-11-30 06:11 -0800
                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-30 13:23 -0600
                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-30 13:18 -0600
                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-11-30 13:23 -0800
                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-11-30 22:32 +0000
                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-11-30 15:10 -0800
                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-30 21:05 -0600
                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-01 12:38 +0000
                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-01 14:43 +0000
                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-01 12:09 -0800
                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-12-02 09:14 +1300
                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-01 12:27 -0800
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-12-02 10:14 +1300
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-01 18:01 -0600
                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-01 20:41 +0000
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-12-01 12:53 -0800
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-01 21:32 +0000
                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-12-01 13:55 -0800
                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. raltbos@xs4all.nl (Richard Bos) - 2015-12-04 10:30 +0000
                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-01 18:46 -0600
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Say, what? <<nothing@nowhere.nohow>> - 2015-12-01 14:07 -0800
                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-01 23:54 +0000
                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Say, what? <<nothing@nowhere.nohow>> - 2015-12-01 17:13 -0800
                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Martin Shobe <martin.shobe@yahoo.com> - 2015-12-01 09:08 -0600
                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-01 20:02 +0000
                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Martin Shobe <martin.shobe@yahoo.com> - 2015-12-01 17:03 -0600
                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-02 00:17 +0000
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-01 16:53 -0800
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Martin Shobe <martin.shobe@yahoo.com> - 2015-12-01 21:17 -0600
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 09:37 -0600
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. James Kuyper <jameskuyper@verizon.net> - 2015-12-02 10:59 -0500
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-02 17:43 +0000
                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 13:22 -0600
                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-12-03 09:32 +1300
                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-02 21:12 +0000
                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-12-03 10:36 +1300
                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-02 22:00 +0000
                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 17:55 -0600
                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-02 17:04 -0800
                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-03 01:11 +0000
                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-12-03 14:19 +1300
                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 23:16 -0600
                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 00:54 -0600
                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 04:07 -0800
                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-03 18:31 +0000
                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Eric Sosman <esosman@comcast-dot-net.invalid> - 2015-12-03 13:59 -0500
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-03 19:45 +0000
                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-03 14:38 -0800
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-03 22:43 +0000
                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-03 12:14 +0000
                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Heathfield <rjh@cpax.org.uk> - 2015-12-03 12:38 +0000
                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-03 13:19 +0000
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 05:54 -0800
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. raltbos@xs4all.nl (Richard Bos) - 2015-12-04 10:50 +0000
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Heathfield <rjh@cpax.org.uk> - 2015-12-03 14:26 +0000
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 09:19 -0600
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-03 16:25 +0100
                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Heathfield <rjh@cpax.org.uk> - 2015-12-03 15:33 +0000
                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-03 16:47 +0100
                                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Heathfield <rjh@cpax.org.uk> - 2015-12-03 16:54 +0000
                                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-03 09:32 -0800
                                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-03 18:53 +0100
                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Steve Thompson <stevet810@gmail.com> - 2015-12-03 19:00 +0000
                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-04 14:07 +0100
                                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Steve Thompson <stevet810@gmail.com> - 2015-12-04 18:41 +0000
                                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-05 16:09 +0100
                                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Steve Thompson <stevet810@gmail.com> - 2015-12-05 21:15 +0000
                                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-06 12:35 +0100
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-12-03 09:02 -0800
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-03 19:12 +0000
                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 16:58 -0600
                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-03 15:47 +0100
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Heathfield <rjh@cpax.org.uk> - 2015-12-03 14:51 +0000
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-03 16:50 +0100
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. raltbos@xs4all.nl (Richard Bos) - 2015-12-04 10:55 +0000
                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 08:56 -0600
                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 05:24 -0800
                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ian Collins <ian-news@hotmail.com> - 2015-12-04 08:49 +1300
                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-03 07:07 -0800
                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 10:27 -0600
                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-03 09:01 -0800
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. fir <profesor.fir@gmail.com> - 2015-12-03 10:16 -0800
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-12-04 01:21 +0100
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 16:42 -0800
                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-04 11:15 +0100
                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-12-08 01:57 +0100
                                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. David Brown <david.brown@hesbynett.no> - 2015-12-08 09:08 +0100
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 09:44 -0600
                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Heathfield <rjh@cpax.org.uk> - 2015-12-04 15:58 +0000
                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 11:43 -0600
                                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Geoff <geoff@invalid.invalid> - 2015-12-04 10:56 -0800
                                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-12-04 11:20 -0800
                                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 15:24 -0600
                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 09:30 -0600
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Heathfield <rjh@cpax.org.uk> - 2015-12-04 15:52 +0000
                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-04 09:07 -0800
                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-04 09:53 -0800
                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-04 10:56 -0800
                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 15:04 -0600
                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-04 21:32 +0000
                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-04 13:38 -0800
                                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 16:13 -0600
                                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-04 16:21 -0800
                                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-04 19:10 -0600
                                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Geoff <geoff@invalid.invalid> - 2015-12-04 19:16 -0800
                                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-04 21:19 -0800
                                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-05 12:44 -0600
                                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-06 09:01 -0800
                                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-06 12:34 -0600
                                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-06 18:32 -0800
                                                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-07 10:43 -0600
                                                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-07 10:02 -0800
                                                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 03:53 -0800
                                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-05 09:39 -0800
                                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2015-12-05 18:36 +0000
                                                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-05 12:26 -0600
                                                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-05 11:36 -0800
                                                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Udyant Wig <udyantw@gmail.com> - 2015-12-06 16:42 +0530
                                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-06 03:59 -0800
                                                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Robert Wessel <robertwessel2@yahoo.com> - 2015-12-07 02:17 -0600
                                                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. supercat@casperkitty.com - 2015-12-07 07:33 -0800
                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. fir <profesor.fir@gmail.com> - 2015-12-03 03:57 -0800
                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-12-04 00:58 +0100
                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-03 01:34 +0000
                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-03 11:38 +0000
                                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-03 14:09 +0000
                                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 10:10 -0600
                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 08:28 -0800
                                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-03 21:33 +0000
                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Heathfield <rjh@cpax.org.uk> - 2015-12-02 21:47 +0000
                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 16:05 -0600
                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-12-02 14:12 -0800
                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-02 22:47 +0000
                                                        Re: Working efficiently with 32-bit Unicode output streams, locale   etc. Ian Collins <ian-news@hotmail.com> - 2015-12-03 14:00 +1300
                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-03 01:38 -0600
                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-03 02:20 -0800
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. raltbos@xs4all.nl (Richard Bos) - 2015-12-04 10:40 +0000
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Nobody <nobody@nowhere.invalid> - 2015-12-03 02:42 +0000
                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Damon <Richard@Damon-Family.org> - 2015-12-01 20:48 -0500
                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-02 12:08 +0000
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 04:21 -0800
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-02 14:05 +0000
                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-12-04 01:31 +0100
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-02 14:23 +0000
                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 08:00 -0800
                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-02 16:49 +0000
                                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 11:50 -0800
                                                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-02 20:02 +0000
                                                        Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 12:31 -0800
                                                          Re: Working efficiently with 32-bit Unicode output streams, locale etc. Ben Bacarisse <ben.usenet@bsb.me.uk> - 2015-12-03 01:43 +0000
                                                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-12-02 09:21 -0800
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Richard Damon <Richard@Damon-Family.org> - 2015-12-02 07:29 -0500
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-12-02 05:47 -0800
                                                Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 11:03 -0600
                                              Re: Working efficiently with 32-bit Unicode output streams, locale etc. BartC <bc@freeuk.com> - 2015-12-02 14:16 +0000
                                                Re: Working efficiently with 32-bit Unicode output streams, locale   etc. Ian Collins <ian-news@hotmail.com> - 2015-12-03 09:56 +1300
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 13:49 -0600
                                            Re: Working efficiently with 32-bit Unicode output streams, locale etc. Philip Lantz <prl@canterey.us> - 2015-12-02 22:11 -0800
                                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-12-02 15:06 -0600
                      Re: Working efficiently with 32-bit Unicode output streams, locale etc. Jorgen Grahn <grahn+nntp@snipabacken.se> - 2015-11-30 22:14 +0000
              Re: Working efficiently with 32-bit Unicode output streams, locale etc. Stephen Sprunk <stephen@sprunk.org> - 2015-11-29 23:03 -0600
                Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-30 06:26 +0100
                  Re: Working efficiently with 32-bit Unicode output streams, locale etc. Keith Thompson <kst-u@mib.org> - 2015-11-30 00:39 -0800
                    Re: Working efficiently with 32-bit Unicode output streams, locale etc. Malcolm McLean <malcolm.mclean5@btinternet.com> - 2015-11-30 01:57 -0800
        Re: Working efficiently with 32-bit Unicode output streams, locale etc. "Morten W. Petersen" <morphex@gmail.com> - 2015-11-29 15:32 +0100
    Re: Working efficiently with 32-bit Unicode output streams, locale etc. fir <profesor.fir@gmail.com> - 2015-12-02 09:58 -0800


#77444 — Re: Working efficiently with 32-bit Unicode output streams, locale etc.

From	Ian Collins <ian-news@hotmail.com>
Date	2015-11-30 21:04 +1300
Subject	Re: Working efficiently with 32-bit Unicode output streams, locale etc.
Message-ID	<dc2e82Fi96mU2@mid.individual.net>
In reply to	#77443

Ian Collins wrote:
> Morten W. Petersen wrote:
>>
>> One is a lot simpler than the other..  I like simple.  And given
>> that UTF8 and UTF32 streams are roughly the same size compressed,
>> and compression is cheap and available - doesn't that make UTF-32
>> a little bit simpler and more politically correct?
>
> Does it matter is no one uses it?

s/is/if/

-- 
Ian Collins


#77447

From	Malcolm McLean <malcolm.mclean5@btinternet.com>
Date	2015-11-30 00:34 -0800
Message-ID	<37a35f56-fb1a-4501-bec3-8266b4e0a41b@googlegroups.com>
In reply to	#77442

On Monday, November 30, 2015 at 7:48:19 AM UTC, Morten W. Petersen wrote:
> On 30.11.2015 08:40, Malcolm McLean wrote:
>
> Hm yes.  Then again, to get one Unicode character from a UTF-8 stream,
> you first have to read it, check it, and expand it if necessary.
> 
> To get one Unicode character from a UTF-32 stream, you read 4 bytes
> and add them up.
> 
Yes, but it's one small routine. Once it's written and debugged, that's
it. Problem solved. Speed is unlikely to be an issue.
>
> One is a lot simpler than the other..  I like simple.  And given
> that UTF8 and UTF32 streams are roughly the same size compressed,
> and compression is cheap and available - doesn't that make UTF-32
> a little bit simpler and more politically correct?
> 
The advantage of UTF-8 is that it's backwards compatible, and that
on a system where memory for strings is an issue, it's likely that
most of those strings are a mixture of English and programming symbols,
so the coding takes up one byte per character whilst still offering
facilities for occasional extended characters. The other advantage is
that, in an effort to be all things to all men, UTF-16 and UTF-32
allowed either byte ordering, which is just a nuisance to readers.

The political objections are really without merit. Hindu culture
is served, not hindered, by having one standard coding that works,
and supports Indian scripts. If internationalisation fails because
of too many incompatible Unicode standards, then you'll find that
English-only systems persist for longer.
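
A minimal sketch of the "one small routine" discussed above, next to the UTF-32 read it is being compared with. The function names `utf8_decode` and `utf32be_read` are hypothetical, and the UTF-32 reader assumes a big-endian stream with no BOM handling. The UTF-8 side is longer mainly because it rejects truncated, overlong, and surrogate sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s, where n bytes are available.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * (Hypothetical helper; not taken from any post in this thread.) */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { c = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray trailing byte or invalid lead */

    if (len > n) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a trailing byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min_for_len[len]) return 0;        /* overlong form */
    if (c > 0x10FFFF) return 0;                /* beyond Unicode */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate */
    *cp = c;
    return len;
}

/* Read one code point from a big-endian UTF-32 buffer of at least
 * 4 bytes. Byte order is an assumption the caller has to get right. */
uint32_t utf32be_read(const unsigned char *s)
{
    return ((uint32_t)s[0] << 24) | ((uint32_t)s[1] << 16) |
           ((uint32_t)s[2] << 8)  |  (uint32_t)s[3];
}
```

Calling `utf8_decode` on the three bytes E2 82 AC yields U+20AC and a return value of 3. The UTF-32 reader is shorter, but it only works if the byte order is known in advance, which is the nuisance mentioned above.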


#77452

From	Stephen Sprunk <stephen@sprunk.org>
Date	2015-11-30 03:50 -0600
Message-ID	<n3h61a$v9h$1@dont-email.me>
In reply to	#77439

On 30-Nov-15 01:07, Malcolm McLean wrote:
> Morten W. Petersen wrote:
>> On 30.11.2015 02:15, Ian Collins wrote:
>> Well, let's say you have some organization that wants to create an 
>> archive of lots of non-latin history, in XML.
>> 
>> For them, choosing XML is right, and UTF-8 uses 3 bytes on
>> characters U+0800 through U+FFFF, but only 2 bytes in UTF-16.
>> ...
>> As for the rest of the UTF-8 vs 16 and 32 debate, look at the
>> earlier discussion on comp.lang.c.
> 
> The debate isn't entirely over.

Yes, it is.  Even for scripts where UTF-8 results in more bytes than
UTF-16, UTF-8 has become the dominant choice by users, and that shift is
accelerating, not declining--much less reversing.

So, the politicians can complain all they want, but the people they
claim to represent clearly disagree with them.

> Some Indians (Hindu, not red) don't like UTF-8 because Indian
> characters are represented by longer sequences, which they see as
> giving second status to their culture.

Let's see what all is in that range:

U+0800..U+083F	Samaritan
U+0840..U+085F	Mandaic
U+08A0..U+08FF	Arabic Extended-A
U+0900..U+097F	Devanagari
U+0980..U+09FF	Bengali
U+0A00..U+0A7F	Gurmukhi
U+0A80..U+0AFF	Gujarati
U+0B00..U+0B7F	Oriya
U+0B80..U+0BFF	Tamil
U+0C00..U+0C7F	Telugu
U+0C80..U+0CFF	Kannada
U+0D00..U+0D7F	Malayalam
U+0D80..U+0DFF	Sinhala
U+0E00..U+0E7F	Thai
U+0E80..U+0EFF	Lao
U+0F00..U+0FFF	Tibetan
U+1000..U+109F	Myanmar
U+10A0..U+10FF	Georgian
U+1100..U+11FF	Hangul Jamo
U+1200..U+137F	Ethiopic
U+1380..U+139F	Ethiopic Supplement
U+13A0..U+13FF	Cherokee
U+1400..U+167F	Unified Canadian Aboriginal Syllabics
U+1680..U+169F	Ogham
U+16A0..U+16FF	Runic
U+1700..U+171F	Tagalog
U+1720..U+173F	Hanunoo
U+1740..U+175F	Buhid
U+1760..U+177F	Tagbanwa
U+1780..U+17FF	Khmer
U+1800..U+18AF	Mongolian
U+18B0..U+18FF	Unified Canadian Aboriginal Syllabics Extended
U+1900..U+194F	Limbu
U+1950..U+197F	Tai Le
U+1980..U+19DF	New Tai Lue
U+19E0..U+19FF	Khmer Symbols
U+1A00..U+1A1F	Buginese
U+1A20..U+1AAF	Tai Tham
U+1AB0..U+1AFF	Combining Diacritical Marks Extended
U+1B00..U+1B7F	Balinese
U+1B80..U+1BBF	Sundanese
U+1BC0..U+1BFF	Batak
U+1C00..U+1C4F	Lepcha
U+1C50..U+1C7F	Ol Chiki
U+1CC0..U+1CCF	Sundanese Supplement
U+1CD0..U+1CFF	Vedic Extensions
U+1D00..U+1D7F	Phonetic Extensions
U+1D80..U+1DBF	Phonetic Extensions Supplement
U+1DC0..U+1DFF	Combining Diacritical Marks Supplement
U+1E00..U+1EFF	Latin Extended Additional
U+1F00..U+1FFF	Greek Extended
U+2000..U+206F	General Punctuation
U+2070..U+209F	Superscripts and Subscripts
U+20A0..U+20CF	Currency Symbols
U+20D0..U+20FF	Combining Diacritical Marks for Symbols
U+2100..U+214F	Letterlike Symbols
U+2150..U+218F	Number Forms
U+2190..U+21FF	Arrows
U+2200..U+22FF	Mathematical Operators
U+2300..U+23FF	Miscellaneous Technical
U+2400..U+243F	Control Pictures
U+2440..U+245F	Optical Character Recognition
U+2460..U+24FF	Enclosed Alphanumerics
U+2500..U+257F	Box Drawing
U+2580..U+259F	Block Elements
U+25A0..U+25FF	Geometric Shapes
U+2600..U+26FF	Miscellaneous Symbols
U+2700..U+27BF	Dingbats
U+27C0..U+27EF	Miscellaneous Mathematical Symbols-A
U+27F0..U+27FF	Supplemental Arrows-A
U+2800..U+28FF	Braille Patterns
U+2900..U+297F	Supplemental Arrows-B
U+2980..U+29FF	Miscellaneous Mathematical Symbols-B
U+2A00..U+2AFF	Supplemental Mathematical Operators
U+2B00..U+2BFF	Miscellaneous Symbols and Arrows
U+2C00..U+2C5F	Glagolitic
U+2C60..U+2C7F	Latin Extended-C
U+2C80..U+2CFF	Coptic
U+2D00..U+2D2F	Georgian Supplement
U+2D30..U+2D7F	Tifinagh
U+2D80..U+2DDF	Ethiopic Extended
U+2DE0..U+2DFF	Cyrillic Extended-A
U+2E00..U+2E7F	Supplemental Punctuation
U+2E80..U+2EFF	CJK Radicals Supplement
U+2F00..U+2FDF	Kangxi Radicals
U+2FF0..U+2FFF	Ideographic Description Characters
U+3000..U+303F	CJK Symbols and Punctuation
U+3040..U+309F	Hiragana
U+30A0..U+30FF	Katakana
U+3100..U+312F	Bopomofo
U+3130..U+318F	Hangul Compatibility Jamo
U+3190..U+319F	Kanbun
U+31A0..U+31BF	Bopomofo Extended
U+31C0..U+31EF	CJK Strokes
U+31F0..U+31FF	Katakana Phonetic Extensions
U+3200..U+32FF	Enclosed CJK Letters and Months
U+3300..U+33FF	CJK Compatibility
U+3400..U+4DBF	CJK Unified Ideographs Extension A
U+4DC0..U+4DFF	Yijing Hexagram Symbols
U+4E00..U+9FFF	CJK Unified Ideographs
U+A000..U+A48F	Yi Syllables
U+A490..U+A4CF	Yi Radicals
U+A4D0..U+A4FF	Lisu
U+A500..U+A63F	Vai
U+A640..U+A69F	Cyrillic Extended-B
U+A6A0..U+A6FF	Bamum
U+A700..U+A71F	Modifier Tone Letters
U+A720..U+A7FF	Latin Extended-D
U+A800..U+A82F	Syloti Nagri
U+A830..U+A83F	Common Indic Number Forms
U+A840..U+A87F	Phags-pa
U+A880..U+A8DF	Saurashtra
U+A8E0..U+A8FF	Devanagari Extended
U+A900..U+A92F	Kayah Li
U+A930..U+A95F	Rejang
U+A960..U+A97F	Hangul Jamo Extended-A
U+A980..U+A9DF	Javanese
U+A9E0..U+A9FF	Myanmar Extended-B
U+AA00..U+AA5F	Cham
U+AA60..U+AA7F	Myanmar Extended-A
U+AA80..U+AADF	Tai Viet
U+AAE0..U+AAFF	Meetei Mayek Extensions
U+AB00..U+AB2F	Ethiopic Extended-A
U+AB30..U+AB6F	Latin Extended-E
U+AB70..U+ABBF	Cherokee Supplement
U+ABC0..U+ABFF	Meetei Mayek
U+AC00..U+D7AF	Hangul Syllables
U+D7B0..U+D7FF	Hangul Jamo Extended-B
U+D800..U+DB7F	High Surrogates
U+DB80..U+DBFF	High Private Use Surrogates
U+DC00..U+DFFF	Low Surrogates
U+E000..U+F8FF	Private Use Area
U+F900..U+FAFF	CJK Compatibility Ideographs
U+FB00..U+FB4F	Alphabetic Presentation Forms
U+FB50..U+FDFF	Arabic Presentation Forms-A
U+FE00..U+FE0F	Variation Selectors
U+FE10..U+FE1F	Vertical Forms
U+FE20..U+FE2F	Combining Half Marks
U+FE30..U+FE4F	CJK Compatibility Forms
U+FE50..U+FE6F	Small Form Variants
U+FE70..U+FEFF	Arabic Presentation Forms-B
U+FF00..U+FFEF	Halfwidth and Fullwidth Forms
U+FFF0..U+FFFF	Specials

Looks like there is a _lot_ more in there than just India.

Also, there was never a decision to make the above second-class; the
blocks were assigned back in the UCS-2 days (prior to the invention of
UTF-8) in the order that encoding for each script was standardized, and
nobody complained at the time.  If the Unicode Consortium had been aware
of the UTF-8 length issue, I'm sure they would have put more common
scripts like Devanagari below U+0800 and less common ones like IPA,
Armenian, Syriac, Thaana and NKo above U+0800, but _nobody_ knew.
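
To make the U+0800 boundary concrete, here is a small sketch that reports how many bytes one code point occupies in each encoding. The names `utf8_len` and `utf16_len` are made up for illustration.

```c
#include <stdint.h>

/* Bytes needed to encode cp in UTF-8, or 0 if cp is not encodable.
 * (Hypothetical helper for illustration.) */
int utf8_len(uint32_t cp)
{
    if (cp < 0x80)  return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates */
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Bytes needed to encode cp in UTF-16, or 0 if cp is not encodable. */
int utf16_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates */
    if (cp < 0x10000)   return 2;                /* one 16-bit unit */
    if (cp <= 0x10FFFF) return 4;                /* surrogate pair */
    return 0;
}
```

For U+0915 (DEVANAGARI LETTER KA) this gives 3 bytes in UTF-8 against 2 in UTF-16, while for U+0041 it gives 1 against 2, and anything outside the BMP costs 4 bytes in both.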

> And of course UTF-8 arrays don't easily support random access.

The vast majority of code either treats strings as opaque blobs or
traverses them sequentially.  True random access is extremely rare.

Keep in mind that UTF-16 doesn't allow random access either since it's
also a variable-length encoding, which many people forget about--and
that is a very common cause of UTF-16 bugs.
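
A sketch of what handling surrogates involves, assuming the input is an array of 16-bit code units already in host byte order. The name `utf16_decode` is hypothetical. Code that indexes UTF-16 by unit and skips this step is where the bugs mentioned above come from.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from n available UTF-16 code units.
 * Returns the number of units consumed (1 or 2), or 0 on a lone or
 * mismatched surrogate. (Hypothetical helper for illustration.) */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    if (s[0] < 0xD800 || s[0] > 0xDFFF) {   /* ordinary BMP unit */
        *cp = s[0];
        return 1;
    }
    if (s[0] >= 0xDC00) return 0;           /* low surrogate with no high */
    if (n < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
        return 0;                           /* high surrogate with no low */

    /* Ten payload bits from each half, offset by 0x10000. */
    *cp = 0x10000u + (((uint32_t)(s[0] - 0xD800) << 10)
                      | (uint32_t)(s[1] - 0xDC00));
    return 2;
}
```

The pair D83D DE00 decodes to U+1F600, so a string holding that character has two code units but one code point.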

> And Microsoft has gone the UTF-16 route, as has Java.

Both went the UCS-2 route and found themselves painted into a corner
when additional planes were added.  Relabeling their UCS-2 support as
UTF-16 support was seen as less painful than switching to either UTF-8
or UTF-32/UCS-4, but that hasn't worked out so well in practice.

> But the consensus is moving to UTF-8. Certainly it's my own view
> that the other encoding should be treated as a nuisance, and only 
> converted to at the last moment to interface with systems that
> insist on them.

Agreed.

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking


#77456

From	BartC <bc@freeuk.com>
Date	2015-11-30 12:16 +0000
Message-ID	<n3hejb$ujn$1@dont-email.me>
In reply to	#77452

On 30/11/2015 09:50, Stephen Sprunk wrote:
> On 30-Nov-15 01:07, Malcolm McLean wrote:

>> And of course UTF-8 arrays don't easily support random access.
>
> The vast majority of code either treats strings as opaque blobs or
> traverses them sequentially.  True random access is extremely rare.

Not in my code it isn't. (Suppose you implement a language or even a 
library which allows access to the nth character of a string. You don't 
have any say in whether the user will always access sequentially or at 
random. Which is the best string representation?)

Even with serial access, it's not so easy to iterate over a string to 
access each character (or codepoint etc) in turn if using UTF8. Code 
needs to be UTF8-aware.

Some serial access is also from the end of a string.

Wide-character strings make sense in a program, resorting to UTF8 for 
reading, writing, or dealing with UTF8 APIs like the POSIX you mentioned.

It's also possible to have separate concepts of normal 'string' and a 
'serial-string', with the latter being an opaque type that you can only 
operate on with functions or by treating it as a byte-array. With of 
course conversions between the two.

> Keep in mind that UTF-16 doesn't allow random access either since it's
> also a variable-length encoding, which many people forget about--and
> that is a very common cause of UTF-16 bugs.

You'd choose between 8-bit and 32-bit strings. Except that MS uses 
16-bit (not a big deal, just another conversion. I already have to deal 
with that because most of my strings aren't zero-terminated, but Windows 
and C API string parameters usually are.)

>> And Microsoft has gone the UTF-16 route, as has Java.

I suspect that 16-bit strings would be fine in most cases. (A discussion 
elsewhere was about how it was impossible for a (programming) language 
to be case-insensitive because there is the odd character in one or two 
languages which has mismatched lower and upper case versions.)

-- 
Bartc


#77460

From	Malcolm McLean <malcolm.mclean5@btinternet.com>
Date	2015-11-30 06:11 -0800
Message-ID	<b3311303-0209-4624-b292-e16f470f1cfb@googlegroups.com>
In reply to	#77456

On Monday, November 30, 2015 at 12:17:23 PM UTC, Bart wrote:
> On 30/11/2015 09:50, Stephen Sprunk wrote:
> > On 30-Nov-15 01:07, Malcolm McLean wrote:
> 
> >> And of course UTF-8 arrays don't easily support random access.
> >
> > The vast majority of code either treats strings as opaque blobs or
> > traverses them sequentially.  True random access is extremely rare.
> 
> Not in my code it isn't. (Suppose you implement a language or even a 
> library which allows access to the nth character of a string. You don't 
> have any say in whether the user will always access sequentially or at 
> random. Which is the best string representation?)
> 
> Even with serial access, it's not so easy to iterate over a string to 
> access each character (or codepoint etc) in turn if using UTF8. Code 
> needs to be UTF8-aware.
> 
> Some serial access is also from the end of a string.
> 
> Wide-character strings make sense in a program, resorting to UTF8 for 
> reading, writing, or dealing with UTF8 APIs like the POSIX you mentioned.
> 
> It's also possible to have separate concepts of normal 'string' and a 
> 'serial-string', with the latter being an opaque type that you can only 
> operate on with functions or by treating it as a byte-array. With of 
> course conversions between the two.
> 
Starting with Java, most languages have kept strings immutable, which is
viable as long as strings are short or read-only. UTF-8 is a good match for
that: you can't have random read access, but you don't need to support
random write access.
Functions like wildcard matchers need rewriting for UTF-8. It's actually a bit
of a dangerous situation, as they work on the English test cases programmers
can actually read.

Simple strings don't stand up to heavy use, however. Text editors can't store
text in a simple buffer; you need a linked list of lines, even today.


#77476

From	Stephen Sprunk <stephen@sprunk.org>
Date	2015-11-30 13:23 -0600
Message-ID	<n3i7ik$797$1@dont-email.me>
In reply to	#77460

On 30-Nov-15 08:11, Malcolm McLean wrote:
> Functions like wildcard matchers need rewriting for UTF-8. It's
> actually a bit of a dangerous situation as they work on the English
> test cases programmers can actually read.

Actually, searching for a UTF-8 string within another UTF-8 string is
perfectly safe (as long as neither is overlong), even with naïve code
designed for ASCII.  That was one of the design requirements.

That is also true of UTF-16 (and UTF-32) but is _not_ true of certain
other (pre-Unicode) encodings.
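
A minimal sketch of that property, assuming both strings are valid, NUL-terminated UTF-8. The wrapper name `utf8_find` is hypothetical; the work is done by the ordinary byte-wise `strstr`.

```c
#include <string.h>

/* Byte offset of needle within haystack, or -1 if absent.
 * Both arguments are assumed to be valid, NUL-terminated UTF-8.
 * (Hypothetical wrapper for illustration.) */
long utf8_find(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1L;
}
```

Because lead bytes and trailing bytes use disjoint ranges, and ASCII bytes never appear inside a multi-byte sequence, a match cannot begin or end in the middle of a character. Searching for "e" in "café" (with é encoded as C3 A9) finds nothing, since neither byte of é equals 0x65. Note that the offset returned is in bytes, not code points.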

S


#77474

From	Stephen Sprunk <stephen@sprunk.org>
Date	2015-11-30 13:18 -0600
Message-ID	<n3i793$626$1@dont-email.me>
In reply to	#77456

On 30-Nov-15 06:16, BartC wrote:
> On 30/11/2015 09:50, Stephen Sprunk wrote:
>> On 30-Nov-15 01:07, Malcolm McLean wrote:
>>> And of course UTF-8 arrays don't easily support random access.
>> 
>> The vast majority of code either treats strings as opaque blobs or 
>> traverses them sequentially.  True random access is extremely
>> rare.
> 
> Not in my code it isn't. (Suppose you implement a language or even a 
> library which allows access to the nth character of a string. You
> don't have any say in whether the user will always access
> sequentially or at random. Which is the best string representation?)

Which is "best" depends on how often true random access occurs.  Even
with UTF-8, finding the Nth char (from either end) isn't difficult, and
as long as the strings are of reasonable length, nobody will notice the
slight loss of efficiency from an occasional traverse.

If fast random access _is_ a requirement, then use UTF-32 in memory;
that doesn't mean it's appropriate for a wire/file format, though.

> Even with serial access, it's not so easy to iterate over a string
> to access each character (or codepoint etc) in turn if using UTF8.
>
> Some serial access is also from the end of a string.

Traversing in either direction is trivial thanks to different encoding
of leading vs trailing bytes.  You're far more likely to screw up
traversal (or indexing) of UTF-16 strings by forgetting surrogates.
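
A sketch of both operations, assuming valid, NUL-terminated UTF-8. The names `is_cont`, `utf8_nth` and `utf8_prev` are hypothetical. Everything rests on one test: trailing bytes have the bit pattern 10xxxxxx and lead bytes never do.

```c
#include <stddef.h>

/* A trailing (continuation) byte has the bit pattern 10xxxxxx. */
static int is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Pointer to the start of the n-th code point (0-based) in s, or NULL
 * if s holds fewer than n + 1 code points. Linear in the string length. */
const char *utf8_nth(const char *s, size_t n)
{
    for (; *s; s++) {
        if (!is_cont((unsigned char)*s)) {  /* start of a code point */
            if (n == 0) return s;
            n--;
        }
    }
    return NULL;
}

/* Step back from p to the start of the previous code point,
 * never moving before start. */
const char *utf8_prev(const char *start, const char *p)
{
    if (p == start) return start;
    do {
        p--;
    } while (p > start && is_cont((unsigned char)*p));
    return p;
}
```

This counts code points, which, as noted below, are not necessarily the same thing as characters. If indexing is frequent on long strings, the linear scan is the cost being weighed against UTF-32's fixed width.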

> Code needs to be UTF8-aware.

Only if it actually needs to care about code points, which are _not_
necessarily the same things as "characters".

UTF-8 was _designed_ to be transparent to the vast majority of
string-handling code.

> Wide-character strings make sense in a program, resorting to UTF8
> for reading, writing, or dealing with UTF8 APIs like the POSIX you
> mentioned.

That is a popular option for programs that do unusual types of
operations on strings--or need to accommodate other encodings.

>> Keep in mind that UTF-16 doesn't allow random access either since
>> it's also a variable-length encoding, which many people forget
>> about--and that is a very common cause of UTF-16 bugs.
> 
> You'd choose between 8-bit and 32-bit strings. Except that MS uses 
> 16-bit (not a big deal, just another conversion. I already have to
> deal with that because most of my strings aren't zero-terminated, but
> Windows and C API string parameters usually are.)

If you're on Windows (or Java), then UTF-16 isn't really a choice.

>>> And Microsoft has gone the UTF-16 route, as has Java.
> 
> I suspect that 16-bit strings would be fine in most cases. (A
> discussion elsewhere was about how it was impossible for a
> (programming) language to be case-insensitive because there is the
> odd character in one or two languages which has mismatched lower and
> upper case versions.)

Unicode has all sorts of nasty corners that a program with serious
string-handling will have to deal with.  Some of the more obvious ones are
cases like "Σ", whose lower case is either "σ" or "ς" depending on
position, and "ß", whose upper case is "ẞ", "SS" or "SZ" depending on
context--and is sometimes equal to "ss" (or "ſs") or "sz" (or "ſz") and
sometimes not.  And then there's the precomposed vs combining characters
mess, scripts where letters are used as numerals, languages where the
same code points have different collating orders, and endless other
insanities.  Encoding is the _least_ of your problems.

S


#77482

From	Keith Thompson <kst-u@mib.org>
Date	2015-11-30 13:23 -0800
Message-ID	<lnh9k3w59i.fsf@kst-u.example.com>
In reply to	#77456

BartC <bc@freeuk.com> writes:
[...]
>> On 30-Nov-15 01:07, Malcolm McLean wrote:
[...]
>>> And Microsoft has gone the UTF-16 route, as has Java.
>
> I suspect that 16-bit strings would be fine in most cases. (A discussion 
> elsewhere was about how it was impossible for a (programming) language 
> to be case-insensitive because there is the odd character in one or two 
> languages which has mismatched lower and upper case versions.)

The cases in which "16-bit strings would be fine" are those that only
use characters within the BMP (Basic Multilingual Plane).  Characters
outside the BMP are exactly why UTF-16 (as opposed to UCS-2) exists.

Why support "most cases" when there are already perfectly good ways to
support all cases?

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"


#77486

From	BartC <bc@freeuk.com>
Date	2015-11-30 22:32 +0000
Message-ID	<n3iil7$ld7$1@dont-email.me>
In reply to	#77482

On 30/11/2015 21:23, Keith Thompson wrote:
> BartC <bc@freeuk.com> writes:

>> I suspect that 16-bit strings would be fine in most cases. (A discussion
>> elsewhere was about how it was impossible for a (programming) language
>> to be case-insensitive because there is the odd character in one or two
>> languages which has mismatched lower and upper case versions.)
>
> The cases in which "16-bit strings would be fine" are those that only
> use characters within the BMP (Basic Multilingual Plane).

> Characters
> outside the BMP are exactly why UTF-16 (as opposed to UCS-2) exists.

But how common are they, exactly? I understand that /fully/ supporting 
Unicode is full of problems even using UTF32.

I think not being able to randomly index strings containing ancient 
Etruscan text, would be one of the more minor ones.

Meanwhile I still occasionally come across problems with the 
representation of £ or €; maybe they should fix those first before we 
worry about ancient scripts or rare Chinese ideograms.

> Why support "most cases" when there are already perfectly good ways to
> support all cases?

"most" is likely to be 100% of the examples I'm going to come across in 
my lifetime.

Anyway I didn't say use 16-bits; I recommended 32-bits. I'm just saying 
that if someone does use 16-bits for some throwaway, or personal or 
informal software, I'd be surprised if they came across any of these 
rare characters.

-- 
Bartc


#77489

From	Keith Thompson <kst-u@mib.org>
Date	2015-11-30 15:10 -0800
Message-ID	<lnzixvulqc.fsf@kst-u.example.com>
In reply to	#77486

BartC <bc@freeuk.com> writes:
> On 30/11/2015 21:23, Keith Thompson wrote:
>> BartC <bc@freeuk.com> writes:
>
>>> I suspect that 16-bit strings would be fine in most cases. (A discussion
>>> elsewhere was about how it was impossible for a (programming) language
>>> to be case-insensitive because there is the odd character in one or two
>>> languages which has mismatched lower and upper case versions.)
>>
>> The cases in which "16-bit strings would be fine" are those that only
>> use characters within the BMP (Basic Multilingual Plane).
>
>> Characters
>> outside the BMP are exactly why UTF-16 (as opposed to UCS-2) exists.
>
> But how common are they, exactly? I understand that /fully/ supporting 
> Unicode is full of problems even using UTF32.
>
> I think not being able to randomly index strings containing ancient 
> Etruscan text, would be one of the more minor ones.

If ancient Etruscan were the only language affected, you'd have a valid
point.

Supporting only characters within the BMP presents the same problems as
supporting only 7-bit ASCII, or 8-bit Latin-1 (or Latin-N for any of
several values of N).  It just delays those problems a bit longer.

> Meanwhile I still occasionally come across problems with the 
> representation of £ or €; maybe they should fix those first before we 
> worry about ancient scripts or rare Chinese ideograms.

I agree that such problems should be fixed.  I'm guessing that \243 and
\200 are the Windows-1252 representations of the UK pound sign and the
Euro sign, respectively.  Avoiding Windows-1252 would be at least a
partial solution to that.
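
For comparison, octal \243 is 0xA3 and \200 is 0x80, which are indeed the Windows-1252 bytes for the pound and Euro signs. A sketch of what the same two characters become in UTF-8 follows; the encoder name `utf8_encode` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode cp as UTF-8 into out, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp is not encodable.
 * (Hypothetical helper for illustration.) */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

U+00A3 (pound sign) encodes as C2 A3 and U+20AC (Euro sign) as E2 82 AC. A reader that interprets those bytes as Windows-1252, or the reverse, produces exactly the kind of garbling described above.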

>> Why support "most cases" when there are already perfectly good ways to
>> support all cases?
>
> "most" is likely to be 100% of the examples I'm going to come across in 
> my lifetime.
>
> Anyway I didn't say use 16-bits; I recommended 32-bits. I'm just saying 
> that if someone does use 16-bits for some throwaway, or personal or 
> informal software, I'd be surprised if they came across any of these 
> rare characters.

Supporting characters that don't fit in 8 bits is hard.  Restricting
support to characters that fit in 16 bits doesn't make it that much
easier.


#77490

From	Stephen Sprunk <stephen@sprunk.org>
Date	2015-11-30 21:05 -0600
Message-ID	<n3j2ke$3do$1@dont-email.me>
In reply to	#77486

On 30-Nov-15 16:32, BartC wrote:
> On 30/11/2015 21:23, Keith Thompson wrote:
>> BartC <bc@freeuk.com> writes:
>>> I suspect that 16-bit strings would be fine in most cases. (A
>>> discussion elsewhere was about how it was impossible for a
>>> (programming) language to be case-insensitive because there is
>>> the odd character in one or two languages which has mismatched
>>> lower and upper case versions.)
>> 
>> The cases in which "16-bit strings would be fine" are those that
>> only use characters within the BMP (Basic Multilingual Plane).
>>
>> Characters outside the BMP are exactly why UTF-16 (as opposed to
>> UCS-2) exists.
> 
> But how common are they, exactly?

That depends on who you are and what you're doing.  Some people deal
with non-BMP characters constantly, others only occasionally, but almost
nobody _never_ encounters them anymore.  That's part of what makes
broken UTF-16 code so increasingly painful to deal with.

> I understand that /fully/ supporting Unicode is full of problems
> even using UTF32.

Indeed; encoding is honestly the least of your problems, so just use
UTF-8 like everyone else and move on to the _hard_ stuff.

> Meanwhile I still occasionally come across problems with the 
> representation of £ or €; maybe they should fix those first before
> we worry about ancient scripts or rare Chinese ideograms.

If you're getting mojibake or replacement characters, that is usually
due to folks using some ancient encoding rather than something modern
and sensible, e.g. UTF-8.

>> Why support "most cases" when there are already perfectly good ways
>> to support all cases?
> 
> "most" is likely to be 100% of the examples I'm going to come across
> in my lifetime.

Take a look again at the non-BMP Unicode blocks before you convince
yourself that you'll never see _any_ of them.  I get at least a dozen a
day just counting emojis in text messages, and I correspond daily with
coworkers in China whose very _names_ use non-BMP characters.

> Anyway I didn't say use 16-bits; I recommended 32-bits. I'm just
> saying that if someone does use 16-bits for some throwaway, or
> personal or informal software, I'd be surprised if they came across
> any of these rare characters.

For personal throwaway code, sure.

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking


#77502

From	BartC <bc@freeuk.com>
Date	2015-12-01 12:38 +0000
Message-ID	<n3k48m$9j6$1@dont-email.me>
In reply to	#77490

On 01/12/2015 03:05, Stephen Sprunk wrote:
> On 30-Nov-15 16:32, BartC wrote:

>> I understand that /fully/ supporting Unicode is full of problems
>> even using UTF32.
>
> Indeed; encoding is honestly the least of your problems, so just use
> UTF-8 like everyone else and move on to the _hard_ stuff.

>> Meanwhile I still occasionally come across problems with the
>> representation of £ or €; maybe they should fix those first before
>> we worry about ancient scripts or rare Chinese ideograms.
>
> If you're getting mojibake or replacement characters, that is usually
> due to folks using some ancient encoding rather than something modern
> and sensible, e.g. UTF-8.

This is a typical problem I would get (source code was UTF8):

#include <stdio.h>
#include <string.h>

int main(void) {
char s[]="£100 = €140";
unsigned char c;
int i;

     printf("%s\n",s);

     for (i=0; i<strlen(s); ++i){
         c = s[i];
         printf("%2d: %03d %02X <%c>\n",i,c,c,c);
     }
}

I want to print the individual characters in the string. Compiled with 
gcc, I get (using Windows console set to code page 65001):

£100 = €140
  0: 194 C2 <�>
  1: 163 A3 <�>
  2: 049 31 <1>
  3: 048 30 <0>
  4: 048 30 <0>
  5: 032 20 < >
  6: 061 3D <=>
  7: 032 20 < >
  8: 226 E2 <�>
  9: 130 82 <�>
10: 172 AC <�>
11: 049 31 <1>
12: 052 34 <4>
13: 048 30 <0>

I get 14 'characters' output instead of the 11 I expect. The £ and € 
characters are replaced by sequences of those funny black diamonds (you 
might see some other error character).

Two C compilers even print the first line as:

��100 = ���140

(One or two had a problem with the UTF8 BOM, which I had to remove.)

This is basic stuff. And I'm doing serial access on the string, not random.

So much for the majority of programs being able to work unchanged with 
UTF8! I'd need to start going into the multi-byte and wide char stuff. 
On Windows, that means UCS2 or UTF16 or whatever it is now, which 
apparently isn't good enough either.

-- 
Bartc


#77513

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2015-12-01 14:43 +0000
Message-ID	<87a8pudybv.fsf@bsb.me.uk>
In reply to	#77502

BartC <bc@freeuk.com> writes:

> On 01/12/2015 03:05, Stephen Sprunk wrote:
>> On 30-Nov-15 16:32, BartC wrote:
>
>>> I understand that /fully/ supporting Unicode is full of problems
>>> even using UTF32.
>>
>> Indeed; encoding is honestly the least of your problems, so just use
>> UTF-8 like everyone else and move on to the _hard_ stuff.
>
>>> Meanwhile I still occasionally come across problems with the
>>> representation of £ or €; maybe they should fix those first before
>>> we worry about ancient scripts or rare Chinese ideograms.
>>
>> If you're getting mojibake or replacement characters, that is usually
>> due to folks using some ancient encoding rather than something modern
>> and sensible, e.g. UTF-8.
>
> This is a typical problem I would get (source code was UTF8):
>
> #include <stdio.h>
> #include <string.h>
>
> int main(void) {
> char s[]="£100 = €140";
> unsigned char c;
> int i;
>
>     printf("%s\n",s);
>
>     for (i=0; i<strlen(s); ++i){
>         c = s[i];
>         printf("%2d: %03d %02X <%c>\n",i,c,c,c);
>     }
> }
>
> I want to print the individual characters in the string. Compiled with
> gcc, I get (using Windows console set to code page 65001):
>
> £100 = €140
>  0: 194 C2 <�>
>  1: 163 A3 <�>
>  2: 049 31 <1>
>  3: 048 30 <0>
>  4: 048 30 <0>
>  5: 032 20 < >
>  6: 061 3D <=>
>  7: 032 20 < >
>  8: 226 E2 <�>
>  9: 130 82 <�>
> 10: 172 AC <�>
> 11: 049 31 <1>
> 12: 052 34 <4>
> 13: 048 30 <0>
>
> I get 14 'characters' output instead of the 11 I expect. The £ and €
> characters are replaced by sequences of those funny black diamonds
> (you might see some other error character).

Nothing wrong there.  Your expectation is a little bit off, but
everything seems to be working.  It's a bit of a coincidence that it
works, but UTF-8 often does "just work" like this.  To make it work by
design you need to tell the compiler that the string is UTF-8 (u8"£100 =
€140") and you might need to set a suitable locale for the output.

> Two C compilers even print the first line as:
>
> ��100 = ���140

They probably don't understand UTF-8 source.  If you have the luxury of
a C11 compiler you can use u8"...".  If you have C99 you can use
universal character names (e.g. \u00a3 for the pound sign).

> (One or two had a problem with the UTF8 BOM, which I had to remove.)

There is no such thing.  UTF-8 has no byte order issues, so that
character is taken to be what it really is: a zero width no-break space.
C is permitted to reject a file with a zero width no-break space in it and
it's not even obliged to take UTF-8 source.

> This is basic stuff. And I'm doing serial access on the string, not
> random.

You are in danger of complicating this for other people.  Everything is
working correctly in your code (with gcc at least) but it does not match
your expectation.  If you wanted your non-ASCII characters to be single
bytes you can pick any of the dozens of alternative encodings and keep
your fingers crossed that everyone else will know which one you chose.
The growing popularity of Unicode and UTF-8 is making that old "guess
the character I was thinking of" a thing of the past.  Don't keep it
going!

> So much for the majority of programs being able to work unchanged with
> UTF8! I'd need to start going into the multi-byte and wide char
> stuff. On Windows, that means UCS2 or UTF16 or whatever it is now,
> which apparently isn't good enough either.

You've shown one program that appears to be working.  But even if it's not,
that tells me nothing about the majority of programs.

-- 
Ben.


#77557

From	Malcolm McLean <malcolm.mclean5@btinternet.com>
Date	2015-12-01 12:09 -0800
Message-ID	<21078681-254f-4af6-8f17-9e967a409f28@googlegroups.com>
In reply to	#77513

On Tuesday, December 1, 2015 at 2:43:14 PM UTC, Ben Bacarisse wrote:
> BartC <bc@freeuk.com> writes:
> 
> You've shown one program that appears to be working.  But even if it's not,
> that tells me nothing about the majority of programs.
> 
If you write C source in UTF-8, with some of the string literals
and identifiers containing extended characters, does it still work?

As far as the identifiers go, it's hit and miss. It depends on the
exact code used for determining a valid identifier. As far as the
string literals go, it depends on the low-level interface to
printf(). If UTF-8 is accepted, the program will work. But most
likely it isn't, and printf will produce odd characters.

So it's not quite true that unless a program works directly with
glyphs, if it is UTF-8 naive, it should work correctly. The average
C compiler is an exception.


#77558

From	Ian Collins <ian-news@hotmail.com>
Date	2015-12-02 09:14 +1300
Message-ID	<dc6ddnFi96mU3@mid.individual.net>
In reply to	#77557

Malcolm McLean wrote:
> On Tuesday, December 1, 2015 at 2:43:14 PM UTC, Ben Bacarisse wrote:
>> BartC <bc@freeuk.com> writes:
>>
>> You've shown one program that appears to be working.  But even if it's not,
>> that tells me nothing about the majority of programs.
>>
> If you write C source in UTF-8, with some of the string literals
> and identifiers containing extended characters, does it still work?
>
> As far as the identifiers go, it's hit and miss. It depends on the
> exact code used for determining a valid identifier. As far as the
> string literals go, it depends on the low-level interface to
> printf(). If UTF-8 is accepted, the program will work. But most
> likely it isn't, and printf will produce odd characters.

Does it?  All printf is doing is sending a bunch of bytes to the 
console.  The interpretation of those bytes is handled by the console.

-- 
Ian Collins


#77561

From	Malcolm McLean <malcolm.mclean5@btinternet.com>
Date	2015-12-01 12:27 -0800
Message-ID	<f990da1c-ab87-488e-b0d3-9ac21750c20a@googlegroups.com>
In reply to	#77558

On Tuesday, December 1, 2015 at 8:15:05 PM UTC, Ian Collins wrote:
> Malcolm McLean wrote:
> As far as the
> > string literals go, it depends on the low-level interface to
> > printf(). If UTF-8 is accepted, the program will work. But most
> > likely it isn't, and printf will produce odd characters.
> 
> Does it?  All printf is doing is sending a bunch of bytes to the 
> console.  The interpretation of those bytes is handled by the console.
> 
I'd guess that the Windows DOS box (= console) accepts UTF-16 characters
but not UTF-8, and that somewhere in the printf implementation there's
a little routine that pads an ascii character to 16 bits. So it goes
wrong if printf is fed UTF-8, but a change would be trivial, as long
as you stick to the subset of unicode that can be encoded in single
16 bit code points.


#77573

From	Ian Collins <ian-news@hotmail.com>
Date	2015-12-02 10:14 +1300
Message-ID	<dc6guiFi96mU4@mid.individual.net>
In reply to	#77561

Malcolm McLean wrote:
> On Tuesday, December 1, 2015 at 8:15:05 PM UTC, Ian Collins wrote:
>> Malcolm McLean wrote:
>> As far as the
>>> string literals go, it depends on the low-level interface to
>>> printf(). If UTF-8 is accepted, the program will work. But most
>>> likely it isn't, and printf will produce odd characters.
>>
>> Does it?  All printf is doing is sending a bunch of bytes to the
>> console.  The interpretation of those bytes is handled by the console.
>>
> I'd guess that the Windows DOS box (= console) accepts UTF-16 characters
> but not UTF-8, and that somewhere in the printf implementation there's
> a little routine that pads an ascii character to 16 bits. So it goes
> wrong if printf is fed UTF-8, but a change would be trivial, as long
> as you stick to the subset of unicode that can be encoded in single
> 16 bit code points.

Any conversion is probably somewhere other than in the printf 
implementation, somewhere in the output driver most likely.

Consider what happens when fprintf is substituted for printf.  What 
would be the output in a DOS box from

   char s[]="£100 = €140";
   printf( "%d\n", fprintf( stdout,"%s\n",s ) );

-- 
Ian Collins


#77581

From	Stephen Sprunk <stephen@sprunk.org>
Date	2015-12-01 18:01 -0600
Message-ID	<n3lc7p$gmd$1@dont-email.me>
In reply to	#77561

On 01-Dec-15 14:27, Malcolm McLean wrote:
> On Tuesday, December 1, 2015 at 8:15:05 PM UTC, Ian Collins wrote:
>> Malcolm McLean wrote: As far as the
>>> string literals go, it depends on the low-level interface to 
>>> printf(). If UTF-8 is accepted, the program will work. But most 
>>> likely it isn't, and printf will produce odd characters.
>> 
>> Does it?  All printf is doing is sending a bunch of bytes to the 
>> console.  The interpretation of those bytes is handled by the
>> console.
> 
> I'd guess that the Windows DOS box (= console) accepts UTF-16
> characters but not UTF-8, and that somewhere in the printf
> implementation there's a little routine that pads an ascii character
> to 16 bits. So it goes wrong if printf is fed UTF-8, but a change
> would be trivial, as long as you stick to the subset of unicode that
> can be encoded in single 16 bit code points.

Windows programs can write to the console with either WriteConsoleW(),
which takes a UTF-16LE string, or WriteConsoleA(), which translates the
bytes to a UTF-16LE string according to the current code page and then
(in effect, if not in fact) passes it to WriteConsoleW().  Note that
printf() et al (eventually) call WriteConsoleA().

Console programs can use SetConsoleInputCP() and SetConsoleOutputCP() to
select any supported code page, including UTF-8 (65001), and users can
do the same with the "chcp" command.  Unfortunately, Windows does _not_
allow setting UTF-8 as the default code page, so you have to do this
every time you open a new console window.

S

-- 
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking


#77567

From	BartC <bc@freeuk.com>
Date	2015-12-01 20:41 +0000
Message-ID	<n3l0hk$u87$1@dont-email.me>
In reply to	#77558

On 01/12/2015 20:14, Ian Collins wrote:
> Malcolm McLean wrote:
>> On Tuesday, December 1, 2015 at 2:43:14 PM UTC, Ben Bacarisse wrote:
>>> BartC <bc@freeuk.com> writes:
>>>
>>> You've shown one program that appears to be working.  But even if it's not,
>>> that tells me nothing about the majority of programs.
>>>
>> If you write C source in UTF-8, with some of the string literals
>> and identifiers containing extended characters, does it still work?
>>
>> As far as the identifiers go, it's hit and miss. It depends on the
>> exact code used for determining a valid identifier. As far as the
>> string literals go, it depends on the low-level interface to
>> printf(). If UTF-8 is accepted, the program will work. But most
>> likely it isn't, and printf will produce odd characters.
>
> Does it?  All printf is doing is sending a bunch of bytes to the
> console.  The interpretation of those bytes is handled by the console.

If I run this code, where it prints the first 4 'somethings' of the string:

     printf("%.4s","£100pw");

Then it outputs "£10" in UTF8, not "£100". £90 is a big difference!

So does that 4 represent bytes or characters?

The specs for printf on MSDN say printf returns the number of characters 
printed, while the C standard says it's the number of characters 
transmitted.

But here it returns 4 for an output of "£10", clearly not 4 characters. 
So it's all a bit of a mess.

-- 
Bartc


#77569

From	Keith Thompson <kst-u@mib.org>
Date	2015-12-01 12:53 -0800
Message-ID	<lnpoypubyy.fsf@kst-u.example.com>
In reply to	#77567

BartC <bc@freeuk.com> writes:
[...]
> If I run this code, where it prints the first 4 'somethings' of the string:
>
>      printf("%.4s","£100pw");
>
> Then it outputs "£10" in UTF8, not "£100". £90 is a big difference!

The pound sign in your article is printed in my newsreader (actually in
GNU Emacs) as \243.  Your article headers include:

    Content-Type: text/plain; charset=windows-1252; format=flowed

Apparently my system (I'm using Linux) isn't configured to understand
windows-1252, so it falls back to displaying the character in octal.

I see you're using Thunderbird on Windows.  Is there any way you can
configure it to post using UTF-8?

Anyway ...

> So does that 4 represent bytes or characters?
>
> The specs for printf on MSDN say printf returns the number of characters 
> printed, while the C standard says it's the number of characters 
> transmitted.
>
> But here it returns 4 for an output of "£10", clearly not 4 characters. 
> So it's all a bit of a mess.

The C standard says that printf returns "the number of characters
transmitted, or a negative value if an output or encoding error
occurred".

It appears to be using the word "character" in the sense defined in
3.7.1:

    character
    single-byte character
    <C> bit representation that fits in a byte

as opposed to 3.7:

    character
    <abstract> member of a set of elements used for the organization,
    control, or representation of data

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
