Re: unicode is a fail

From Keith Thompson <kst-u@mib.org>
Newsgroups comp.lang.c
Subject Re: unicode is a fail
Date 2015-12-02 16:25 -0800
Organization None to speak of
Message-ID <lny4dcqsxu.fsf@kst-u.example.com>
References (1 earlier) <22fccd29-addc-4070-8d1d-c3f876f5f12e@googlegroups.com> <9d6e662f-e8eb-4f76-bc92-6d04d7b3eba0@googlegroups.com> <n3n969$k70$1@dont-email.me> <lnoae8sljm.fsf@kst-u.example.com> <n3npg5$os5$1@dont-email.me>



Stephen Sprunk <stephen@sprunk.org> writes:
> On 02-Dec-15 13:22, Keith Thompson wrote:
>> Stephen Sprunk <stephen@sprunk.org> writes: [...]
>>> UTF-32's simplicity comes at the cost of embedded NUL characters,
>>> so it's inherently incompatible with all existing C string-handling
>>> code. UTF-8 isn't perfect, but at least it is _usually_ compatible,
>>> and it has the side benefits of being endian-neutral and generally
>>> smaller.  UTF-16 takes the worst of both and the best of neither.
>> 
>> UTF-32 has that cost if it's encoded as a sequence of 4 octets per 
>> character.
>
> UTF-32 is, by definition, an encoding of exactly one 32-bit code unit
> per code point.

Surely that 32-bit code unit can be represented by a sequence of 4
octets.  For example, if I type

    echo hello | iconv -f utf-8 -t utf-32 > hello.utf32

I get a file that represents each character as 4 bytes (and that starts
with a 4-byte BOM).

> Depending on what you mean by "character", though, it may require a
> variable number of code points, and there are also code points for
> non-characters.  That's the _real_ problem, so UTF-32's alleged fixed
> width is misleading.

Yes, I was glossing over that issue.

>> If wchar_t is 32 bits, then the standard library functions that
>> handle arrays of wchar_t (wcslen() et al) don't have that problem;
>> only a 32-bit zero is treated as a (wide) null character.
>
> That solves the NUL-terminated string issue, but it also means every
> function with a string argument or return must be replaced or, worse,
> duplicated.  Ouch.

But that's pretty much already done; wcslen() et al are part of the
standard library.

> Worse, all that pain doesn't even solve the _real_ problems!
>
>> On the other hand, MS Windows has 16-bit wchar_t.
>
> That violates the C Standard's requirements, but it's also a popular
> enough platform that to ignore its problems may be unwise.

Does it?  wchar_t is supposed to be "an integer type whose range of
values can represent distinct codes for all members of the largest
extended character set specified among the supported locales".  A
conforming implementation whose largest extended character set has no
more than 65536 characters could legally use 16-bit wchar_t.  I don't
know whether that applies to Microsoft's implementation.

>> On the other other hand, C11 adds char16_t and char32_t.
>
> That seems like a sop to Microsoft.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"



