Groups > comp.lang.c > #395270 > unrolled thread

Unicode...

Started by	Michael Sanders <porkchop@invalid.foo>
First post	2025-11-14 21:03 +0000
Last post	2025-11-23 22:05 +0000
Articles	20 on this page of 93 — 17 participants


  Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-14 21:03 +0000
    Re: Unicode... Kaz Kylheku <643-408-1753@kylheku.com> - 2025-11-14 21:20 +0000
      Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-14 21:46 +0000
        Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-14 16:12 -0800
          Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 00:46 +0000
            Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-14 18:47 -0800
              Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:10 +0000
                Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-15 13:51 -0800
                  Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 22:31 +0000
    Re: Unicode... richard@cogsci.ed.ac.uk (Richard Tobin) - 2025-11-14 23:23 +0000
      Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-14 23:51 +0000
    Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-14 16:11 -0800
      Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 00:49 +0000
    Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-15 05:51 +0100
      Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-15 06:24 +0100
        Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:28 +0000
          Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-19 11:56 +0100
            Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-21 02:21 +0000
              Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-21 11:10 +0100
        Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-16 00:38 +0000
        Re: Unicode... bart <bc@freeuk.com> - 2025-11-21 17:03 +0000
          Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-21 17:39 +0000
          Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 06:39 +0100
            Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 11:55 +0000
              Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 14:10 +0100
                Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 13:38 +0000
                  Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 15:08 +0100
                    Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 14:28 +0000
                      Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 15:51 +0100
                      Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 16:05 +0100
                        Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 16:35 +0000
                          Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 18:13 +0100
                            Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 17:35 +0000
                              Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 17:39 +0000
                                Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-22 15:24 -0800
                                  Re: Unicode... bart <bc@freeuk.com> - 2025-11-23 00:14 +0000
                                    Re: Unicode... David Brown <david.brown@hesbynett.no> - 2025-11-23 13:32 +0100
                              Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 18:44 +0100
                                Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 19:28 +0000
                                  Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 20:59 +0100
                                  Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-26 19:42 +0100
      Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:06 +0000
    Re: Unicode... Mikko <mikko.levanto@iki.fi> - 2025-11-15 12:47 +0200
      Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:09 +0000
        Re: Unicode... Mikko <mikko.levanto@iki.fi> - 2025-11-16 11:22 +0200
    Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:14 +0000
      Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 20:16 +0000
    Unicode Sorting (Was Re: Unicode...) Michael Sanders <porkchop@invalid.foo> - 2025-11-16 20:30 +0000
      Re: Unicode Sorting (Was Re: Unicode...) Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-16 16:13 -0800
    Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-17 23:49 +0000
      Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-11-18 14:27 -0500
        Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-18 20:17 +0000
          Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-18 20:40 +0000
          Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-11-19 09:08 -0500
            Re: Unicode... Michael Bäuerle <michael.baeuerle@stz-e.de> - 2025-11-19 15:29 +0100
            Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-19 19:22 +0000
            Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-26 02:03 +0000
          Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-12-03 06:24 +0100
            Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-12-03 18:33 +0000
              Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-12-03 14:01 -0500
                Re: Unicode... bart <bc@freeuk.com> - 2025-12-03 20:15 +0000
                  Re: Unicode... Michael S <already5chosen@yahoo.com> - 2025-12-03 22:43 +0200
                  Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-03 12:49 -0800
                    Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-03 18:15 -0800
                Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-12-03 23:23 +0000
                Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-12-04 14:15 +0100
              Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-12-04 14:03 +0100
            Binary Search Trees (Was Re: Unicode...) Michael Sanders <porkchop@invalid.foo> - 2025-12-04 04:11 +0000
        Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-24 06:17 +0000
          Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-23 22:22 -0800
            Re: Unicode... Lynn McGuire <lynnmcguire5@gmail.com> - 2025-12-24 01:41 -0600
              Re: Unicode... Michael S <already5chosen@yahoo.com> - 2025-12-24 11:24 +0200
              Re: Unicode... scott@slp53.sl.home (Scott Lurndal) - 2025-12-24 17:11 +0000
                Re: Unicode... Lynn McGuire <lynnmcguire5@gmail.com> - 2025-12-25 02:00 -0600
                  Re: Unicode... Michael S <already5chosen@yahoo.com> - 2025-12-25 10:49 +0200
                    Re: Unicode... Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2025-12-25 10:22 +0100
                    Re: Unicode... scott@slp53.sl.home (Scott Lurndal) - 2025-12-26 16:28 +0000
                      Re: Unicode... Lynn McGuire <lynnmcguire5@gmail.com> - 2025-12-27 00:25 -0600
                        Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-29 23:34 +0000
                    Re: Unicode... Lynn McGuire <lynnmcguire5@gmail.com> - 2025-12-27 00:29 -0600
                      Re: Unicode... Michael S <already5chosen@yahoo.com> - 2025-12-27 18:08 +0200
                        Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-29 23:38 +0000
                      Re: Unicode... scott@slp53.sl.home (Scott Lurndal) - 2025-12-27 19:17 +0000
                        Re: Unicode... Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2025-12-27 20:47 +0100
                          Re: Unicode... Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2025-12-27 20:03 +0000
                            Re: Unicode... Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2025-12-27 20:05 +0000
                              Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-29 23:39 +0000
                            Re: Unicode... Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2025-12-27 22:43 +0100
          Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-12-31 18:04 -0500
            Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-31 23:11 +0000
              Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-12-31 18:36 -0500
    Re: Unicode... Philipp Klaus Krause <pkk@spth.de> - 2025-11-23 12:42 +0100
      Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-23 22:05 +0000


#395672

From	bart <bc@freeuk.com>
Date	2025-12-03 20:15 +0000
Message-ID	<10gq5o5$3kjac$1@dont-email.me>
In reply to	#395671

On 03/12/2025 19:01, James Kuyper wrote:
> On 2025-12-03 13:33, Michael Sanders wrote:
> ...
>> We want portability across diverse OSs. In my case, the program
>> does NOT care what the character is, it simply needs to be able
>> to find it when searching data & displaying it in an ordered way.
>>
>> The code below works perfectly:
>>
>> #include <stdio.h>
>> #include <string.h>
>>
>> int utf8_display_width(const char *s) {
>>     int w = 0;
>>
>>     while (*s) {
>>         unsigned char b = *s;
>>         unsigned cp;
>>         int n;
>>
>>         // UTF-8 decoder
>>         if (b <= 0x7F) { // 1-byte ASCII
>>             cp = b;
>>             n = 1;
>>         } else if (b >= 0xC0 && b <= 0xDF) { // 2-byte
>>             cp = ((b & 0x1F) << 6) |
>>                  (s[1] & 0x3F);
>>             n = 2;
>>         } else if (b >= 0xE0 && b <= 0xEF) { // 3-byte
>>             cp = ((b & 0x0F) << 12) |
>>                  ((s[1] & 0x3F) << 6) |
>>                  (s[2] & 0x3F);
>>             n = 3;
>>         } else if (b >= 0xF0 && b <= 0xF7) { // 4-byte
>>             cp = ((b & 0x07) << 18) |
>>                  ((s[1] & 0x3F) << 12) |
>>                  ((s[2] & 0x3F) << 6) |
>>                  (s[3] & 0x3F);
>>             n = 4;
>>         } else { // invalid, treat as 1-byte
>>             cp = b;
>>             n = 1;
>>         }
>>
>>         // display width
>>         if (cp >= 0x0300 && cp <= 0x036F) {} // combining marks like é (zero width)
>>         else if ( // double-width characters...
>>             (cp >= 0x1100 && cp <= 0x115F) || // hangul jamo
>>             (cp >= 0x2E80 && cp <= 0xA4CF) || // cjk radicals & unified ideographs
>>             (cp >= 0xAC00 && cp <= 0xD7A3) || // hangul syllables
>>             (cp >= 0xF900 && cp <= 0xFAFF) || // cjk compatibility ideographs
>>             (cp >= 0x1F300 && cp <= 0x1FAFF) // emoji + symbols
>>         ) { w += 2; }
>>         // exceptional wide characters (unicode requirement I've read elsewhere)
>>         else if (cp == 0x2329 || cp == 0x232A) { w += 2; }
>>         else { w += 1; } // normal width for everything else
>>
>>         s += n;
>>     }
>>
>>     return w;
>> }
>>
>> int main(void) {
>>     const char *tests[] = {
>>         "hello",
>>         "Café",
>>         "漢字",
>>         "✓",
>>         "🙂",
>>         NULL
>>     };
>>
>>     // find maximum display width in 1st column
>>     int maxw = 0;
>>     for (int i = 0; tests[i]; i++) {
>>         int w = utf8_display_width(tests[i]);
>>         if (w > maxw) maxw = w;
>>     }
>>
>>     // total padding after each 1st column + 3 spaces
>>     int total_pad = maxw + 3;
>>
>>     for (int i = 0; tests[i]; i++) {
>>         int w = utf8_display_width(tests[i]);
>>         int sl = strlen(tests[i]);
>>         printf("%s", tests[i]);
>>         int pad = total_pad - w;
>>         while (pad-- > 0) putchar(' ');
>>         printf("strlen: %d  utf8 display width: %d\n", sl, w);
>>     }
>>
>>     return 0;
>> }
>>
>> // eof
> 
> 
> I find it confusing that this is supposed to "work perfectly" "across
> diverse OSs". The amount of space that a character takes up varies
> depending upon the installed fonts, especially on whether the font is
> monospaced or proportional. Those fonts can be different for display on
> screen or on a printer. I don't see any query to determine even what the
> current font is, much less what its characteristics are. I don't know
> of any OS-independent way of collecting such information. Does this
> solution "work perfectly" only for your own particular favorite font?


This looks like a solution for a fixed-pitch font. I get this output for 
a Windows console display (with - used for space):

hello---strlen: 5  utf8 display width: 5
Café----strlen: 5  utf8 display width: 4
漢字----strlen: 6  utf8 display width: 4
✓-------strlen: 3  utf8 display width: 1
🙂------strlen: 4  utf8 display width: 2

I was hoping this would be lined up, but already, in a Thunderbird edit 
Window, the last lines aren't lined up properly.

Same problem with Notepad (fixed pitch) and LibreOffice (fixed pitch).

It only looks alright in Windows and WSL consoles/terminals. But maybe 
that's all that's needed.


#395673

From	Michael S <already5chosen@yahoo.com>
Date	2025-12-03 22:43 +0200
Message-ID	<20251203224305.00004d8e@yahoo.com>
In reply to	#395672

On Wed, 3 Dec 2025 20:15:02 +0000
bart <bc@freeuk.com> wrote:
> 
> 
> This looks like a solution for a fixed-pitch font. I get this output
> for a Windows console display (with - used for space):
> 
> hello---strlen: 5  utf8 display width: 5
> Café----strlen: 5  utf8 display width: 4

That sounds like luck. The é in your text just happened to be encoded as
U+00E9. What if it were encoded as U+0065, U+0301 (e followed by a
combining acute accent; U+00B4 is the spacing form)?
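
To make the byte-level difference concrete, here is a minimal sketch. The
decomposed form of é is U+0065 followed by the combining acute accent U+0301
(UTF-8 bytes CC 81); U+00B4 is the separate spacing acute accent. The helper
name utf8_codepoints is made up for this illustration, and it assumes valid
UTF-8 input.

```c
#include <stddef.h>

/* Counts code points in a UTF-8 string by skipping continuation bytes
   (those of the form 10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;    /* lead byte or ASCII byte starts a new code point */
    }
    return n;
}
```

With "Caf\xC3\xA9" (precomposed) strlen is 5 and the count is 4 code points.
With "Cafe\xCC\x81" (decomposed) strlen is 6 and the count is 5, although both
strings are meant to display as the same four characters.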


> 漢字----strlen: 6  utf8 display width: 4
> ✓-------strlen: 3  utf8 display width: 1
> 🙂------strlen: 4  utf8 display width: 2
> 
> I was hoping this would be lined up, but already, in a Thunderbird
> edit Window, the last lines aren't lined up properly.
> 
> Same problem with Notepad (fixed pitch) and LibreOffice (fixed pitch).
> 
> It only looks alright in Windows and WSL consoles/terminals. But
> maybe that's all that's needed.
> 
> 
>


#395674

From	Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date	2025-12-03 12:49 -0800
Message-ID	<87bjkfnve4.fsf@example.invalid>
In reply to	#395672

bart <bc@freeuk.com> writes:
> On 03/12/2025 19:01, James Kuyper wrote:
[...]
>> I find it confusing that this is supposed to "work perfectly"
>> "across
>> diverse OSs". The amount of space that a character takes up varies
>> depending upon the installed fonts, especially on whether the font is
>> monospaced or proportional. Those fonts can be different for display on
>> screen or on a printer. I don't see any query to determine even what the
>> current font is, much less what its characteristics are. I don't know
>> of any OS-independent way of collecting such information. Does this
>> solution "work perfectly" only for your own particular favorite font?
>
> This looks like a solution for a fixed-pitch font. I get this output
> for a Windows console display (with - used for space):
[...]

I think bart is right that this is specific to fixed-width fonts.
For a variable width font, 'W' is going to be wider than '|'.

See also the POSIX `int wcwidth(wchar_t wc)` function, which returns
the "number of column positions of a wide-character code".  It does
depend on the current locale.

The assumption seems to be that fixed-width fonts are expected to be
consistent about the widths of characters.
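
A rough sketch of how that could be used for a whole string, assuming a POSIX
system where wcswidth() (the string counterpart of wcwidth()) is available.
The function name locale_display_width is made up for this example, and the
result depends on the locale the caller has selected with setlocale().

```c
#define _XOPEN_SOURCE 700   /* asks POSIX headers to declare wcswidth() */
#include <stdlib.h>
#include <wchar.h>

/* Returns the number of column positions the multibyte string s occupies
   in the current locale, or -1 if s cannot be converted or contains a
   non-printable wide character. */
int locale_display_width(const char *s) {
    size_t n = mbstowcs(NULL, s, 0);    /* number of wide characters needed */
    if (n == (size_t)-1)
        return -1;                      /* invalid sequence for this locale */

    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (!ws)
        return -1;

    mbstowcs(ws, s, n + 1);             /* convert, including the terminator */
    int w = wcswidth(ws, n);            /* -1 if any character is non-printable */
    free(ws);
    return w;
}
```

A program would normally call setlocale(LC_ALL, "") once at startup before
using this; in the default "C" locale only plain ASCII text can be expected
to give a useful answer.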

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */


#395679

From	Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date	2025-12-03 18:15 -0800
Message-ID	<877bv3ngad.fsf@example.invalid>
In reply to	#395674

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
> bart <bc@freeuk.com> writes:
>> On 03/12/2025 19:01, James Kuyper wrote:
> [...]
>>> I find it confusing that this is supposed to "work perfectly"
>>> "across
>>> diverse OSs". The amount of space that a character takes up varies
>>> depending upon the installed fonts, especially on whether the font is
>>> monospaced or proportional. Those fonts can be different for display on
>>> screen or on a printer. I don't see any query to determine even what the
>>> current font is, much less what its characteristics are. I don't know
>>> of any OS-independent way of collecting such information. Does this
>>> solution "work perfectly" only for your own particular favorite font?
>>
>> This looks like a solution for a fixed-pitch font. I get this output
>> for a Windows console display (with - used for space):
> [...]
>
> I think bart is right that this is specific to fixed-width fonts.
> For a variable width font, 'W' is going to be wider than '|'.
>
> See also the POSIX `int wcwidth(wchar_t wc)` function, which returns
> the "number of column positions of a wide-character code".  It does
> depend on the current locale.
>
> The assumption seems to be that fixed-width fonts are expected to be
> consistent about the widths of characters.

And in fact Unicode specifies how many cell positions each printable
character occupies, or at least for some of them.

The following is quoted from wcwidth.c in the xterm sources.  The text
was originally written by Markus Kuhn.

 * For some graphical characters, the Unicode standard explicitly
 * defines a character-cell width via the definition of the East Asian
 * FullWidth (F), Wide (W), Half-width (H), and Narrow (Na) classes.
 * In all these cases, there is no ambiguity about which width a
 * terminal shall use. For characters in the East Asian Ambiguous (A)
 * class, the width choice depends purely on a preference of backward
 * compatibility with either historic CJK or Western practice.
 * Choosing single-width for these characters is easy to justify as
 * the appropriate long-term solution, as the CJK practice of
 * displaying these characters as double-width comes from historic
 * implementation simplicity (8-bit encoded characters were displayed
 * single-width and 16-bit ones double-width, even for Greek,
 * Cyrillic, etc.) and not any typographic considerations.
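
As an illustration of a table-driven lookup in that spirit, here is a small
sketch. The ranges are the ones from the code posted earlier in this thread,
not the full East Asian Width data, and the names cp_range, wide_ranges,
in_ranges and cell_width are invented for the example.

```c
#include <stddef.h>

struct cp_range { unsigned long first, last; };

/* Sorted, non-overlapping ranges taken from the code posted in this thread.
   An approximation only: the real Unicode data has many more entries. */
static const struct cp_range wide_ranges[] = {
    { 0x1100,  0x115F  },   /* hangul jamo */
    { 0x2329,  0x232A  },   /* angle brackets */
    { 0x2E80,  0xA4CF  },   /* cjk radicals & unified ideographs */
    { 0xAC00,  0xD7A3  },   /* hangul syllables */
    { 0xF900,  0xFAFF  },   /* cjk compatibility ideographs */
    { 0x1F300, 0x1FAFF },   /* emoji + symbols */
};

/* Binary search: returns 1 if cp lies inside one of the n sorted ranges. */
int in_ranges(unsigned long cp, const struct cp_range *t, size_t n) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < t[mid].first)      hi = mid;
        else if (cp > t[mid].last)  lo = mid + 1;
        else                        return 1;
    }
    return 0;
}

/* Cell width of one code point: 0 for combining diacritical marks,
   2 for the wide ranges above, 1 for everything else. */
int cell_width(unsigned long cp) {
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;
    return in_ranges(cp, wide_ranges,
                     sizeof wide_ranges / sizeof wide_ranges[0]) ? 2 : 1;
}
```

Keeping the ranges in a sorted table means that extending coverage is a data
change rather than another branch in an if chain.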

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */


#395675

From	Michael Sanders <porkchop@invalid.foo>
Date	2025-12-03 23:23 +0000
Message-ID	<10gqgpi$3or1n$1@dont-email.me>
In reply to	#395671

On Wed, 3 Dec 2025 14:01:38 -0500, James Kuyper wrote:

> I find it confusing that this is supposed to "work perfectly" "across
> diverse OSs". The amount of space that a character takes up varies
> depending upon the installed fonts, especially on whether the font is
> monospaced or proportional. Those fonts can be different for display on
> screen or on a printer. I don't see any query to determine even what the
> current font is, much less what its characteristics are. I don't know
> of any OS-independent way of collecting such information. Does this
> solution "work perfectly" only for your own particular favorite font?

Just for use in the terminal & yes it works as advertised.

In my case I simply need to match the character the user passed
to the program when searching for a record. I don't want or need
to know what font is used. If the terminal can display it, then
I want to use it.
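
For the matching part, a plain byte search is enough, roughly like the sketch
below. In UTF-8 lead bytes and continuation bytes use disjoint value ranges,
so a match for a valid UTF-8 needle always begins on a character boundary.
The name utf8_find is made up here, and the sketch assumes both strings use
the same normalization form (a precomposed é will not match a decomposed one).

```c
#include <string.h>

/* Returns the byte offset of the first occurrence of needle in haystack,
   or -1 if it does not occur. Both strings are assumed to be valid UTF-8
   in the same normalization form; no decoding is needed. */
long utf8_find(const char *haystack, const char *needle) {
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1;
}
```

The offset is in bytes, not characters or columns, so it can be used for
slicing the buffer but not directly for aligning output.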

Example, user invokes: tinybase -s=漢字 data/*.tbf

Output is...

FILE:  data/history.tbf
LINE:  170
BLOCK: 4
CRC-8: 0x30
QUERY: 漢字
MATCH: 漢字

TAGS: China, History, <漢字>, [wrap:66]

Ancient China...

1. Geography and Early Beginnings: Ancient China, a cradle of
civilization, evolved along the Yellow River's fertile plains.
Protected by the Himalayas to the south, the Gobi Desert to the
north, and vast seas to the east, this geographic isolation
allowed for a unique and continuous cultural development spanning
millennia.

...

James, earnestly intending no offense: add something to the
conversation rather than complaining. I want to learn & solve
problems; that's where I'm seeking help. Just modify the code,
make it get closer to your ideal. We'll all benefit.
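
One possible change along those lines, as a sketch rather than a drop-in
replacement: the posted decoder reads s[1], s[2] and s[3] without checking
them, so a string that ends in a truncated multi-byte sequence is read past
its terminator. The hypothetical helper utf8_decode below checks every
continuation byte first. It still does not reject overlong encodings or
surrogates.

```c
#include <stddef.h>

/* Decodes one code point starting at s, stores it in *cp and returns the
   number of bytes consumed (1 to 4). Invalid or truncated input yields
   U+FFFD and consumes one byte, so a caller looping while (*s) always
   advances and never reads past the terminating '\0'. */
size_t utf8_decode(const char *s, unsigned long *cp) {
    const unsigned char *p = (const unsigned char *)s;
    unsigned long c;
    size_t n;

    if (p[0] < 0x80)                { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; n = 4; }
    else                            { *cp = 0xFFFD; return 1; }

    for (size_t i = 1; i < n; i++) {
        /* '\0' is not a continuation byte, so this also stops at the
           end of the string before any further byte is read. */
        if ((p[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        c = (c << 6) | (p[i] & 0x3F);
    }
    *cp = c;
    return n;
}
```

In the width function the decoder branch would then shrink to
`n = utf8_decode(s, &cp);` followed by the existing width test.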

-- 
:wq
Mike Sanders


#395682

From	Bonita Montero <Bonita.Montero@gmail.com>
Date	2025-12-04 14:15 +0100
Message-ID	<10gs1gd$a1f1$1@raubtier-asyl.eternal-september.org>
In reply to	#395671

Am 03.12.2025 um 20:01 schrieb James Kuyper:
> I find it confusing that this is supposed to "work perfectly" "across
> diverse OSs". The amount of space that a character takes up varies
> depending upon the installed fonts, especially on whether the font is
> monospaced or proportional. Those fonts can be different for display on
> screen or on a printer. I don't see any query to determine even what the
> current font is, much less what its characteristics are. I don't know
> of any OS-independent way of collecting such information. Does this
> solution "work perfectly" only for your own particular favorite font?
Can C handle that with the means provided by the standard itself?
And is it really necessary to consider this? Consoles are almost always
fixed-pitch. I guess the default output of a laser printer in
line-printer mode is also fixed-pitch.


#395681

From	Bonita Montero <Bonita.Montero@gmail.com>
Date	2025-12-04 14:03 +0100
Message-ID	<10gs0qg$9mjl$1@raubtier-asyl.eternal-september.org>
In reply to	#395670

Am 03.12.2025 um 19:33 schrieb Michael Sanders:
> On Wed, 3 Dec 2025 06:24:23 +0100, Bonita Montero wrote:
>
>>> Here I'm running any mixture of: Windows/BSD/Linux Mint LMDE.
>> Windows has the ...W() APIs along with codepage-based APIs with
>> the ...A() Suffix. The W()-APIs support UTF-16, so no need for
> Hi Bonita.
>
> Yes that's correct, but...
>
> - that assumes we know in advance what the character is
>
> - it would only work under Windows
>
> We want portability across diverse OSs. In my case, the program
> does NOT care what the character is, it simply needs to be able
> to find it when searching data & displaying it in an ordered way.
VC++ supports the C and C++ locale facilities if you want it to be
portable. The locale support in C++, with its facets, is especially
nice to work with: https://en.cppreference.com/w/cpp/locale.html

>
> The code below works perfectly:
>
> #include <stdio.h>
> #include <string.h>
>
> int utf8_display_width(const char *s) {
>      int w = 0;
>
>      while (*s) {
>          unsigned char b = *s;
>          unsigned cp;
>          int n;
>
>          // UTF-8 decoder
>          if (b <= 0x7F) { // 1-byte ASCII
>              cp = b;
>              n  = 1;
>          } else if (b >= 0xC0 && b <= 0xDF) { // 2-byte
>              cp = ((b & 0x1F) << 6) |
>                   (s[1] & 0x3F);
>              n  = 2;
>          } else if (b >= 0xE0 && b <= 0xEF) { // 3-byte
>              cp = ((b & 0x0F) << 12)   |
>                   ((s[1] & 0x3F) << 6) |
>                   (s[2] & 0x3F);
>               n = 3;
>          } else if (b >= 0xF0 && b <= 0xF7) { // 4-byte
>              cp = ((b & 0x07) << 18)    |
>                   ((s[1] & 0x3F) << 12) |
>                   ((s[2] & 0x3F) << 6)  |
>                   (s[3] & 0x3F);
>               n = 4;
>          } else { // invalid, treat as 1-byte
>              cp = b;
>              n  = 1;
>          }
>
>          // display width
>          if (cp >= 0x0300 && cp <= 0x036F) {}   // combining marks like é (zero width)
>          else if (                              // double-width characters...
>              (cp >= 0x1100  && cp <= 0x115F) || // hangul jamo
>              (cp >= 0x2E80  && cp <= 0xA4CF) || // cjk radicals & unified ideographs
>              (cp >= 0xAC00  && cp <= 0xD7A3) || // hangul syllables
>              (cp >= 0xF900  && cp <= 0xFAFF) || // cjk compatibility ideographs
>              (cp >= 0x1F300 && cp <= 0x1FAFF)   // emoji + symbols
>          ) { w += 2; }
>          // exceptional wide characters (unicode requirement I've read elsewhere)
>          else if (cp == 0x2329 || cp == 0x232A) { w += 2; }
>          else { w += 1; } // normal width for everything else
>
>          s += n;
>      }
>
>      return w;
> }
>
> int main(void) {
>      const char *tests[] = {
>          "hello",
>          "Café",
>          "漢字",
>          "✓",
>          "🙂",
>          NULL
>      };
>
>      // find maximum display width in 1st column
>      int maxw = 0;
>      for (int i = 0; tests[i]; i++) {
>          int w = utf8_display_width(tests[i]);
>          if (w > maxw) maxw = w;
>      }
>
>      // total padding after each 1st column + 3 spaces
>      int total_pad = maxw + 3;
>
>      for (int i = 0; tests[i]; i++) {
>          int w = utf8_display_width(tests[i]);
>          int sl = strlen(tests[i]);
>          printf("%s", tests[i]);
>          int pad = total_pad - w;
>          while (pad-- > 0) putchar(' ');
>          printf("strlen: %d  utf8 display width: %d\n", sl, w);
>      }
>
>      return 0;
> }
>
> // eof
>


#395680 — Binary Search Trees (Was Re: Unicode...)

From	Michael Sanders <porkchop@invalid.foo>
Date	2025-12-04 04:11 +0000
Subject	Binary Search Trees (Was Re: Unicode...)
Message-ID	<10gr1ln$3uldg$1@dont-email.me>
In reply to	#395663

Ever worked with binary search trees, Bonita?

I've been playing around with them, or was awhile back at least...

My criteria were to build nodes alphabetically:

- Left subtree contains keys less than the node

- Right subtree contains keys greater than or equal to the node
  (duplicates go right)

INSTRUMENTATION

    I
   / \
  E   N
 /   / \
A   M   S
   /   / \
  I   R   T
     /     \
    N       U
     \     /
      O   T
     /     \
    N       T
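
A minimal sketch of an insertion routine that reproduces this shape when the
letters of INSTRUMENTATION are inserted in order. The drawing places repeated
letters in the right subtree, so the sketch sends keys that compare equal to
the right. The names node, bst_insert, bst_inorder and bst_free are only
illustrative.

```c
#include <stdlib.h>

struct node {
    char key;
    struct node *left, *right;
};

/* Inserts key below *root. Smaller keys go left; equal or greater keys
   go right. Returns 0 on success, -1 if allocation fails. */
int bst_insert(struct node **root, char key) {
    while (*root)
        root = (key < (*root)->key) ? &(*root)->left : &(*root)->right;

    struct node *n = malloc(sizeof *n);
    if (!n)
        return -1;
    n->key = key;
    n->left = n->right = NULL;
    *root = n;
    return 0;
}

/* In-order traversal: writes the keys in ascending order to out and
   returns the position just past the last key written. */
char *bst_inorder(const struct node *t, char *out) {
    if (!t)
        return out;
    out = bst_inorder(t->left, out);
    *out++ = t->key;
    return bst_inorder(t->right, out);
}

/* Releases every node of the tree. */
void bst_free(struct node *t) {
    if (!t)
        return;
    bst_free(t->left);
    bst_free(t->right);
    free(t);
}
```

Walking the finished tree in order yields the letters sorted,
AEIIMNNNORSTTTU, which is a quick check that the ordering rule held for
every insertion.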

-- 
:wq
Mike Sanders


#395930

From	Lawrence D’Oliveiro <ldo@nz.invalid>
Date	2025-12-24 06:17 +0000
Message-ID	<10ig0i8$qjm9$3@dont-email.me>
In reply to	#395301

On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:

> Could you identify which document guarantees that every Unicode locale
> contains "UTF-8"?

How else would it work? Bytes have to be 8-bit.


#395931

From	Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date	2025-12-23 22:22 -0800
Message-ID	<878qes8kll.fsf@example.invalid>
In reply to	#395930

Lawrence D’Oliveiro <ldo@nz.invalid> writes:
> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>> Could you identify which document guarantees that every Unicode locale
>> contains "UTF-8"?
>
> How else would it work? Bytes have to be 8-bit.

I can't figure out what point you're trying to make.

Obviously bytes in C have to be *at least* 8 bits, but I don't see
the relevance.

Take a look at the article to which you replied.  How does your
followup have anything to do with it?

One of several points that you snipped is that locale names can
contain the string "utf8", not "UTF-8".
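
One way to avoid depending on the spelling in the locale name is to ask for
the codeset and compare loosely. This is a sketch that assumes a POSIX system
providing nl_langinfo(CODESET); the helper names is_utf8_name and
current_locale_is_utf8 are invented for the example.

```c
#include <ctype.h>
#include <langinfo.h>
#include <stddef.h>

/* Returns 1 if name spells UTF-8 in any common variant. The comparison
   ignores case and skips '-' and '_', so "UTF-8", "utf8" and "utf-8"
   are all accepted. */
int is_utf8_name(const char *name) {
    static const char want[] = "utf8";
    size_t i = 0;

    for (; *name; name++) {
        if (*name == '-' || *name == '_')
            continue;
        if (i >= sizeof want - 1 ||
            tolower((unsigned char)*name) != want[i])
            return 0;
        i++;
    }
    return i == sizeof want - 1;
}

/* Asks the C library for the codeset of the current locale instead of
   parsing the locale name. Call setlocale(LC_ALL, "") beforehand. */
int current_locale_is_utf8(void) {
    return is_utf8_name(nl_langinfo(CODESET));
}
```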

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */


#395932

From	Lynn McGuire <lynnmcguire5@gmail.com>
Date	2025-12-24 01:41 -0600
Message-ID	<10ig5fa$rra5$1@dont-email.me>
In reply to	#395931

On 12/24/2025 12:22 AM, Keith Thompson wrote:
> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>>> Could you identify which document guarantees that every Unicode locale
>>> contains "UTF-8"?
>>
>> How else would it work? Bytes have to be 8-bit.
> 
> I can't figure out what point you're trying to make.
> 
> Obviously bytes in C have to be *at least* 8 bits, but I don't see
> the relevance.
> 
> Take a look at the article to which you replied.  How does your
> followup have anything to do with it?
> 
> One of several points that you snipped is that locale names can
> contain the string "utf8", not "UTF-8".

Did C never work on the 6 bit machines such as the Univac 1108 (36 bit) 
or the CDC 7600 (60 bit) ?

Lynn


#395937

From	Michael S <already5chosen@yahoo.com>
Date	2025-12-24 11:24 +0200
Message-ID	<20251224112404.000015df@yahoo.com>
In reply to	#395932

On Wed, 24 Dec 2025 01:41:30 -0600
Lynn McGuire <lynnmcguire5@gmail.com> wrote:

> On 12/24/2025 12:22 AM, Keith Thompson wrote:
> > Lawrence D’Oliveiro <ldo@nz.invalid> writes:  
> >> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:  
> >>> Could you identify which document guarantees that every Unicode
> >>> locale contains "UTF-8"?  
> >>
> >> How else would it work? Bytes have to be 8-bit.  
> > 
> > I can't figure out what point you're trying to make.
> > 
> > Obviously bytes in C have to be *at least* 8 bits, but I don't see
> > the relevance.
> > 
> > Take a look at the article to which you replied.  How does your
> > followup have anything to do with it?
> > 
> > One of several points that you snipped is that locale names can
> > contain the string "utf8", not "UTF-8".  
> 
> Did C never work on the 6 bit machines such as the Univac 1108 (36
> bit) or the CDC 7600 (60 bit) ?
> 
> Lynn
> 

It depends on the definition of the word "C".
The requirement for CHAR_BIT > 7 was not present in K&R C. IIRC, it
first came in C90.

Also, what prevents a C90 compiler from using a 36-bit char on the
Univac 1108 and 60-bit bytes on the CDC 7600? Methinks it would be very
reasonable. As it happens, that* was the choice made by both TI and
Analog Devices for the C compilers of their word-addressable DSPs.


* - not specifically 36 or 60 bits, but CHAR_BIT = native word width.
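
Code that cares can check this at compile time. A small sketch, assuming a
C11 compiler for _Static_assert; the helper name bits_in is made up.

```c
#include <limits.h>
#include <stddef.h>

/* Standard C guarantees at least 8 bits per char, but not exactly 8. */
_Static_assert(CHAR_BIT >= 8, "a conforming implementation has CHAR_BIT >= 8");

/* Width in bits of an object whose sizeof is size_in_bytes. On a
   word-addressable target where char is the native word, sizeof(int)
   may well be 1 while bits_in(sizeof(int)) is 32. */
size_t bits_in(size_t size_in_bytes) {
    return size_in_bytes * CHAR_BIT;
}
```

A program that really requires octets can assert CHAR_BIT == 8 instead and
refuse to build elsewhere.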


#395952

From	scott@slp53.sl.home (Scott Lurndal)
Date	2025-12-24 17:11 +0000
Message-ID	<j9V2R.199268$79B9.139603@fx14.iad>
In reply to	#395932

Lynn McGuire <lynnmcguire5@gmail.com> writes:
>On 12/24/2025 12:22 AM, Keith Thompson wrote:
>> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
>>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>>>> Could you identify which document guarantees that every Unicode locale
>>>> contains "UTF-8"?
>>>
>>> How else would it work? Bytes have to be 8-bit.
>> 
>> I can't figure out what point you're trying to make.
>> 
>> Obviously bytes in C have to be *at least* 8 bits, but I don't see
>> the relevance.
>> 
>> Take a look at the article to which you replied.  How does your
>> followup have anything to do with it?
>> 
>> One of several points that you snipped is that locale names can
>> contain the string "utf8", not "UTF-8".
>
>Did C never work on the 6 bit machines such as the Univac 1108 (36 bit) 

Yes, there is a C compiler for the Univac machines.   The byte size is
9 bits.


#395965

From	Lynn McGuire <lynnmcguire5@gmail.com>
Date	2025-12-25 02:00 -0600
Message-ID	<10iiquh$1n1it$1@dont-email.me>
In reply to	#395952

On 12/24/2025 11:11 AM, Scott Lurndal wrote:
> Lynn McGuire <lynnmcguire5@gmail.com> writes:
>> On 12/24/2025 12:22 AM, Keith Thompson wrote:
>>> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
>>>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>>>>> Could you identify which document guarantees that every Unicode locale
>>>>> contains "UTF-8"?
>>>>
>>>> How else would it work? Bytes have to be 8-bit.
>>>
>>> I can't figure out what point you're trying to make.
>>>
>>> Obviously bytes in C have to be *at least* 8 bits, but I don't see
>>> the relevance.
>>>
>>> Take a look at the article to which you replied.  How does your
>>> followup have anything to do with it?
>>>
>>> One of several points that you snipped is that locale names can
>>> contain the string "utf8", not "UTF-8".
>>
>> Did C never work on the 6 bit machines such as the Univac 1108 (36 bit)
> 
> Yes, there is a C compiler for the Univac machines.   The byte size is
> 9 bits.

I get the feeling that you are messing with me.  That would be four 9 
bit characters per 36 bit word.

But the machinations to store that unnatural 9 bits would be crazy.  I 
doubt that would be supported in hardware.

Lynn


#395967

From	Michael S <already5chosen@yahoo.com>
Date	2025-12-25 10:49 +0200
Message-ID	<20251225104901.00005fb1@yahoo.com>
In reply to	#395965

On Thu, 25 Dec 2025 02:00:16 -0600
Lynn McGuire <lynnmcguire5@gmail.com> wrote:

> On 12/24/2025 11:11 AM, Scott Lurndal wrote:
> > Lynn McGuire <lynnmcguire5@gmail.com> writes:  
> >> On 12/24/2025 12:22 AM, Keith Thompson wrote:  
> >>> Lawrence D’Oliveiro <ldo@nz.invalid> writes:  
> >>>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:  
> >>>>> Could you identify which document guarantees that every Unicode
> >>>>> locale contains "UTF-8"?  
> >>>>
> >>>> How else would it work? Bytes have to be 8-bit.  
> >>>
> >>> I can't figure out what point you're trying to make.
> >>>
> >>> Obviously bytes in C have to be *at least* 8 bits, but I don't see
> >>> the relevance.
> >>>
> >>> Take a look at the article to which you replied.  How does your
> >>> followup have anything to do with it?
> >>>
> >>> One of several points that you snipped is that locale names can
> >>> contain the string "utf8", not "UTF-8".  
> >>
> >> Did C never work on the 6 bit machines such as the Univac 1108 (36
> >> bit)  
> > 
> > Yes, there is a C compiler for the Univac machines.   The byte size
> > is 9 bits.  
> 
> I get the feeling that you are messing with me.  That would be four 9 
> bit characters per 36 bit word.
> 
> But the machinations to store that unnatural 9 bits would be crazy.
> I doubt that would be supported in hardware.
> 
> Lynn
> 

Does not the same apply even stronger to your original suggestion to
use 6-bit characters?


#395970

From	Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date	2025-12-25 10:22 +0100
Message-ID	<10iivo3$25ihi$9@dont-email.me>
In reply to	#395967

On 2025-12-25 09:49, Michael S wrote:
> On Thu, 25 Dec 2025 02:00:16 -0600
> Lynn McGuire <lynnmcguire5@gmail.com> wrote:
> 
>> On 12/24/2025 11:11 AM, Scott Lurndal wrote:
>>> Lynn McGuire <lynnmcguire5@gmail.com> writes:
>>>>
>>>> Did C never work on the 6 bit machines such as the Univac 1108 (36
>>>> bit)
>>>
>>> Yes, there is a C compiler for the Univac machines.   The byte size
>>> is 9 bits.
>>
>> I get the feeling that you are messing with me.  That would be four 9
>> bit characters per 36 bit word.
>>
>> But the machinations to store that unnatural 9 bits would be crazy.
>> I doubt that would be supported in hardware.
> 
> Does not the same apply even stronger to your original suggestion to
> use 6-bit characters?

I don't recall whether the mainframes I used - and which of them - had
actually a "C" compiler; I think our 360-clone(?) at least had one. All
I can say is that it seems natural to support characters of appropriate
sizes. Our CDC (175 or 176; 60 bit) had used in Pascal 6 bit characters
(the 'text' data type was a 'packed array [1..10] of character'). And
I'd suppose that a 36 bit based architecture might use 9 bit characters
(or maybe use the spare bit just for error checking, or ignore it?).
Anyway, in my K&R version there's the "Honeywell 6000" hardware listed
with a 9 bit 'char' type.

Janis
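
The "four 9-bit characters per 36-bit word" layout discussed above can be simulated on an ordinary 8-bit-byte host. This is a sketch under stated assumptions: the 36-bit word is held in the low bits of a uint64_t, slot 0 is taken to be the most significant quarter, and the names put_quarter and get_quarter are made up for illustration. It does not reproduce the actual Univac or Honeywell instructions.

```c
#include <stdint.h>

#define WORD_BITS    36      /* simulated word width */
#define QUARTER_BITS 9       /* one quarter-word, i.e. one 9-bit char */
#define QUARTER_MASK 0x1FFu  /* nine 1-bits */

/* Store a 9-bit value in quarter-word slot idx (0..3).
   Assumption: slot 0 is the most significant quarter. */
static uint64_t put_quarter(uint64_t word, unsigned idx, unsigned value)
{
    unsigned shift = WORD_BITS - QUARTER_BITS * (idx + 1);
    word &= ~((uint64_t)QUARTER_MASK << shift);         /* clear the slot */
    word |= (uint64_t)(value & QUARTER_MASK) << shift;  /* write new value */
    return word;
}

/* Fetch the 9-bit value held in quarter-word slot idx (0..3). */
static unsigned get_quarter(uint64_t word, unsigned idx)
{
    unsigned shift = WORD_BITS - QUARTER_BITS * (idx + 1);
    return (unsigned)((word >> shift) & QUARTER_MASK);
}
```

On hardware with quarter-word access these shifts and masks are done by the machine; the C programmer just sees a char with CHAR_BIT equal to 9.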


#395985

From	scott@slp53.sl.home (Scott Lurndal)
Date	2025-12-26 16:28 +0000
Message-ID	<uIy3R.69122$gOda.66163@fx48.iad>
In reply to	#395967

Michael S <already5chosen@yahoo.com> writes:
>On Thu, 25 Dec 2025 02:00:16 -0600
>Lynn McGuire <lynnmcguire5@gmail.com> wrote:
>

>> >> Did C never work on the 6 bit machines such as the Univac 1108 (36
>> >> bit)
>> >
>> > Yes, there is a C compiler for the Univac machines.   The byte size
>> > is 9 bits.
>>
>> I get the feeling that you are messing with me.  That would be four 9
>> bit characters per 36 bit word.

Indeed, that would be the case.

You know, you can always look this stuff up.

https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Data_formats


#395991

From	Lynn McGuire <lynnmcguire5@gmail.com>
Date	2025-12-27 00:25 -0600
Message-ID	<10inu5h$385b0$1@dont-email.me>
In reply to	#395985

On 12/26/2025 10:28 AM, Scott Lurndal wrote:
> Michael S <already5chosen@yahoo.com> writes:
>> On Thu, 25 Dec 2025 02:00:16 -0600
>> Lynn McGuire <lynnmcguire5@gmail.com> wrote:
>>
> 
>>>>> Did C never work on the 6 bit machines such as the Univac 1108 (36
>>>>> bit)
>>>>
>>>> Yes, there is a C compiler for the Univac machines.   The byte size
>>>> is 9 bits.
>>>
>>> I get the feeling that you are messing with me.  That would be four 9
>>> bit characters per 36 bit word.
> 
> Indeed, that would be the case.
> 
> You know, you can always look this stuff up.
> 
> https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Data_formats

Wild.  I wrote Fortran IV/66 software on Univac 1108 from 1975 to 1980 
and never knew that it had quarter word instructions.  We stored 6 
characters in the 36 bit words (all upper case) until we ported to the 
IBM 370 in 1978 or 1979 when we had to switch to four characters per word.

You know, we ported to the Prime 450 in 1977 when we bought one.  If I 
remember correctly, the Prime was a 32 bit word / 8 bit byte machine so 
we did the 4 characters max for an integer on that port, not the IBM 370 
port.  All those years run together now so I am not sure which and what 
port happened when at all.

It was a major change in our software and used a lot more ram in storing 
characters in integer arrays.  We did not move to Fortran 77 until 1990 
or so since the mainframe vendors charged a lot more to use the F77 
compiler instead of the F66 compiler; compile time was way slower also.

Lynn
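
The extra memory mentioned above follows from simple arithmetic: going from 6 to 4 characters per word means half again as many words for the same text. A small sketch (the helper name words_needed and the 80-character example are made up for illustration):

```c
#include <stddef.h>

/* Words needed to hold n characters at per_word characters per word,
   rounded up because a partly filled word still occupies a whole word.
   per_word must be nonzero. */
static size_t words_needed(size_t n, size_t per_word)
{
    return (n + per_word - 1) / per_word;
}
```

For an 80-character line, words_needed(80, 6) is 14 and words_needed(80, 4) is 20.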


#396012

From	Lawrence D’Oliveiro <ldo@nz.invalid>
Date	2025-12-29 23:34 +0000
Message-ID	<10iv35t$1dnpl$1@dont-email.me>
In reply to	#395991

On Sat, 27 Dec 2025 00:25:51 -0600, Lynn McGuire wrote:

> We did not move to Fortran 77 until 1990 or so since the mainframe
> vendors charged a lot more to use the F77 compiler instead of the
> F66 compiler, compile time was way slower also.

Just in time for Fortran-90 to come out ...


#395993

From	Lynn McGuire <lynnmcguire5@gmail.com>
Date	2025-12-27 00:29 -0600
Message-ID	<10inucr$385b0$2@dont-email.me>
In reply to	#395967

On 12/25/2025 2:49 AM, Michael S wrote:
> On Thu, 25 Dec 2025 02:00:16 -0600
> Lynn McGuire <lynnmcguire5@gmail.com> wrote:
> 
>> On 12/24/2025 11:11 AM, Scott Lurndal wrote:
>>> Lynn McGuire <lynnmcguire5@gmail.com> writes:
>>>> On 12/24/2025 12:22 AM, Keith Thompson wrote:
>>>>> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
>>>>>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>>>>>>> Could you identify which document guarantees that every Unicode
>>>>>>> locale contains "UTF-8"?
>>>>>>
>>>>>> How else would it work? Bytes have to be 8-bit.
>>>>>
>>>>> I can't figure out what point you're trying to make.
>>>>>
>>>>> Obviously bytes in C have to be *at least* 8 bits, but I don't see
>>>>> the relevance.
>>>>>
>>>>> Take a look at the article to which you replied.  How does your
>>>>> followup have anything to do with it?
>>>>>
>>>>> One of several points that you snipped is that locale names can
>>>>> contain the string "utf8", not "UTF-8".
>>>>
>>>> Did C never work on the 6 bit machines such as the Univac 1108 (36
>>>> bit)
>>>
>>> Yes, there is a C compiler for the Univac machines.   The byte size
>>> is 9 bits.
>>
>> I get the feeling that you are messing with me.  That would be four 9
>> bit characters per 36 bit word.
>>
>> But the machinations to store that unnatural 9 bits would be crazy.
>> I doubt that would be supported in hardware.
>>
>> Lynn
>>
> 
> Does not the same apply even stronger to your original suggestion to
> use 6-bit characters?

Those 6 bit characters, upper case only, were on the 36 bit (Univac 
1108) or 60 bit (CDC 7600) machines.  Those machines were native 6 bit 
bytes, at 6 bytes per word or 10 bytes per word.

Those machines were superseded by the 32 bit machines with 8 bit 
characters.  And now we have the 64 bit machines with 8 bit characters. 
We will have 128 bit machines soon in the relative sense, if not already.

Lynn
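
The "6 characters per 36-bit word" scheme described above (6 x 6 = 36, and likewise 10 x 6 = 60 on the CDC) can be sketched as follows. Everything here is an assumption for illustration: the word is simulated in a uint64_t, the 6-bit code table SIXBIT is hypothetical (it is neither Fieldata nor CDC display code), and the names pack6 and unpack6 are made up.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 6-bit character code, for illustration only: the index
   of a character in this table is its code (space = 0, A = 1, ...). */
static const char SIXBIT[] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

/* Map a character to its 6-bit code; anything not in the table
   (including lower case) becomes a space. */
static unsigned to_sixbit(char c)
{
    const char *p = (c != '\0') ? strchr(SIXBIT, c) : NULL;
    return p ? (unsigned)(p - SIXBIT) : 0u;
}

/* Pack up to six characters, left to right, into the low 36 bits of
   the result.  Short strings are padded with spaces. */
static uint64_t pack6(const char *s)
{
    uint64_t word = 0;
    for (int i = 0; i < 6; i++) {
        char c = (*s != '\0') ? *s++ : ' ';
        word = (word << 6) | to_sixbit(c);
    }
    return word;
}

/* Unpack six characters from a 36-bit word into out (NUL-terminated).
   Codes outside the table are shown as '?'. */
static void unpack6(uint64_t word, char out[7])
{
    for (int i = 5; i >= 0; i--) {
        unsigned code = (unsigned)(word & 0x3F);
        out[i] = (code < sizeof SIXBIT - 1) ? SIXBIT[code] : '?';
        word >>= 6;
    }
    out[6] = '\0';
}
```

With only 64 codes there is no room for both cases plus digits and punctuation, which is consistent with the upper-case-only text mentioned above.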


