Groups > comp.lang.c > #395270 > unrolled thread
| Started by | Michael Sanders <porkchop@invalid.foo> |
|---|---|
| First post | 2025-11-14 21:03 +0000 |
| Last post | 2025-11-23 22:05 +0000 |
| Articles | 20 on this page of 93 — 17 participants |
Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-14 21:03 +0000
Re: Unicode... Kaz Kylheku <643-408-1753@kylheku.com> - 2025-11-14 21:20 +0000
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-14 21:46 +0000
Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-14 16:12 -0800
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 00:46 +0000
Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-14 18:47 -0800
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:10 +0000
Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-15 13:51 -0800
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 22:31 +0000
Re: Unicode... richard@cogsci.ed.ac.uk (Richard Tobin) - 2025-11-14 23:23 +0000
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-14 23:51 +0000
Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-14 16:11 -0800
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 00:49 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-15 05:51 +0100
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-15 06:24 +0100
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:28 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-19 11:56 +0100
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-21 02:21 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-21 11:10 +0100
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-16 00:38 +0000
Re: Unicode... bart <bc@freeuk.com> - 2025-11-21 17:03 +0000
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-21 17:39 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 06:39 +0100
Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 11:55 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 14:10 +0100
Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 13:38 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 15:08 +0100
Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 14:28 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 15:51 +0100
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 16:05 +0100
Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 16:35 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 18:13 +0100
Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 17:35 +0000
Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 17:39 +0000
Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-22 15:24 -0800
Re: Unicode... bart <bc@freeuk.com> - 2025-11-23 00:14 +0000
Re: Unicode... David Brown <david.brown@hesbynett.no> - 2025-11-23 13:32 +0100
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 18:44 +0100
Re: Unicode... bart <bc@freeuk.com> - 2025-11-22 19:28 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-22 20:59 +0100
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-11-26 19:42 +0100
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:06 +0000
Re: Unicode... Mikko <mikko.levanto@iki.fi> - 2025-11-15 12:47 +0200
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:09 +0000
Re: Unicode... Mikko <mikko.levanto@iki.fi> - 2025-11-16 11:22 +0200
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 19:14 +0000
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-15 20:16 +0000
Unicode Sorting (Was Re: Unicode...) Michael Sanders <porkchop@invalid.foo> - 2025-11-16 20:30 +0000
Re: Unicode Sorting (Was Re: Unicode...) Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-11-16 16:13 -0800
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-17 23:49 +0000
Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-11-18 14:27 -0500
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-18 20:17 +0000
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-18 20:40 +0000
Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-11-19 09:08 -0500
Re: Unicode... Michael Bäuerle <michael.baeuerle@stz-e.de> - 2025-11-19 15:29 +0100
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-19 19:22 +0000
Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-26 02:03 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-12-03 06:24 +0100
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-12-03 18:33 +0000
Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-12-03 14:01 -0500
Re: Unicode... bart <bc@freeuk.com> - 2025-12-03 20:15 +0000
Re: Unicode... Michael S <already5chosen@yahoo.com> - 2025-12-03 22:43 +0200
Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-03 12:49 -0800
Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-03 18:15 -0800
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-12-03 23:23 +0000
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-12-04 14:15 +0100
Re: Unicode... Bonita Montero <Bonita.Montero@gmail.com> - 2025-12-04 14:03 +0100
Binary Search Trees (Was Re: Unicode...) Michael Sanders <porkchop@invalid.foo> - 2025-12-04 04:11 +0000
Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-24 06:17 +0000
Re: Unicode... Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-23 22:22 -0800
Re: Unicode... Lynn McGuire <lynnmcguire5@gmail.com> - 2025-12-24 01:41 -0600
Re: Unicode... Michael S <already5chosen@yahoo.com> - 2025-12-24 11:24 +0200
Re: Unicode... scott@slp53.sl.home (Scott Lurndal) - 2025-12-24 17:11 +0000
Re: Unicode... Lynn McGuire <lynnmcguire5@gmail.com> - 2025-12-25 02:00 -0600
Re: Unicode... Michael S <already5chosen@yahoo.com> - 2025-12-25 10:49 +0200
Re: Unicode... Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2025-12-25 10:22 +0100
Re: Unicode... scott@slp53.sl.home (Scott Lurndal) - 2025-12-26 16:28 +0000
Re: Unicode... Lynn McGuire <lynnmcguire5@gmail.com> - 2025-12-27 00:25 -0600
Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-29 23:34 +0000
Re: Unicode... Lynn McGuire <lynnmcguire5@gmail.com> - 2025-12-27 00:29 -0600
Re: Unicode... Michael S <already5chosen@yahoo.com> - 2025-12-27 18:08 +0200
Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-29 23:38 +0000
Re: Unicode... scott@slp53.sl.home (Scott Lurndal) - 2025-12-27 19:17 +0000
Re: Unicode... Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2025-12-27 20:47 +0100
Re: Unicode... Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2025-12-27 20:03 +0000
Re: Unicode... Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2025-12-27 20:05 +0000
Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-29 23:39 +0000
Re: Unicode... Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2025-12-27 22:43 +0100
Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-12-31 18:04 -0500
Re: Unicode... Lawrence D’Oliveiro <ldo@nz.invalid> - 2025-12-31 23:11 +0000
Re: Unicode... James Kuyper <jameskuyper@alumni.caltech.edu> - 2025-12-31 18:36 -0500
Re: Unicode... Philipp Klaus Krause <pkk@spth.de> - 2025-11-23 12:42 +0100
Re: Unicode... Michael Sanders <porkchop@invalid.foo> - 2025-11-23 22:05 +0000
| From | bart <bc@freeuk.com> |
|---|---|
| Date | 2025-12-03 20:15 +0000 |
| Message-ID | <10gq5o5$3kjac$1@dont-email.me> |
| In reply to | #395671 |
On 03/12/2025 19:01, James Kuyper wrote:
> On 2025-12-03 13:33, Michael Sanders wrote:
> ...
>> We want portability across diverse OSs. In my case, the program
>> does NOT care what the character is, it simply needs to be able
>> to find it when searching data & displaying it in an ordered way.
>>
>> The code below works perfectly:
>>
>> #include <stdio.h>
>> #include <string.h>
>>
>> int utf8_display_width(const char *s) {
>>     int w = 0;
>>
>>     while (*s) {
>>         unsigned char b = *s;
>>         unsigned cp;
>>         int n;
>>
>>         // UTF-8 decoder
>>         if (b <= 0x7F) { // 1-byte ASCII
>>             cp = b;
>>             n = 1;
>>         } else if (b >= 0xC0 && b <= 0xDF) { // 2-byte
>>             cp = ((b & 0x1F) << 6) |
>>                  (s[1] & 0x3F);
>>             n = 2;
>>         } else if (b >= 0xE0 && b <= 0xEF) { // 3-byte
>>             cp = ((b & 0x0F) << 12) |
>>                  ((s[1] & 0x3F) << 6) |
>>                  (s[2] & 0x3F);
>>             n = 3;
>>         } else if (b >= 0xF0 && b <= 0xF7) { // 4-byte
>>             cp = ((b & 0x07) << 18) |
>>                  ((s[1] & 0x3F) << 12) |
>>                  ((s[2] & 0x3F) << 6) |
>>                  (s[3] & 0x3F);
>>             n = 4;
>>         } else { // invalid, treat as 1-byte
>>             cp = b;
>>             n = 1;
>>         }
>>
>>         // display width
>>         if (cp >= 0x0300 && cp <= 0x036F) {} // combining marks like é (zero width)
>>         else if ( // double-width characters...
>>             (cp >= 0x1100 && cp <= 0x115F) || // hangul jamo
>>             (cp >= 0x2E80 && cp <= 0xA4CF) || // cjk radicals & unified ideographs
>>             (cp >= 0xAC00 && cp <= 0xD7A3) || // hangul syllables
>>             (cp >= 0xF900 && cp <= 0xFAFF) || // cjk compatibility ideographs
>>             (cp >= 0x1F300 && cp <= 0x1FAFF) // emoji + symbols
>>         ) { w += 2; }
>>         // exceptional wide characters (unicode requirement I've read elsewhere)
>>         else if (cp == 0x2329 || cp == 0x232A) { w += 2; }
>>         else { w += 1; } // normal width for everything else
>>
>>         s += n;
>>     }
>>
>>     return w;
>> }
>>
>> int main(void) {
>>     const char *tests[] = {
>>         "hello",
>>         "Café",
>>         "漢字",
>>         "✓",
>>         "🙂",
>>         NULL
>>     };
>>
>>     // find maximum display width in 1st column
>>     int maxw = 0;
>>     for (int i = 0; tests[i]; i++) {
>>         int w = utf8_display_width(tests[i]);
>>         if (w > maxw) maxw = w;
>>     }
>>
>>     // total padding after each 1st column + 3 spaces
>>     int total_pad = maxw + 3;
>>
>>     for (int i = 0; tests[i]; i++) {
>>         int w = utf8_display_width(tests[i]);
>>         int sl = strlen(tests[i]);
>>         printf("%s", tests[i]);
>>         int pad = total_pad - w;
>>         while (pad-- > 0) putchar(' ');
>>         printf("strlen: %d utf8 display width: %d\n", sl, w);
>>     }
>>
>>     return 0;
>> }
>>
>> // eof
>
>
> I find it confusing that this is supposed to "work perfectly" "across
> diverse OSs". The amount of space that a character takes up varies
> depending upon the installed fonts, especially on whether the font is
> monospaced or proportional. Those fonts can be different for display on
> screen or on a printer. I don't see any query to determine even what the
> current font is, much less what its characteristics are. I don't know
> of any OS-independent way of collecting such information. Does this
> solution "work perfectly" only for your own particular favorite font?
This looks like a solution for a fixed-pitch font. I get this output for
a Windows console display (with - used for space):
hello---strlen: 5 utf8 display width: 5
Café----strlen: 5 utf8 display width: 4
漢字----strlen: 6 utf8 display width: 4
✓-------strlen: 3 utf8 display width: 1
🙂------strlen: 4 utf8 display width: 2
I was hoping this would be lined up, but already, in a Thunderbird edit
Window, the last lines aren't lined up properly.
Same problem with Notepad (fixed pitch) and LibreOffice (fixed pitch).
It only looks alright in Windows and WSL consoles/terminals. But maybe
that's all that's needed.
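One detail of the quoted decoder is worth illustrating: it reads s[1] through s[3] without first checking that those bytes exist or are continuation bytes, so a string that ends in the middle of a multi-byte sequence is read past its terminator. The following is a minimal sketch of a checked decoder, not a replacement endorsed by anyone in the thread; the name `utf8_decode_checked` and its return convention are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Returns the number of bytes consumed (1..4) and stores the code point
 * in *cp, returns 0 at the end of the string, or returns -1 for a stray,
 * truncated, overlong, surrogate, or out-of-range sequence.
 * The contents of *cp are unspecified when -1 is returned. */
int utf8_decode_checked(const char *s, unsigned long *cp)
{
    assert(s != NULL && cp != NULL);
    const unsigned char *p = (const unsigned char *)s;
    unsigned long min;  /* smallest code point allowed for this length */
    int n;

    if (p[0] == 0)
        return 0;
    if (p[0] <= 0x7F) {
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        *cp = p[0] & 0x1F; n = 2; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {
        *cp = p[0] & 0x0F; n = 3; min = 0x800;
    } else if ((p[0] & 0xF8) == 0xF0) {
        *cp = p[0] & 0x07; n = 4; min = 0x10000;
    } else {
        return -1;      /* continuation byte or 0xF8..0xFF as lead byte */
    }

    for (int i = 1; i < n; i++) {
        /* The NUL terminator fails this test, so the loop stops
         * before reading beyond the end of the string. */
        if ((p[i] & 0xC0) != 0x80)
            return -1;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }

    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return -1;
    return n;
}
```

A caller such as the width function could advance by one byte and count one column whenever -1 comes back, which keeps the "invalid, treat as 1-byte" behaviour of the original while avoiding the out-of-bounds read.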
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2025-12-03 22:43 +0200 |
| Message-ID | <20251203224305.00004d8e@yahoo.com> |
| In reply to | #395672 |
On Wed, 3 Dec 2025 20:15:02 +0000
bart <bc@freeuk.com> wrote:

> This looks like a solution for a fixed-pitch font. I get this output
> for a Windows console display (with - used for space):
>
> hello---strlen: 5 utf8 display width: 5
> Café----strlen: 5 utf8 display width: 4

That sounds like luck. The é in your text just happened to be encoded as
U+00E9. What if it were encoded as U+0065,U+00B4 ? (Hopefully, I got the
correct code, I can't really distinguish between similar diacritics).

> 漢字----strlen: 6 utf8 display width: 4
> ✓-------strlen: 3 utf8 display width: 1
> 🙂------strlen: 4 utf8 display width: 2
>
> I was hoping this would be lined up, but already, in a Thunderbird
> edit Window, the last lines aren't lined up properly.
>
> Same problem with Notepad (fixed pitch) and LibreOffice (fixed pitch).
>
> It only looks alright in Windows and WSL consoles/terminals. But
> maybe that's all that's needed.
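To make the precomposed-versus-decomposed point concrete: the combining acute accent is U+0301 (U+00B4 is the spacing acute accent), so the decomposed spelling of é is U+0065 U+0301. The sketch below counts columns under the same simplifying assumption as the posted code, namely that U+0300..U+036F occupy no column and everything else occupies one. The function name is hypothetical, wide characters are deliberately ignored, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count display columns, treating the combining
 * diacritical marks U+0300..U+036F as zero width and every other code
 * point as one column. Assumes valid UTF-8 input. */
size_t columns_ignoring_combining(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t cols = 0;

    while (*p) {
        if ((*p & 0xC0) == 0x80) {   /* continuation byte: not a new character */
            p++;
            continue;
        }
        /* U+0300..U+033F encode as CC 80..CC BF,
         * U+0340..U+036F encode as CD 80..CD AF.
         * Reading p[1] is safe: at worst it is the terminator. */
        int combining =
            (p[0] == 0xCC && (p[1] & 0xC0) == 0x80) ||
            (p[0] == 0xCD && p[1] >= 0x80 && p[1] <= 0xAF);
        if (!combining)
            cols++;
        p++;
    }
    return cols;
}
```

With this, "Caf\xC3\xA9" (precomposed, 5 bytes) and "Cafe\xCC\x81" (decomposed, 6 bytes) both come out as 4 columns, although the two strings differ in strlen and would not compare equal byte for byte.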
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2025-12-03 12:49 -0800 |
| Message-ID | <87bjkfnve4.fsf@example.invalid> |
| In reply to | #395672 |
bart <bc@freeuk.com> writes:
> On 03/12/2025 19:01, James Kuyper wrote:
[...]
>> I find it confusing that this is supposed to "work perfectly"
>> "across
>> diverse OSs". The amount of space that a character takes up varies
>> depending upon the installed fonts, especially on whether the font is
>> monospaced or proportional. Those fonts can be different for display on
>> screen or on a printer. I don't see any query to determine even what the
>> current font is, much less what its characteristics are. I don't know
>> of any OS-independent way of collecting such information. Does this
>> solution "work perfectly" only for your own particular favorite font?
>
> This looks like a solution for a fixed-pitch font. I get this output
> for a Windows console display (with - used for space):
[...]
I think bart is right that this is specific to fixed-width fonts.
For a variable width font, 'W' is going to be wider than '|'.
See also the POSIX `int wcwidth(wchar_t wc)` function, which returns
the "number of column positions of a wide-character code". It does
depend on the current locale.
The assumption seems to be that fixed-width fonts are expected to be
consistent about the widths of characters.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
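As an illustration of the wcwidth route mentioned above, here is a minimal sketch that converts a multibyte string with the standard mbrtowc function and sums the POSIX wcwidth results. The function name `locale_display_width` is hypothetical. The sketch assumes a POSIX system, assumes the caller has already selected a suitable locale (for example with setlocale(LC_ALL, "")), and does not apply to Windows, where wcwidth is not provided.

```c
#define _XOPEN_SOURCE 700   /* ask for the declaration of wcwidth() */
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of terminal columns occupied by the
 * multibyte string s in the current LC_CTYPE locale, or -1 if s contains
 * an invalid sequence or a character that wcwidth() reports as
 * non-printable. */
int locale_display_width(const char *s)
{
    assert(s != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */
    size_t left = strlen(s);
    int cols = 0;

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                 /* invalid or incomplete sequence */
        if (n == 0)
            break;                     /* embedded NUL: stop */
        int w = wcwidth(wc);
        if (w < 0)
            return -1;                 /* non-printable in this locale */
        cols += w;
        s += n;
        left -= n;
    }
    return cols;
}
```

Because the result depends on the locale and on the C library's tables, the same string can legitimately give different answers on different systems, which is the caveat raised in the post above.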
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2025-12-03 18:15 -0800 |
| Message-ID | <877bv3ngad.fsf@example.invalid> |
| In reply to | #395674 |
Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
> bart <bc@freeuk.com> writes:
>> On 03/12/2025 19:01, James Kuyper wrote:
> [...]
>>> I find it confusing that this is supposed to "work perfectly"
>>> "across
>>> diverse OSs". The amount of space that a character takes up varies
>>> depending upon the installed fonts, especially on whether the font is
>>> monospaced or proportional. Those fonts can be different for display on
>>> screen or on a printer. I don't see any query to determine even what the
>>> current font is, much less what its characteristics are. I don't know
>>> of any OS-independent way of collecting such information. Does this
>>> solution "work perfectly" only for your own particular favorite font?
>>
>> This looks like a solution for a fixed-pitch font. I get this output
>> for a Windows console display (with - used for space):
> [...]
>
> I think bart is right that this is specific to fixed-width fonts.
> For a variable width font, 'W' is going to be wider than '|'.
>
> See also the POSIX `int wcwidth(wchar_t wc)` function, which returns
> the "number of column positions of a wide-character code". It does
> depend on the current locale.
>
> The assumption seems to be that fixed-width fonts are expected to be
> consistent about the widths of characters.
And in fact Unicode specifies how many cell positions each printable
character occupies, or at least for some of them.
The following is quoted from wcwidth.c in the xterm sources. The text
was originally written by Markus Kuhn.
* For some graphical characters, the Unicode standard explicitly
* defines a character-cell width via the definition of the East Asian
* FullWidth (F), Wide (W), Half-width (H), and Narrow (Na) classes.
* In all these cases, there is no ambiguity about which width a
* terminal shall use. For characters in the East Asian Ambiguous (A)
* class, the width choice depends purely on a preference of backward
* compatibility with either historic CJK or Western practice.
* Choosing single-width for these characters is easy to justify as
* the appropriate long-term solution, as the CJK practice of
* displaying these characters as double-width comes from historic
* implementation simplicity (8-bit encoded characters were displayed
* single-width and 16-bit ones double-width, even for Greek,
* Cyrillic, etc.) and not any typographic considerations.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
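Width classification of this kind is commonly implemented as a sorted table of code point ranges searched by bisection. The sketch below shows that technique only; the ranges are the ones hard-coded in the program posted earlier in the thread, not a complete or authoritative list, and a real table would be generated from the Unicode EastAsianWidth.txt data file. The names `cp_range`, `wide_ranges`, and `is_wide` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct cp_range { unsigned long first, last; };

/* Illustrative ranges only (taken from the code posted in this thread).
 * They must stay sorted and non-overlapping for the search to work. */
static const struct cp_range wide_ranges[] = {
    { 0x1100,  0x115F  },   /* hangul jamo */
    { 0x2329,  0x232A  },   /* angle brackets */
    { 0x2E80,  0xA4CF  },   /* cjk radicals & unified ideographs */
    { 0xAC00,  0xD7A3  },   /* hangul syllables */
    { 0xF900,  0xFAFF  },   /* cjk compatibility ideographs */
    { 0x1F300, 0x1FAFF },   /* emoji + symbols */
};

/* Return 1 if cp falls inside one of the ranges above, else 0. */
int is_wide(unsigned long cp)
{
    assert(cp <= 0x10FFFFul);          /* outside Unicode: caller bug */
    size_t lo = 0;
    size_t hi = sizeof wide_ranges / sizeof wide_ranges[0];

    while (lo < hi) {                  /* binary search over the intervals */
        size_t mid = lo + (hi - lo) / 2;
        if (cp < wide_ranges[mid].first)
            hi = mid;
        else if (cp > wide_ranges[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

Keeping the data in a table rather than in a chain of comparisons makes it straightforward to regenerate the ranges when a new Unicode version is published.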
| From | Michael Sanders <porkchop@invalid.foo> |
|---|---|
| Date | 2025-12-03 23:23 +0000 |
| Message-ID | <10gqgpi$3or1n$1@dont-email.me> |
| In reply to | #395671 |
On Wed, 3 Dec 2025 14:01:38 -0500, James Kuyper wrote:

> I find it confusing that this is supposed to "work perfectly" "across
> diverse OSs". The amount of space that a character takes up varies
> depending upon the installed fonts, especially on whether the font is
> monospaced or proportional. Those fonts can be different for display on
> screen or on a printer. I don't see any query to determine even what the
> current font is, much less what its characteristics are. I don't know
> of any OS-independent way of collecting such information. Does this
> solution "work perfectly" only for your own particular favorite font?

Just for use in the terminal & yes it works as advertised.

In my case I simply need to match the character the user passed
to the program when searching for a record. I don't want or need to know
what font is used. If the terminal can display it, then I want
to use it. Example, user invokes:

tinybase -s=漢字 data/*.tbf

Output is...

FILE: data/history.tbf
LINE: 170
BLOCK: 4
CRC-8: 0x30
QUERY: 漢字
MATCH: 漢字

TAGS: China, History, <漢字>, [wrap:66]

Ancient China...

1. Geography and Early Beginnings:

Ancient China, a cradle of civilization, evolved along the Yellow
River's fertile plains. Protected by the Himalayas to the south,
the Gobi Desert to the north, and vast seas to the east, this
geographic isolation allowed for a unique and continuous cultural
development spanning millennia.

...

James, earnestly intending no offense - add something to the
conversation rather than complaining - I want to learn & solve
problems, that's where I'm seeking help. Just modify the code, make
it get closer to your ideal. We'll all benefit.

--
:wq
Mike Sanders
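For the matching half of this use case, no decoding is needed at all. UTF-8 is self-synchronizing, so a byte-wise search for a valid UTF-8 query inside valid UTF-8 text cannot match in the middle of a character. A minimal sketch follows, assuming both strings are valid UTF-8 and use the same normalization form; the name `record_matches` is hypothetical and is not taken from the tinybase program.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return 1 if the UTF-8 string query occurs in the
 * UTF-8 string record, else 0. The comparison is purely byte-wise, so
 * precomposed and decomposed spellings of the same text do not match. */
int record_matches(const char *record, const char *query)
{
    assert(record != NULL && query != NULL);
    return strstr(record, query) != NULL;
}
```

The limitation noted in the comment is the same one raised earlier about é: a query typed as U+00E9 will not find text stored as U+0065 U+0301 unless both sides are normalized first.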
| From | Bonita Montero <Bonita.Montero@gmail.com> |
|---|---|
| Date | 2025-12-04 14:15 +0100 |
| Message-ID | <10gs1gd$a1f1$1@raubtier-asyl.eternal-september.org> |
| In reply to | #395671 |
On 03.12.2025 at 20:01, James Kuyper wrote:

> I find it confusing that this is supposed to "work perfectly" "across
> diverse OSs". The amount of space that a character takes up varies
> depending upon the installed fonts, especially on whether the font is
> monospaced or proportional. Those fonts can be different for display on
> screen or on a printer. I don't see any query to determine even what the
> current font is, much less what its characteristics are. I don't know
> of any OS-independent way of collecting such information. Does this
> solution "work perfectly" only for your own particular favorite font?

Can C handle that with the means provided by the standard itself?
And is it really necessary to consider this? Consoles are almost
always fixed-pitch. I guess the standard output of a laser
printer in line-printer mode is also fixed-pitch.
| From | Bonita Montero <Bonita.Montero@gmail.com> |
|---|---|
| Date | 2025-12-04 14:03 +0100 |
| Message-ID | <10gs0qg$9mjl$1@raubtier-asyl.eternal-september.org> |
| In reply to | #395670 |
On 03.12.2025 at 19:33, Michael Sanders wrote:
> On Wed, 3 Dec 2025 06:24:23 +0100, Bonita Montero wrote:
>
>>> Here I'm running any mixture of: Windows/BSD/Linux Mint LMDE.
>> Windows has the ...W() APIs along with codepage-based APIs with
>> the ...A() Suffix. The W()-APIs support UTF-16, so no need for
> Hi Bonita.
>
> Yes that's correct, but...
>
> - that assumes we know in advance what the character is
>
> - it would only work under Windows
>
> We want portability across diverse OSs. In my case, the program
> does NOT care what the character is, it simply needs to be able
> to find it when searching data & displaying it in an ordered way.
VC++ supports the C and C++ locale facilities if you want it to be
portable. The locale support in C++, with its facets, is especially
nice to work with: https://en.cppreference.com/w/cpp/locale.html
>
> The code below works perfectly:
>
> #include <stdio.h>
> #include <string.h>
>
> int utf8_display_width(const char *s) {
>     int w = 0;
>
>     while (*s) {
>         unsigned char b = *s;
>         unsigned cp;
>         int n;
>
>         // UTF-8 decoder
>         if (b <= 0x7F) { // 1-byte ASCII
>             cp = b;
>             n = 1;
>         } else if (b >= 0xC0 && b <= 0xDF) { // 2-byte
>             cp = ((b & 0x1F) << 6) |
>                  (s[1] & 0x3F);
>             n = 2;
>         } else if (b >= 0xE0 && b <= 0xEF) { // 3-byte
>             cp = ((b & 0x0F) << 12) |
>                  ((s[1] & 0x3F) << 6) |
>                  (s[2] & 0x3F);
>             n = 3;
>         } else if (b >= 0xF0 && b <= 0xF7) { // 4-byte
>             cp = ((b & 0x07) << 18) |
>                  ((s[1] & 0x3F) << 12) |
>                  ((s[2] & 0x3F) << 6) |
>                  (s[3] & 0x3F);
>             n = 4;
>         } else { // invalid, treat as 1-byte
>             cp = b;
>             n = 1;
>         }
>
>         // display width
>         if (cp >= 0x0300 && cp <= 0x036F) {} // combining marks like é (zero width)
>         else if ( // double-width characters...
>             (cp >= 0x1100 && cp <= 0x115F) || // hangul jamo
>             (cp >= 0x2E80 && cp <= 0xA4CF) || // cjk radicals & unified ideographs
>             (cp >= 0xAC00 && cp <= 0xD7A3) || // hangul syllables
>             (cp >= 0xF900 && cp <= 0xFAFF) || // cjk compatibility ideographs
>             (cp >= 0x1F300 && cp <= 0x1FAFF) // emoji + symbols
>         ) { w += 2; }
>         // exceptional wide characters (unicode requirement I've read elsewhere)
>         else if (cp == 0x2329 || cp == 0x232A) { w += 2; }
>         else { w += 1; } // normal width for everything else
>
>         s += n;
>     }
>
>     return w;
> }
>
> int main(void) {
>     const char *tests[] = {
>         "hello",
>         "Café",
>         "漢字",
>         "✓",
>         "🙂",
>         NULL
>     };
>
>     // find maximum display width in 1st column
>     int maxw = 0;
>     for (int i = 0; tests[i]; i++) {
>         int w = utf8_display_width(tests[i]);
>         if (w > maxw) maxw = w;
>     }
>
>     // total padding after each 1st column + 3 spaces
>     int total_pad = maxw + 3;
>
>     for (int i = 0; tests[i]; i++) {
>         int w = utf8_display_width(tests[i]);
>         int sl = strlen(tests[i]);
>         printf("%s", tests[i]);
>         int pad = total_pad - w;
>         while (pad-- > 0) putchar(' ');
>         printf("strlen: %d utf8 display width: %d\n", sl, w);
>     }
>
>     return 0;
> }
>
> // eof
>
| From | Michael Sanders <porkchop@invalid.foo> |
|---|---|
| Date | 2025-12-04 04:11 +0000 |
| Subject | Binary Search Trees (Was Re: Unicode...) |
| Message-ID | <10gr1ln$3uldg$1@dont-email.me> |
| In reply to | #395663 |
Ever worked with binary search trees Bonita?
I've been playing around with them, or was awhile back at least...
My criterion was to build nodes alphabetically:
- Left subtree contains keys less than the node
- Right subtree contains keys greater than the node
INSTRUMENTATION

            I
           / \
          E   N
         /   / \
        A   M   S
           /   / \
          I   R   T
             /     \
            N       U
             \     /
              O   T
             /     \
            N       T
--
:wq
Mike Sanders
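The insertion rule described above can be sketched in a few lines of C. This is a minimal illustration rather than the poster's actual code. One assumption is inferred from the diagram rather than stated in the post: a key equal to the current node is sent to the right subtree, which is where the repeated letters of INSTRUMENTATION end up. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    char key;
    struct node *left, *right;
};

/* Insert key below root and return the (possibly new) root.
 * Smaller keys go left; equal or larger keys go right (assumption). */
struct node *bst_insert(struct node *root, char key)
{
    if (root == NULL) {
        struct node *n = malloc(sizeof *n);
        assert(n != NULL);   /* sketch only: abort on allocation failure */
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key < root->key)
        root->left = bst_insert(root->left, key);
    else
        root->right = bst_insert(root->right, key);
    return root;
}

/* In-order walk: store the keys in ascending order in out, starting at
 * index len, and return the new length. out must be large enough. */
size_t bst_inorder(const struct node *root, char *out, size_t len)
{
    if (root == NULL)
        return len;
    len = bst_inorder(root->left, out, len);
    out[len++] = root->key;
    return bst_inorder(root->right, out, len);
}

/* Release every node of the tree. */
void bst_free(struct node *root)
{
    if (root == NULL)
        return;
    bst_free(root->left);
    bst_free(root->right);
    free(root);
}
```

Inserting the letters of INSTRUMENTATION one at a time into an empty tree gives I at the root with E and N as its children, and an in-order walk then yields the letters in alphabetical order. Plain char comparison only orders single-byte characters sensibly, which is where the collation questions discussed in this thread come back in.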
| From | Lawrence D’Oliveiro <ldo@nz.invalid> |
|---|---|
| Date | 2025-12-24 06:17 +0000 |
| Message-ID | <10ig0i8$qjm9$3@dont-email.me> |
| In reply to | #395301 |
On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:

> Could you identify which document guarantees that every Unicode locale
> contains "UTF-8"?

How else would it work? Bytes have to be 8-bit.
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2025-12-23 22:22 -0800 |
| Message-ID | <878qes8kll.fsf@example.invalid> |
| In reply to | #395930 |
Lawrence D’Oliveiro <ldo@nz.invalid> writes:
> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>> Could you identify which document guarantees that every Unicode locale
>> contains "UTF-8"?
>
> How else would it work? Bytes have to be 8-bit.
I can't figure out what point you're trying to make.
Obviously bytes in C have to be *at least* 8 bits, but I don't see
the relevance.
Take a look at the article to which you replied. How does your
followup have anything to do with it?
One of several points that you snipped is that locale names can
contain the string "utf8", not "UTF-8".
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
| From | Lynn McGuire <lynnmcguire5@gmail.com> |
|---|---|
| Date | 2025-12-24 01:41 -0600 |
| Message-ID | <10ig5fa$rra5$1@dont-email.me> |
| In reply to | #395931 |
On 12/24/2025 12:22 AM, Keith Thompson wrote:
> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>>> Could you identify which document guarantees that every Unicode locale
>>> contains "UTF-8"?
>>
>> How else would it work? Bytes have to be 8-bit.
>
> I can't figure out what point you're trying to make.
>
> Obviously bytes in C have to be *at least* 8 bits, but I don't see
> the relevance.
>
> Take a look at the article to which you replied. How does your
> followup have anything to do with it?
>
> One of several points that you snipped is that locale names can
> contain the string "utf8", not "UTF-8".

Did C never work on the 6 bit machines such as the Univac 1108 (36 bit)
or the CDC 7600 (60 bit) ?

Lynn
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2025-12-24 11:24 +0200 |
| Message-ID | <20251224112404.000015df@yahoo.com> |
| In reply to | #395932 |
On Wed, 24 Dec 2025 01:41:30 -0600
Lynn McGuire <lynnmcguire5@gmail.com> wrote:

> On 12/24/2025 12:22 AM, Keith Thompson wrote:
> > Lawrence D’Oliveiro <ldo@nz.invalid> writes:
> >> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
> >>> Could you identify which document guarantees that every Unicode
> >>> locale contains "UTF-8"?
> >>
> >> How else would it work? Bytes have to be 8-bit.
> >
> > I can't figure out what point you're trying to make.
> >
> > Obviously bytes in C have to be *at least* 8 bits, but I don't see
> > the relevance.
> >
> > Take a look at the article to which you replied. How does your
> > followup have anything to do with it?
> >
> > One of several points that you snipped is that locale names can
> > contain the string "utf8", not "UTF-8".
>
> Did C never work on the 6 bit machines such as the Univac 1108 (36
> bit) or the CDC 7600 (60 bit) ?
>
> Lynn

It depends on the definition of the word C.
The requirement for CHAR_BIT > 7 was not present in K&R C. IIRC, it
first appeared in C90.
Also, what prevents a C90 compiler from using 36-bit chars on the
Univac 1108 and 60-bit chars on the CDC 7600? Methinks, it would be
very reasonable.
By chance, that* was the choice made by both TI and Analog for the C
compilers of their word-addressable DSPs.

* - not specifically 36 or 60 bits, but CHAR_BIT = native word width.
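Code that depends on the byte width can state that dependency explicitly through CHAR_BIT from <limits.h>. A minimal sketch, assuming a C11 compiler for _Static_assert; the helper name `storage_bits` is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Standard C guarantees CHAR_BIT >= 8, so this can never fire on a
 * conforming implementation; it documents the assumption in code. */
_Static_assert(CHAR_BIT >= 8, "a C byte has at least 8 bits");

/* UTF-8 code units are values 0..255, so they fit in an unsigned char
 * even on an implementation with 9-bit or 36-bit bytes. */
_Static_assert(UCHAR_MAX >= 255, "unsigned char can hold any UTF-8 code unit");

/* Hypothetical helper: number of storage bits in nbytes bytes. */
size_t storage_bits(size_t nbytes)
{
    assert(nbytes <= (size_t)-1 / CHAR_BIT);   /* avoid overflow */
    return nbytes * CHAR_BIT;
}
```

On an implementation where CHAR_BIT equals the native word width, storage_bits(sizeof(int)) may well equal storage_bits(1), which is one reason portable code avoids assuming that sizeof(int) is 4.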
| From | scott@slp53.sl.home (Scott Lurndal) |
|---|---|
| Date | 2025-12-24 17:11 +0000 |
| Message-ID | <j9V2R.199268$79B9.139603@fx14.iad> |
| In reply to | #395932 |
Lynn McGuire <lynnmcguire5@gmail.com> writes:
>On 12/24/2025 12:22 AM, Keith Thompson wrote:
>> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
>>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>>>> Could you identify which document guarantees that every Unicode locale
>>>> contains "UTF-8"?
>>>
>>> How else would it work? Bytes have to be 8-bit.
>>
>> I can't figure out what point you're trying to make.
>>
>> Obviously bytes in C have to be *at least* 8 bits, but I don't see
>> the relevance.
>>
>> Take a look at the article to which you replied. How does your
>> followup have anything to do with it?
>>
>> One of several points that you snipped is that locale names can
>> contain the string "utf8", not "UTF-8".
>
>Did C never work on the 6 bit machines such as the Univac 1108 (36 bit)

Yes, there is a C compiler for the Univac machines. The byte size is
9 bits.
| From | Lynn McGuire <lynnmcguire5@gmail.com> |
|---|---|
| Date | 2025-12-25 02:00 -0600 |
| Message-ID | <10iiquh$1n1it$1@dont-email.me> |
| In reply to | #395952 |
On 12/24/2025 11:11 AM, Scott Lurndal wrote:
> Lynn McGuire <lynnmcguire5@gmail.com> writes:
>> On 12/24/2025 12:22 AM, Keith Thompson wrote:
>>> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
>>>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>>>>> Could you identify which document guarantees that every Unicode locale
>>>>> contains "UTF-8"?
>>>>
>>>> How else would it work? Bytes have to be 8-bit.
>>>
>>> I can't figure out what point you're trying to make.
>>>
>>> Obviously bytes in C have to be *at least* 8 bits, but I don't see
>>> the relevance.
>>>
>>> Take a look at the article to which you replied. How does your
>>> followup have anything to do with it?
>>>
>>> One of several points that you snipped is that locale names can
>>> contain the string "utf8", not "UTF-8".
>>
>> Did C never work on the 6 bit machines such as the Univac 1108 (36 bit)
>
> Yes, there is a C compiler for the Univac machines. The byte size is
> 9 bits.

I get the feeling that you are messing with me. That would be four 9
bit characters per 36 bit word.

But the machinations to store that unnatural 9 bits would be crazy. I
doubt that would be supported in hardware.

Lynn
| From | Michael S <already5chosen@yahoo.com> |
|---|---|
| Date | 2025-12-25 10:49 +0200 |
| Message-ID | <20251225104901.00005fb1@yahoo.com> |
| In reply to | #395965 |
On Thu, 25 Dec 2025 02:00:16 -0600
Lynn McGuire <lynnmcguire5@gmail.com> wrote:

> On 12/24/2025 11:11 AM, Scott Lurndal wrote:
> > Lynn McGuire <lynnmcguire5@gmail.com> writes:
> >> On 12/24/2025 12:22 AM, Keith Thompson wrote:
> >>> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
> >>>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
> >>>>> Could you identify which document guarantees that every Unicode
> >>>>> locale contains "UTF-8"?
> >>>>
> >>>> How else would it work? Bytes have to be 8-bit.
> >>>
> >>> I can't figure out what point you're trying to make.
> >>>
> >>> Obviously bytes in C have to be *at least* 8 bits, but I don't see
> >>> the relevance.
> >>>
> >>> Take a look at the article to which you replied. How does your
> >>> followup have anything to do with it?
> >>>
> >>> One of several points that you snipped is that locale names can
> >>> contain the string "utf8", not "UTF-8".
> >>
> >> Did C never work on the 6 bit machines such as the Univac 1108 (36
> >> bit)
> >
> > Yes, there is a C compiler for the Univac machines. The byte size
> > is 9 bits.
>
> I get the feeling that you are messing with me. That would be four 9
> bit characters per 36 bit word.
>
> But the machinations to store that unnatural 9 bits would be crazy.
> I doubt that would be supported in hardware.
>
> Lynn

Doesn't the same apply even more strongly to your original suggestion
to use 6-bit characters?
| From | Janis Papanagnou <janis_papanagnou+ng@hotmail.com> |
|---|---|
| Date | 2025-12-25 10:22 +0100 |
| Message-ID | <10iivo3$25ihi$9@dont-email.me> |
| In reply to | #395967 |
On 2025-12-25 09:49, Michael S wrote:
> On Thu, 25 Dec 2025 02:00:16 -0600
> Lynn McGuire <lynnmcguire5@gmail.com> wrote:
>
>> On 12/24/2025 11:11 AM, Scott Lurndal wrote:
>>> Lynn McGuire <lynnmcguire5@gmail.com> writes:
>>>>
>>>> Did C never work on the 6 bit machines such as the Univac 1108 (36
>>>> bit)
>>>
>>> Yes, there is a C compiler for the Univac machines. The byte size
>>> is 9 bits.
>>
>> I get the feeling that you are messing with me. That would be four 9
>> bit characters per 36 bit word.
>>
>> But the machinations to store that unnatural 9 bits would be crazy.
>> I doubt that would be supported in hardware.
>
> Does not the same apply even stronger to your original suggestion to
> use 6-bit characters?

I don't recall whether the mainframes I used - and which of them -
actually had a "C" compiler; I think our 360-clone(?) at least had one.

All I can say is that it seems natural to support characters of
appropriate sizes. Our CDC (175 or 176; 60 bit) used 6 bit characters
in Pascal (the 'text' data type was a 'packed array [1..10] of
character'). And I'd suppose that a 36 bit based architecture might
use 9 bit characters (or maybe use the spare bit just for error
checking, or ignore it?). Anyway, in my K&R version there's the
"Honeywell 6000" hardware listed with a 9 bit 'char' type.

Janis
[toc] | [prev] | [next] | [standalone]
| From | scott@slp53.sl.home (Scott Lurndal) |
|---|---|
| Date | 2025-12-26 16:28 +0000 |
| Message-ID | <uIy3R.69122$gOda.66163@fx48.iad> |
| In reply to | #395967 |
Michael S <already5chosen@yahoo.com> writes:
>On Thu, 25 Dec 2025 02:00:16 -0600
>Lynn McGuire <lynnmcguire5@gmail.com> wrote:
>
>> >> Did C never work on the 6 bit machines such as the Univac 1108 (36
>> >> bit)
>> >
>> > Yes, there is a C compiler for the Univac machines. The byte size
>> > is 9 bits.
>>
>> I get the feeling that you are messing with me. That would be four 9
>> bit characters per 36 bit word.

Indeed, that would be the case.

You know, you can always look this stuff up.

https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Data_formats
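To make "four 9 bit characters per 36 bit word" concrete, here is a rough sketch that emulates the layout in portable C. It is not how the Univac hardware or its C compiler does it: the 36 bit word is simulated in the low bits of a `uint64_t`, the names `qw_store` and `qw_load` are invented here, and numbering the slots from the left is an assumption.

```c
#include <stdint.h>

#define QW_BITS     9u      /* bits per quarter word */
#define QW_PER_WORD 4u      /* quarter words per 36 bit word */
#define QW_MASK     0x1FFu  /* nine 1-bits */

/* Store a 9 bit value into slot idx (0..3) of a simulated 36 bit
   word. Assumption: slot 0 is the leftmost (most significant). */
static uint64_t qw_store(uint64_t word, unsigned idx, unsigned value)
{
    unsigned shift = (QW_PER_WORD - 1u - idx) * QW_BITS;
    word &= ~((uint64_t)QW_MASK << shift);          /* clear the slot */
    word |= (uint64_t)(value & QW_MASK) << shift;   /* insert value   */
    return word;
}

/* Fetch the 9 bit value held in slot idx (0..3). */
static unsigned qw_load(uint64_t word, unsigned idx)
{
    unsigned shift = (QW_PER_WORD - 1u - idx) * QW_BITS;
    return (unsigned)((word >> shift) & QW_MASK);
}
```

On a machine with quarter-word access none of this shifting and masking would be visible to the programmer; the sketch only shows that four 9 bit fields tile a 36 bit word exactly, with no bits left over.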
[toc] | [prev] | [next] | [standalone]
| From | Lynn McGuire <lynnmcguire5@gmail.com> |
|---|---|
| Date | 2025-12-27 00:25 -0600 |
| Message-ID | <10inu5h$385b0$1@dont-email.me> |
| In reply to | #395985 |
On 12/26/2025 10:28 AM, Scott Lurndal wrote:
> Michael S <already5chosen@yahoo.com> writes:
>> On Thu, 25 Dec 2025 02:00:16 -0600
>> Lynn McGuire <lynnmcguire5@gmail.com> wrote:
>>
>>>>> Did C never work on the 6 bit machines such as the Univac 1108 (36
>>>>> bit)
>>>>
>>>> Yes, there is a C compiler for the Univac machines. The byte size
>>>> is 9 bits.
>>>
>>> I get the feeling that you are messing with me. That would be four 9
>>> bit characters per 36 bit word.
>
> Indeed, that would be the case.
>
> You know, you can always look this stuff up.
>
> https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Data_formats

Wild. I wrote Fortran IV/66 software on Univac 1108 from 1975 to 1980
and never knew that it had quarter word instructions. We stored 6
characters in the 36 bit words (all upper case) until we ported to the
IBM 370 in 1978 or 1979 when we had to switch to four characters per
word.

You know, we ported to the Prime 450 in 1977 when we bought one. If I
remember correctly, the Prime was a 32 bit word / 8 bit byte machine so
we did the 4 characters max for a integer on that port, not the IBM 370
port. All those years run together now so I am not sure which and what
port happened when at all.

It was a major change in our software and used a lot more ram in
storing characters in integer arrays.

We did not move to Fortran 77 until 1990 or so since the mainframe
vendors charged a lot more to use the F77 compiler instead of the F66
compiler, compile time was way slower also.

Lynn
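The "6 characters in the 36 bit words" scheme, and why going to four characters per word costs memory, can be sketched roughly as follows. Everything here is illustrative: the 6 bit code table is made up (it is not Fieldata or any real Univac code), the 36 bit word is simulated in a `uint64_t`, and `pack6`, `unpack6` and `words_needed` are invented names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 6 bit code (NOT Fieldata): the index in this table is
   the code, so 0 = space, 1..26 = A..Z, 27..36 = 0..9. */
static const char sixbit_table[] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

/* Map a character to its code; anything not in the table becomes space. */
static unsigned sixbit_encode(char c)
{
    const char *p = (c != '\0') ? strchr(sixbit_table, c) : NULL;
    return p ? (unsigned)(p - sixbit_table) : 0u;
}

/* Pack the first six characters of s, left to right, into the low
   36 bits of the result. Shorter strings are padded with spaces. */
static uint64_t pack6(const char *s)
{
    uint64_t word = 0;
    size_t len = strlen(s);
    for (size_t i = 0; i < 6; i++) {
        char c = (i < len) ? s[i] : ' ';
        word = (word << 6) | sixbit_encode(c);
    }
    return word;
}

/* Unpack a simulated 36 bit word into six characters plus a NUL. */
static void unpack6(uint64_t word, char out[7])
{
    for (int i = 5; i >= 0; i--) {
        unsigned code = (unsigned)(word & 0x3Fu);
        out[i] = (code < sizeof sixbit_table - 1) ? sixbit_table[code] : '?';
        word >>= 6;
    }
    out[6] = '\0';
}

/* Words needed to hold n characters at per_word characters per word. */
static size_t words_needed(size_t n, size_t per_word)
{
    return (n + per_word - 1) / per_word;
}
```

For an 80 character line, `words_needed(80, 6)` is 14 while `words_needed(80, 4)` is 20, so the same text takes roughly half again as many words at four characters per word.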
[toc] | [prev] | [next] | [standalone]
| From | Lawrence D’Oliveiro <ldo@nz.invalid> |
|---|---|
| Date | 2025-12-29 23:34 +0000 |
| Message-ID | <10iv35t$1dnpl$1@dont-email.me> |
| In reply to | #395991 |
On Sat, 27 Dec 2025 00:25:51 -0600, Lynn McGuire wrote:

> We did not move to Fortran 77 until 1990 or so since the mainframe
> vendors charged a lot more to use the F77 compiler instead of the
> F66 compiler, compile time was way slower also.

Just in time for Fortran-90 to come out ...
[toc] | [prev] | [next] | [standalone]
| From | Lynn McGuire <lynnmcguire5@gmail.com> |
|---|---|
| Date | 2025-12-27 00:29 -0600 |
| Message-ID | <10inucr$385b0$2@dont-email.me> |
| In reply to | #395967 |
On 12/25/2025 2:49 AM, Michael S wrote:
> On Thu, 25 Dec 2025 02:00:16 -0600
> Lynn McGuire <lynnmcguire5@gmail.com> wrote:
>
>> On 12/24/2025 11:11 AM, Scott Lurndal wrote:
>>> Lynn McGuire <lynnmcguire5@gmail.com> writes:
>>>> On 12/24/2025 12:22 AM, Keith Thompson wrote:
>>>>> Lawrence D’Oliveiro <ldo@nz.invalid> writes:
>>>>>> On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
>>>>>>> Could you identify which document guarantees that every Unicode
>>>>>>> locale contains "UTF-8"?
>>>>>>
>>>>>> How else would it work? Bytes have to be 8-bit.
>>>>>
>>>>> I can't figure out what point you're trying to make.
>>>>>
>>>>> Obviously bytes in C have to be *at least* 8 bits, but I don't see
>>>>> the relevance.
>>>>>
>>>>> Take a look at the article to which you replied. How does your
>>>>> followup have anything to do with it?
>>>>>
>>>>> One of several points that you snipped is that locale names can
>>>>> contain the string "utf8", not "UTF-8".
>>>>
>>>> Did C never work on the 6 bit machines such as the Univac 1108 (36
>>>> bit)
>>>
>>> Yes, there is a C compiler for the Univac machines. The byte size
>>> is 9 bits.
>>
>> I get the feeling that you are messing with me. That would be four 9
>> bit characters per 36 bit word.
>>
>> But the machinations to store that unnatural 9 bits would be crazy.
>> I doubt that would be supported in hardware.
>>
>> Lynn
>>
>
> Does not the same apply even stronger to your original suggestion to
> use 6-bit characters?

Those 6 bit characters, upper case only, were on the 36 bit (Univac
1108) or 60 bit (CDC 7600) machines. Those machines were native 6 bit
bytes, at 6 bytes per word or 10 bytes per word.

Those machines were superseded by the 32 bit machines with 8 bit
characters. And now we have the 64 bit machines with 8 bit characters.
We will have 128 bit machines soon in the relative sense, if not
already.

Lynn