Simple(?) Unicode questions

Started by	Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
First post	2023-12-09 08:04 +0100
Last post	2024-01-24 20:38 -0800
Articles	20 on this page of 22 — 9 participants

  Simple(?) Unicode questions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2023-12-09 08:04 +0100
    Re: Simple(?) Unicode questions Richard Damon <richard@damon-family.org> - 2023-12-09 08:01 -0500
    Re: Simple(?) Unicode questions jak <nospam@please.ty> - 2023-12-09 15:59 +0100
      Re: Simple(?) Unicode questions Spiros Bousbouras <spibou@gmail.com> - 2023-12-09 15:32 +0000
        Re: Simple(?) Unicode questions jak <nospam@please.ty> - 2023-12-09 18:57 +0100
    Re: Simple(?) Unicode questions Spiros Bousbouras <spibou@gmail.com> - 2023-12-09 15:12 +0000
      Re: Simple(?) Unicode questions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2023-12-09 17:59 +0100
        Re: Simple(?) Unicode questions Spiros Bousbouras <spibou@gmail.com> - 2023-12-09 17:19 +0000
          Re: Simple(?) Unicode questions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2023-12-09 18:43 +0100
      Re: Simple(?) Unicode questions Spiros Bousbouras <spibou@gmail.com> - 2023-12-09 17:40 +0000
      Re: Simple(?) Unicode questions Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2023-12-09 13:46 -0800
    Re: Simple(?) Unicode questions spender <spender@yeah.net> - 2023-12-13 11:05 +0800
      Re: Simple(?) Unicode questions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2023-12-13 04:24 +0100
      Re: Simple(?) Unicode questions Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2023-12-12 19:28 -0800
      Re: Simple(?) Unicode questions James Kuyper <jameskuyper@alumni.caltech.edu> - 2023-12-13 00:40 -0500
        Re: Simple(?) Unicode questions Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-01-19 07:43 -0800
      Re: Simple(?) Unicode questions Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2023-12-13 14:56 +0000
        Re: Simple(?) Unicode questions Tim Rentsch <tr.17687@z991.linuxsc.com> - 2023-12-25 02:03 -0800
          Re: Simple(?) Unicode questions Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2023-12-25 14:43 -0800
            Re: Simple(?) Unicode questions Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-01-20 09:33 -0800
              Re: Simple(?) Unicode questions Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-01-20 14:19 -0800
                Re: Simple(?) Unicode questions Tim Rentsch <tr.17687@z991.linuxsc.com> - 2024-01-24 20:38 -0800

#379587 — Simple(?) Unicode questions

From	Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date	2023-12-09 08:04 +0100
Subject	Simple(?) Unicode questions
Message-ID	<ul13hl$24kg5$1@dont-email.me>

After decades I'm again writing some C code and intended to use some
Unicode characters for output.  I'm using C99.  I have two questions.

I am able to inline the character in the code like:  printf ("█\n");

But I also want to make it a printf argument:  printf ("%c\n", '█');
which doesn't work (at least not in the depicted way).

And I want to declare such characters, like:  char ch = '█';
which also doesn't work, and neither does:  wchar_t ch = '█';
And ideally the character should not be copy/pasted into the code
but given by some standard representation like '\u2588' (or so).

Without giving all the gory details about the "problems of Unicode",
are there practical answers to those questions that "simply work"
and reliably?

I have experimented and observed that working with strings at least
*seems* to work:  char * ch = "\u2588";  printf ("%s\n", ch);
Is that an acceptable/reliable and the usual way in C to tackle the
issue?

Thanks.

Janis

#379588

From	Richard Damon <richard@damon-family.org>
Date	2023-12-09 08:01 -0500
Message-ID	<ul1oel$3aems$1@i2pn2.org>
In reply to	#379587

On 12/9/23 2:04 AM, Janis Papanagnou wrote:
> After decades I'm again writing some C code and intended to use some
> Unicode characters for output.  I'm using C99.  I have two questions.

Several different things count as a "character" in C:

- "char", which holds a single "narrow" character;
- character strings, which can represent multibyte characters;
- "wchar_t", which can represent a "wide" character as a single unit.

> 
> I am able to inline the character in the code like:  printf ("█\n");

That works because, although it isn't a single "narrow character", it can
be converted into a multibyte character string that represents that
character.
> 
> But I also want to make it a printf argument:  printf ("%c\n", '█');
> which doesn't work (at least not in the depicted way).

That fails because it isn't a "narrow character" and thus can't be put
into a single "char".

> 
> And I want to declare such characters, like:  char ch = '█';
> which also doesn't work, and neither does:  wchar_t ch = '█';
> And ideally the character should not be copy/pasted into the code
> but given by some standard representation like '\u2588' (or so).

You can use wchar_t ch = L'█'; or wchar_t ch = L'\u2588';
The key is that you are creating a WIDE character, not a narrow character.

> 
> Without giving all the gory details about the "problems of Unicode",
> are there practical answers to those questions that "simply work"
> and reliably?
> 
> I have experimented and observed that working with strings at least
> *seems* to work:  char * ch = "\u2588";  printf ("%s\n", ch);
> Is that an acceptable/reliable and the usual way in C to tackle the
> issue?
> 
> Thanks.
> 
> Janis

You need to decide whether you will represent the larger character set
with wide characters throughout or with multibyte character strings.

Most often it is multibyte character strings, as wide characters are
less well supported on most systems.
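
A minimal sketch of the multibyte-string approach, assuming the output
device expects UTF-8. The helper name make_bar is made up for
illustration, and the block character is spelled as explicit bytes so
that the result does not depend on the compiler's execution character
set:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+2588 FULL BLOCK, written as explicit bytes. */
static const char FULL_BLOCK[] = "\xE2\x96\x88";

/* Fill dst with `count` copies of the block as one NUL-terminated
   multibyte string. Returns the number of bytes written (excluding the
   NUL), or 0 if dst is too small. Hypothetical helper, not a library
   function. */
size_t make_bar(char *dst, size_t cap, size_t count)
{
    const size_t unit = sizeof FULL_BLOCK - 1;   /* 3 bytes per block */

    assert(dst != NULL);
    if (cap == 0)
        return 0;
    if (count > (cap - 1) / unit) {              /* would not fit */
        dst[0] = '\0';
        return 0;
    }
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * unit, FULL_BLOCK, unit);
    dst[count * unit] = '\0';
    return count * unit;
}
```

After a successful call, printf("%s\n", buf) should show the bar on a
UTF-8 terminal. With a compiler whose execution character set is UTF-8
(the usual default for gcc and clang, as far as I know), "\u2588" in a
string literal should produce the same three bytes.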

#379589

From	jak <nospam@please.ty>
Date	2023-12-09 15:59 +0100
Message-ID	<ul1vbr$289m4$1@dont-email.me>
In reply to	#379587

Janis Papanagnou wrote:
> After decades I'm again writing some C code and intended to use some
> Unicode characters for output.  I'm using C99.  I have two questions.
> 
> I am able to inline the character in the code like:  printf ("█\n");
> 
> But I also want to make it a printf argument:  printf ("%c\n", '█');
> which doesn't work (at least not in the depicted way).
> 
> And I want to declare such characters, like:  char ch = '█';
> which also doesn't work, and neither does:  wchar_t ch = '█';
> And ideally the character should not be copy/pasted into the code
> but given by some standard representation like '\u2588' (or so).
> 
> Without giving all the gory details about the "problems of Unicode",
> are there practical answers to those questions that "simply work"
> and reliably?
> 
> I have experimented and observed that working with strings at least
> *seems* to work:  char * ch = "\u2588";  printf ("%s\n", ch);
> Is that an acceptable/reliable and the usual way in C to tackle the
> issue?
> 
> Thanks.
> 
> Janis
> 

Hi,
You merged two questions together; I will try to separate them.
Initialization of wchar_t types:
Just as char strings can be initialized with string literals:

char str[] = "Hello";

the same can be done for wchar_t strings using the prefix L (any one of
the following):

wchar_t wstr[] = L"Hello";
wchar_t wstr[] = L"█\n";
wchar_t wstr[] = L"\u2588\n";

A similar thing is possible for individual characters:

char ch = 'a';
wchar_t wch = L'a';

With the prefix L, it is therefore possible to use extended characters:

wchar_t wch = L'█';
or:
wchar_t wch = 0x2588;
or:
wchar_t wch = L'\u2588';
or:
wchar_t wch = L"\u2588"[0];
or:
wchar_t wch = *L"█";

printf also has a corresponding length modifier ('l') for the wchar_t
type:

printf("%s", str);
printf("%ls", wstr);

printf("%c", ch);
printf("%lc", wch);

But it would be more correct to use the wide-character versions of the
functions (in many cases it is also more convenient):

wprintf(L"%ls", wstr);
wprintf(L"%lc", wch);

However, remember to set the locale to match your terminal; otherwise
the characters you see may not be the ones you expect, or you may see
nothing at all. Calling 'setlocale' allows the program to convert the
wide character it prints into the encoding used by your terminal's
locale.
To put it another way: if I write a program that prints a non-ASCII
Unicode character and my terminal uses UTF-8, I will see nothing unless
the program converts the character from its Unicode code point to UTF-8.
To demonstrate, I will write the character to a file:

$> cat foo.c
#include <stdio.h>
#include <stddef.h>
#include <wchar.h>
#include <locale.h>

int main()
{
     wchar_t wch = L'\u2588';
     FILE *fp;

     setlocale(LC_ALL, "");

     if((fp = fopen("char.txt", "wb")) != NULL)
     {
         fwprintf(fp, L"%lc", wch);
         fclose(fp);
     }
     return 0;
}

$> hexdump -C  char.txt
00000000  e2 96 88                                          |...|
00000003

As you can see, the character code is not the same as the one I sent.
Python makes the conversion easy to see:

$> python
  >>> u'\u2588'.encode('utf-8')
  b'\xe2\x96\x88'

$>

#379591

From	Spiros Bousbouras <spibou@gmail.com>
Date	2023-12-09 15:32 +0000
Message-ID	<77XeQojqcvfK7uNgV@bongo-ra.co>
In reply to	#379589

On Sat, 9 Dec 2023 15:59:08 +0100
jak <nospam@please.ty> wrote:
> To explain myself better if I write a program that prints an extended
> unicode character and my terminal uses the UTF-8 characters if the
> program does not convert the character from Unicode to UTF-8 I will not
> see anything. To prove it I will send the character to a file:
> 
> $> cat foo.c
> #include <stdio.h>
> #include <stddef.h>
> #include <wchar.h>
> #include <locale.h>
> 
> int main()
> {
>      wchar_t wch = L'\u2588';
>      FILE *fp;
> 
>      setlocale(LC_ALL, "");
> 
>      if((fp = fopen("char.txt", "wb")) != NULL)
>      {
>          fwprintf(fp, L"%lc", wch);
>          fclose(fp);
>      }
>      return 0;
> }
> 
> $> hexdump -C  char.txt
> 00000000  e2 96 88                                          |...|
> 00000003
> 
> As you can see the character code is not the same that I sent.

In what way is it not the same as what you sent ? With  hexdump  you
can only hope to see octets regardless of what the octets encode. So
you read back the octets which are the UTF-8 encoding of codepoint
U+2588 . What you got is exactly what I would expect to see. If you
use a terminal which supports UTF-8 and has the necessary font and
you do

    cat char.txt

what do you see ? I expect you will see the block character.

> With python it is easy to highlight the conversion:
> 
> $> python
>   >>> u'\u2588'.encode('utf-8')
>   b'\xe2\x96\x88'

#379596

From	jak <nospam@please.ty>
Date	2023-12-09 18:57 +0100
Message-ID	<ul29pc$29sdi$1@dont-email.me>
In reply to	#379591

Spiros Bousbouras wrote:
> On Sat, 9 Dec 2023 15:59:08 +0100
> jak <nospam@please.ty> wrote:
>> To explain myself better if I write a program that prints an extended
>> unicode character and my terminal uses the UTF-8 characters if the
>> program does not convert the character from Unicode to UTF-8 I will not
>> see anything. To prove it I will send the character to a file:
>>
>> $> cat foo.c
>> #include <stdio.h>
>> #include <stddef.h>
>> #include <wchar.h>
>> #include <locale.h>
>>
>> int main()
>> {
>>       wchar_t wch = L'\u2588';
>>       FILE *fp;
>>
>>       setlocale(LC_ALL, "");
>>
>>       if((fp = fopen("char.txt", "wb")) != NULL)
>>       {
>>           fwprintf(fp, L"%lc", wch);
>>           fclose(fp);
>>       }
>>       return 0;
>> }
>>
>> $> hexdump -C  char.txt
>> 00000000  e2 96 88                                          |...|
>> 00000003
>>
>> As you can see the character code is not the same that I sent.
> 
> In what way is it not the same as what you sent ? With  hexdump  you
> can only hope to see octets regardless of what the octets encode. So
> you read back the octets which are the UTF-8 encoding of codepoint
> U+2588 .What you got is exactly what I would expect to see. If you
> use a terminal which supports UTF-8 and has the necessary font and
> you do
> 

Sorry but your comment is not clear to me. I gave this explanation
because it seemed to me that it was not clear to the OP that a
conversion takes place during the printf. Also I wouldn't take what
you say for granted:

$> cat foo.c
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main()
{
     union {
         unsigned char c[sizeof(wchar_t) * 10];  /* byte view of w; c[0] relied on a GNU extension */
         wchar_t w[10];
     } str = {.w = L"\u2588"};

     setlocale(LC_ALL, "");

     printf("\nraw data: ");
     for(size_t i = 0; str.c[i] != '\0'; i++)
         printf("%02X  ", str.c[i]);
     printf("\n");

     FILE *fp;
     if((fp = fopen("char.txt", "wb")))
     {
         fwprintf(fp, L"%ls", str.w);
         fclose(fp);
     }
}

compiled with gcc:
$> gcc foo.c -o foo
$> foo

raw data: 88  25

$> od -t x1 char.txt
0000000 e2 96 88
0000003

compiled with tcc:
$> tcc foo.c
$> foo

raw data: 88  25

$> od -t x1 char.txt
0000000 88 25
0000002

oops...

>      cat char.txt
> 
> what do you see ? I expect you will see the block character.
> 
>> With python it is easy to highlight the conversion:
>>
>> $> python
>>    >>> u'\u2588'.encode('utf-8')
>>    b'\xe2\x96\x88'

#379590

From	Spiros Bousbouras <spibou@gmail.com>
Date	2023-12-09 15:12 +0000
Message-ID	<=H=fRiU4BbThlUWDM@bongo-ra.co>
In reply to	#379587

On Sat, 9 Dec 2023 08:04:20 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> After decades I'm again writing some C code and intended to use some
> Unicode characters for output.  I'm using C99.  I have two questions.

I assume you have already read  <ul1oel$3aems$1@i2pn2.org> . I will add
that for  printf() and wide characters or strings you need  %lc  and
%ls  respectively. The C99 standard says for  %ls

    If an l length modifier is present, the argument shall be a pointer to
    the initial element of an array of wchar_t type. Wide characters from the
    array are converted to multibyte characters (each as if by a call to the
    wcrtomb function, with the conversion state described by an mbstate_t
    object initialized to zero before the first wide character is converted)
    up to and including a terminating null wide character. The resulting
    multibyte characters are written up to (but not including) the
    terminating null character (byte).

I don't think there is a standard way to determine which conversions
wcrtomb() can handle. Moreover , the answer depends on the current setting
of the LC_CTYPE locale category.
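
One thing a program can do is simply try the conversion at run time and
see whether it succeeds in the current locale. A rough sketch, where
wide_to_multibyte is a name invented for this example and not a library
function:

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Convert one wide character to its multibyte form under the current
   LC_CTYPE setting. `out` must have room for MB_LEN_MAX bytes.
   Returns the number of bytes stored, or 0 if the current locale cannot
   represent wc. */
size_t wide_to_multibyte(wchar_t wc, char out[MB_LEN_MAX])
{
    mbstate_t state;
    size_t n;

    assert(out != NULL);
    memset(&state, 0, sizeof state);   /* initial conversion state */
    n = wcrtomb(out, wc, &state);
    return n == (size_t)-1 ? 0 : n;    /* (size_t)-1 means encoding error */
}
```

Whether a call such as wide_to_multibyte(L'\u2588', out) succeeds depends
on the locale in effect, so it is best tried after setlocale(LC_ALL, "")
or whatever locale setup the program uses.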

My own approach would be to do as much as possible in my own code. A lot
depends on whether you need to pass your own characters (of whatever type) to
some external library which expects a specific type like   wchar_t  or not.
There are many different scenarios so I will cover what would be most likely
to occur in my own code.

- No external library involved.
- Output encoded in UTF-8
- The text editor I use to write the code stores everything as UTF-8.

With the above assumptions I would simply use ordinary C strings and put
UTF-8 in them like  "ΑΒΓΔΕΖΗΘ..."  and output them in the ordinary way.
It's not guaranteed to work but it most likely will.

If I want to use directly unicode codepoints I will encode them as
unsigned long  which is guaranteed to be wide enough to cover the whole
range of codepoints values ; in contrast , it is conforming for  wchar_t  
to cover no greater range than  char . Converting from codepoints to UTF-8
is an easy and pleasant exercise. So I may have

typedef unsigned long codepoint ;
codepoint my_wide_string = { 0x2588 , ... } ;

Then convert from that to UTF-8 and output the UTF-8 octets.

With this approach you can store the codepoints in whatever textual
representation you want , say in some configuration file and read that
during the start up of your programme.
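
For illustration , the codepoint to UTF-8 conversion mentioned above
might look roughly like this. The function name utf8_encode and the
choice to reject surrogates are assumptions of this sketch :

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long codepoint;

/* Encode cp as UTF-8 into out, which must have room for 4 bytes.
   Returns the number of bytes written, or 0 if cp is a surrogate or
   lies beyond U+10FFFF. */
size_t utf8_encode(codepoint cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80UL) {                          /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                         /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)       /* surrogates: invalid */
        return 0;
    if (cp < 0x10000UL) {                       /* 1110xxxx + 2 more */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                     /* 11110xxx + 3 more */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

The bytes can then be written with  fwrite(out, 1, n, stdout)  ; for
0x2588 the result is the three octets e2 96 88.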

[...]

> And ideally the character should not be copy/pasted into the code
> but given by some standard representation like '\u2588' (or so).

Why is that ? It seems to me that it makes the code harder to understand.

> Without giving all the gory details about the "problems of Unicode",
> are there practical answers to those questions that "simply work"
> and reliably?

What works reliably depends a lot on what you're trying to do. Unicode in
general is messy.

> I have experimented and observed that working with strings at least
> *seems* to work:  char * ch = "\u2588";  printf ("%s\n", ch);
> Is that an acceptable/reliable and the usual way in C to tackle the
> issue?

If you do

    char * ch = "\u2588" ;
    size_t i ;
    for (i = 0 ; ch[i] != 0 ; i++) {
        printf("%d   " , ch[i]) ;
    }
    puts("") ;

what output do you get ? I will guess that you see the bytes
226 150 136 .

-- 
vlaho.ninja/menu

#379592

From	Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date	2023-12-09 17:59 +0100
Message-ID	<ul26dl$29c3i$1@dont-email.me>
In reply to	#379590

Thanks Richard, jak, and Spiros, for your explanations!

Some comments on the net about building wrappers around libraries,
and whatnot, irritated me.

In my initial tries I got confused about the error/warning message;
I had omitted the 'L' prefix for the character literal definition.
So that hint helped to get some assurance here.

On 09.12.2023 16:12, Spiros Bousbouras wrote:
> 
> My own approach would be to do as much as possible in my own code.

Same here.

If possible, I want to avoid external libraries, unnecessary
dependencies, and language constructs that are not guaranteed to
work reliably or that are non-portable, and I like simplicity and
transparency.

> A lot
> depends on whether you need to pass your own characters (of whatever type) to
> some external library which expects a specific type like   wchar_t  or not.
> There are many different scenarios so I will cover what would be most likely
> to occur in my own code.

My requirements are quite trivial and there's no exchange of data
between systems, processes, or applications. It's only data to be
displayed at the local screen.

> 
> - No external library involved.
> - Output encoded in UTF-8
> - The text editor I use to write the code stores everything as UTF-8.
> 
> With the above assumptions I would simply use ordinary C strings and put
> UTF-8 in them like  "ΑΒΓΔΕΖΗΘ..."  and output them in the ordinary way.

> It's not guaranteed to work but it most likely will.

That exactly was my uncertainty.

> [...]
> 
>> And ideally the character should not be copy/pasted into the code
>> but given by some standard representation like '\u2588' (or so).
> 
> Why is that ? It seems to me that it makes the code harder to understand.

I'm not encoding non-latin texts (like your Greek example above).

In my case the characters are just "graphical candy", so it's not
important to "read" them; a comment behind the \u encoding appears
to me to be sufficient.

It may also be a habit to have a program coded as ASCII source;
during my first decades of programming there were no languages
that I used that supported anything else than ASCII (or EBCDIC,
or even less, like 6-bit character sets, in some cases [CDC]).

This way (so my assumption goes) fewer things can go wrong. I also
never programmed in languages where the program could be written in
one's native (non-English) language by using Unicode or UTF-8
encoding. I think I had the possibility in Java (but those days were
nothing but an episode as seen from today).

> 
> What works reliably depends a lot on what you're trying to do. Unicode in
> general is messy.

Yeah, that's why I want to keep it as simple as possible; but it
should of course work reliably.

> 
>> I have experimented and observed that working with strings at least
>> *seems* to work:  char * ch = "\u2588";  printf ("%s\n", ch);
>> Is that an acceptable/reliable and the usual way in C to tackle the
>> issue?
> 
> If you do
> 
>     char * ch = "\u2588" 
>     size_t i ;
>     for (i = 0 ; ch[i] != 0 ; i++) {
>         printf("%d   " , ch[i]) ;
>     }
>     puts("") ;
> 
> what output do you get ? I will guess that you see the bytes
> 226 150 136 .

Almost. I get the complementary values:   -30   -106   -120

But why are you asking? - To show that "\u2588" is internally
represented by a [UTF-8] code sequence? - Ideally the interface
should not make me care about internal representations. :-)

The explanations and hints were all helpful - thanks again!

Janis

#379593

From	Spiros Bousbouras <spibou@gmail.com>
Date	2023-12-09 17:19 +0000
Message-ID	<5zg=MX9oDHkwG45=8@bongo-ra.co>
In reply to	#379592

On Sat, 9 Dec 2023 17:59:32 +0100
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

> On 09.12.2023 16:12, Spiros Bousbouras wrote:
> > Why is that ? It seems to me that it makes the code harder to understand.
> 
> I'm not encoding non-latin texts (like your Greek example above).
> 
> In my case the characters are just "graphical candy", so it's not
> important to "read" them; a comment behind the \u encoding appears
> to me to be sufficient.

Well , it's your code. If it is some kind of block characters based
"art" then it may even be more important to be able to see it in the
source.

> It may also be a habit to have a program coded as ASCII source;
> during my first decades of programming there were no languages
> that I used that supported anything else than ASCII (or EBCDIC,
> or even less, like 6-bit character sets, in some cases [CDC]).
> 
> This way (so my assumption goes) also less things will possibly
> go wrong. I also never programmed in languages where the program
> could be written in ones native (non-English) language by using
> Unicode or UTF-8 encoding. I think I had the possibility in Java
> (but these days were nothing but an episode as seen from today).

In this age , it is probably an unnecessarily restrictive habit. If
anything , you *should* try to go beyond ASCII whenever it would be
useful so that you will get to see what works and what doesn't. I
think you will find that a lot just works.

> > What works reliably depends a lot on what you're trying to do. Unicode in
> > general is messy.
> 
> Yeah, that's why I want to keep it as simple as possible; but it
> should of course work reliably.
> 
> > 
> >> I have experimented and observed that working with strings at least
> >> *seems* to work:  char * ch = "\u2588";  printf ("%s\n", ch);
> >> Is that an acceptable/reliable and the usual way in C to tackle the
> >> issue?
> > 
> > If you do
> > 
> >     char * ch = "\u2588" 
> >     size_t i ;
> >     for (i = 0 ; ch[i] != 0 ; i++) {
> >         printf("%d   " , ch[i]) ;
> >     }
> >     puts("") ;
> > 
> > what output do you get ? I will guess that you see the bytes
> > 226 150 136 .
> 
> Almost. I get the complementary values:   -30   -106   -120

Ahh yes , of course , my mistake. By the way , that's one of the things which
is not guaranteed by the standard to work. If  char  has the range from -128
to 127 then converting a value  >= 128  to  char  falls under the rule that

    either the result is implementation-defined or an implementation-defined
    signal is raised.

But almost certainly it will work.
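
To see the octets as 226 150 136 whether  char  is signed or not , one
option is to go through  unsigned char  when reading each byte. A small
sketch ; byte_values is a made-up helper name and the expected numbers
assume 8-bit bytes :

```c
#include <assert.h>
#include <stddef.h>

/* Store the bytes of s in out[] as values in 0..UCHAR_MAX, writing at
   most cap entries. Returns the number of entries stored. */
size_t byte_values(const char *s, unsigned out[], size_t cap)
{
    size_t n = 0;

    assert(s != NULL && out != NULL);
    while (n < cap && s[n] != '\0') {
        out[n] = (unsigned char)s[n];   /* -30 becomes 226, and so on */
        n++;
    }
    return n;
}
```

Each out[i] can then be printed with  %u .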

> But why are you asking? - To show that "\u2588" is internally
> represented by a [UTF-8] code sequence?

Yes. If it is (which seems to be in your case) , that's a good sign
that you can keep things simple and avoid conversions and wide
characters.

> Ideally the interface
> should not make me care about internal representations. :-)

-- 
My sister, also a conductor, once explained to the board of one of her
orchestras why she wouldn't let them play Mozart in her first season;
"Mozart" she said, "is the string bikini of composers, and I just think that
we, as an orchestra, don't have the body to pull it off yet."
  https://kennethwoods.net/blog1/2012/06/25/which-would-you-rather-conduct-or-joining-the-mozart-protection-society/

#379595

From	Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date	2023-12-09 18:43 +0100
Message-ID	<ul290e$29osl$1@dont-email.me>
In reply to	#379593

On 09.12.2023 18:19, Spiros Bousbouras wrote:
> On Sat, 9 Dec 2023 17:59:32 +0100
> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>>
>> In my case the characters are just "graphical candy", so it's not
>> important to "read" them; a comment behind the \u encoding appears
>> to me to be sufficient.
> 
> Well , it's your code. If it is some kind of block characters based
> "art" then it may even be more important to be able to see it in the
> source.

I actually do have them visibly in my code; but non-functional,
as a comment. That way I have both, the [functional] safety and
the "readability". (And I don't mind the redundancy here.)

BTW, I also had situations the other way round, where I encode
programmatically characters and add comments with their values
(in decimal, hex, or binary, as it fits best for the purpose).
As an example, I had a case with similar or even equal glyphs,
and I wanted to have them specified exactly. A copy/paste from
some Web resource would, in my book, not have been good enough
for specification purposes; you couldn't tell them apart.

Janis

#379594

From	Spiros Bousbouras <spibou@gmail.com>
Date	2023-12-09 17:40 +0000
Message-ID	<3wItOAziDrtvid93G@bongo-ra.co>
In reply to	#379590

On Sat, 9 Dec 2023 15:12:55 -0000 (UTC)
Spiros Bousbouras <spibou@gmail.com> wrote:
> If I want to use directly unicode codepoints I will encode them as
> unsigned long  which is guaranteed to be wide enough to cover the whole
> range of codepoints values ; in contrast , it is conforming for  wchar_t  
> to cover no greater range than  char .Converting from codepoints to UTF-8
> is an easy and pleasant exercise. So I may have
> 
> typedef unsigned long codepoint ;
> codepoint my_wide_string = { \x2588 , ... } ;

codepoint my_wide_string[...] = { 0x2588 , ... } ;

> Then convert from that to UTF-8 and output the UTF-8 octets.

#379597

From	Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date	2023-12-09 13:46 -0800
Message-ID	<877clnxd5o.fsf@nosuchdomain.example.com>
In reply to	#379590

Spiros Bousbouras <spibou@gmail.com> writes:
[...]
> If I want to use directly unicode codepoints I will encode them as
> unsigned long  which is guaranteed to be wide enough to cover the whole
> range of codepoints values ; in contrast , it is conforming for  wchar_t  
> to cover no greater range than  char.
[...]

The C standard requires wchar_t to be: "an integer type whose range of
values can represent distinct codes for all members of the largest
extended character set specified among the supported locales".

Yes, it's conforming for wchar_t to cover a range no wider than char,
but only if the implementation has no extended character sets wider than
char.

On Linux-based systems, wchar_t is typically 32 bits, more than enough
to cover all Unicode codepoints.  On Windows, however, wchar_t is
generally only 16 bits, which (I think) is non-conforming.

(Microsoft started to support Unicode when the standard specified only
up to 2**16 codepoints, so UCS-2 was sufficient.  When Unicode expanded
beyond the Basic Multilingual Plane, Microsoft handled it by supporting
UTF-16, a variable-length encoding composed of 16-bit characters.
Inertia made it too difficult to expand wchar_t from 16 to 32 bits.)
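
To make the variable-length part concrete, here is a rough sketch of how
a code point maps to UTF-16 code units.  The function name utf16_encode
is invented for this example; it is not a Windows or standard C API.

```c
#include <assert.h>
#include <stddef.h>

/* Encode cp as UTF-16 into out (room for 2 code units).
   Returns 1 for a BMP character, 2 for a surrogate pair, or 0 if cp is
   itself a surrogate or lies beyond U+10FFFF. */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    assert(out != NULL);
    if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;                        /* lone surrogate: invalid */
        out[0] = (unsigned short)cp;
        return 1;
    }
    if (cp > 0x10FFFFUL)
        return 0;
    cp -= 0x10000UL;                         /* 20 bits remain */
    out[0] = (unsigned short)(0xD800UL | (cp >> 10));     /* high half */
    out[1] = (unsigned short)(0xDC00UL | (cp & 0x3FFUL)); /* low half */
    return 2;
}
```

U+2588 fits in one 16-bit unit, but anything above U+FFFF needs two, so
a single 16-bit wchar_t cannot hold such a character.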

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

#379598

From	spender <spender@yeah.net>
Date	2023-12-13 11:05 +0800
Message-ID	<ulb729$3t0bp$1@dont-email.me>
In reply to	#379587

With printf("%c",ch), ch must be < 0xFF, i.e. < 255.

In C, a character must be a character from the ASCII table, i.e. <
(int)255. A string is a collection of characters.


On 2023/12/9 15:04, Janis Papanagnou wrote:
> After decades I'm again writing some C code and intended to use some
> Unicode characters for output.  I'm using C99.  I have two questions.
> 
> I am able to inline the character in the code like:  printf ("█\n");
> 
> But I also want to make it a printf argument:  printf ("%c\n", '█');
> which doesn't work (at least not in the depicted way).
> 
> And I want to declare such characters, like:  char ch = '█';
> which also doesn't work, and neither does:  wchar_t ch = '█';
> And ideally the character should not be copy/pasted into the code
> but given by some standard representation like '\u2588' (or so).
> 
> Without giving all the gory details about the "problems of Unicode",
> are there practical answers to those questions that "simply work"
> and reliably?
> 
> I have experimented and observed that working with strings at least
> *seems* to work:  char * ch = "\u2588";  printf ("%s\n", ch);
> Is that an acceptable/reliable and the usual way in C to tackle the
> issue?
> 
> Thanks.
> 
> Janis

#379600

From	Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date	2023-12-13 04:24 +0100
Message-ID	<ulb859$12gh$1@dont-email.me>
In reply to	#379598

> On 2023/12/9 15:04, Janis Papanagnou wrote:
>> [...] intended to use some Unicode characters for output.  [...]

On 13.12.2023 04:05, spender wrote:
> printf("%c",ch), the ch must <0xFF, <255

The question was about the output of multi-octet Unicode characters,
it was not about single octet characters.

Though the question has also already been addressed by the other
replies, so don't bother.

> 
> In c lang, The character must be a character of an ASCII table,
> i.e. < (int)255. A string is a collection of characters.

(Note, ASCII is 7 bit.) In the C language ordinary single-octet
characters may have values of -128..+127 or 0..255, depending on
whether the char type is defined as signed or unsigned.

And you can also output Unicode characters as has been shown in
this thread.

Janis

#379601

From	Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date	2023-12-12 19:28 -0800
Message-ID	<87wmtiwzkc.fsf@nosuchdomain.example.com>
In reply to	#379598

spender <spender@yeah.net> writes:
> printf("%c",ch), the ch must <0xFF, <255
>
> In c lang, The character must be a character of an ASCII table, i.e. <
> (int)255. A string is a collection of characters.
[...]

Not exactly.

C doesn't require ASCII; there are implementations that use EBCDIC, for
example.

The argument corresponding to a "%c" format specifier is of type int,
and is converted to unsigned char.  Conversion to unsigned char is well
defined for values outside the range of unsigned char (the value wraps
around), which can be useful if the argument is a negative char value
promoted to int.

Typically UCHAR_MAX is 255, so the value after conversion will be >= 0
and <= 255 (note "<=", not "<").  Exotic implementations might have
UCHAR_MAX > 255, but such implementations are typically freestanding,
and therefore needn't support printf.
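
A small sketch of that conversion, assuming the typical UCHAR_MAX of 255
for the concrete numbers.  The helper names are made up for this
example.

```c
#include <assert.h>
#include <limits.h>

/* Value that printf("%c", arg) writes: arg converted to unsigned char,
   i.e. reduced modulo UCHAR_MAX + 1. */
unsigned char percent_c_value(int arg)
{
    return (unsigned char)arg;
}

/* Self-check of the wraparound at both ends of the range. */
void check_wraparound(void)
{
    assert(percent_c_value(-1) == UCHAR_MAX);
    assert(percent_c_value(UCHAR_MAX + 1) == 0);
}
```

With UCHAR_MAX == 255, percent_c_value(-30) yields 226, and passing
0x2588 yields 0x88: only the low-order byte survives, which is why "%c"
cannot print U+2588.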

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

#379605

From	James Kuyper <jameskuyper@alumni.caltech.edu>
Date	2023-12-13 00:40 -0500
Message-ID	<ulbg3i$1m9v$1@dont-email.me>
In reply to	#379598

On 12/12/23 22:05, spender wrote:
> printf("%c",ch), the ch must <0xFF, <255

The only 'ch' in the code that you responded to was declared as "char
*", not char, and that value was used with a "%s" format specifier, for
which char* is the appropriate type.
*ch has char type, and as such must have a value between CHAR_MIN and
CHAR_MAX. If char is signed, CHAR_MIN == SCHAR_MIN, and SCHAR_MIN <=
-128. If char is unsigned, CHAR_MAX == UCHAR_MAX, and UCHAR_MAX >= 255.
Those are inequalities, not equalities, because 8 is the minimum value
for CHAR_BIT, rather than the only permitted value, and there are
real-world systems with other sizes (not many, to be fair), with
CHAR_BIT==16 being the most common alternative.

When ch is passed to printf(), it gets converted to unsigned char. The
maximum resulting value is UCHAR_MAX, which, as noted above, is allowed
to be >255.

> In c lang, The character must be a character of an ASCII table, i.e. <

There is no such requirement. The standard explicitly describes the
encoding recognized by C standard library functions such as printf() as
implementation-defined and locale-dependent, and describes it as a
multibyte encoding, though MB_CUR_MAX and MB_LEN_MAX are both allowed to
equal 1.

On most Unix-like platforms, the default encoding is UTF-8. For
characters that can be represented in a single byte, that is equivalent
to 7-bit ASCII, not 8-bit, so the maximum is 127, not 255. There are
also a number of other encodings still in use, such as EBCDIC.

The standard only mentions ASCII twice, both times in non-normative
footnotes:
"17) The trigraph sequences enable the input of characters that are not
defined in the Invariant Code Set as described in ISO/IEC 646, which is
a subset of the seven-bit US ASCII code set."

In footnote 215 it mentions 7-bit ASCII as an example, not as something
that is mandated.

#380474

From	Tim Rentsch <tr.17687@z991.linuxsc.com>
Date	2024-01-19 07:43 -0800
Message-ID	<86frytjplg.fsf@linuxsc.com>
In reply to	#379605

James Kuyper <jameskuyper@alumni.caltech.edu> writes:

> On 12/12/23 22:05, spender wrote:
>
>> printf("%c",ch), the ch must <0xFF, <255
>
> The only 'ch' in the code that you responded to was declared as
> "char *", not char, [...]

The posting in question also gave declarations

    char ch = [...];

and

    wchar_t ch = [...];

#379607

From	Lew Pitcher <lew.pitcher@digitalfreehold.ca>
Date	2023-12-13 14:56 +0000
Message-ID	<ulcgm5$sopg$1@dont-email.me>
In reply to	#379598

On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:

> printf("%c",ch), the ch must <0xFF, <255

Not quite.
1) ch /must/ represent an integer value.
2) ch /should/ represent a C char value. Note that a C char /is not/
   defined as an 8-bit unsigned quantity, but as a CHAR_BIT quantity,
   with implementation-defined sign, where CHAR_BIT is /at least/
   8 bits. printf() will happily /mis-interpret/ any other integer
   for you, when given the '%c' format specifier.

> In c lang, The character must be a character of an ASCII table, i.e. <
> (int)255. A string is a collection of characters.

Nonsense.

1) The C language does /not/ specify the representation
   of char, other than its size in bits and whether or not it carries
   a sign. The C language has been implemented in EBCDIC environments
   (for instance), which is not even close to ASCII.

2) ASCII is a 7-bit encoding scheme; all valid ASCII values exist between
   0 and 127. /Some software/ extends ASCII to 8 bits, with the high-order
   bit either extending the character set, or representing some
   meta-characteristic (such as parity or sign).

-- 
Lew Pitcher
"In Skills We Trust"

#379635

From	Tim Rentsch <tr.17687@z991.linuxsc.com>
Date	2023-12-25 02:03 -0800
Message-ID	<86wmt2tx80.fsf@linuxsc.com>
In reply to	#379607

Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:

> On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:
>
>> printf("%c",ch), the ch must <0xFF, <255
>
> Not quite.
> 1) ch /must/ represent an integer value.

More specifically, it must have a type that is or promotes
to int, or a type that is or promotes to unsigned int, with
a value that is in the common range of int and unsigned int.

> 2) ch /should/ represent a C char value.  Note that a C char /is not/
>    defined as an 8-bit unsigned quantity, but as a CHAR_BIT quantity,
>    with implementation-defined sign, where CHAR_BIT is /at least/
>    8 bits.  [...]

This part isn't exactly right.  Any value in the range of char
is okay.  However, any value in the range of unsigned char is
also okay.  The type 'int' for the argument is meant to include
values returned by, for example, getchar(), and such functions
always return non-negative values (not counting EOF).  The rules
for character input/output functions generally convert characters
to unsigned char, and such values are meant to be admissible as
arguments for a %c conversion specifier.

#379639

From	Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date	2023-12-25 14:43 -0800
Message-ID	<87bkadx5s6.fsf@nosuchdomain.example.com>
In reply to	#379635

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
> Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
>
>> On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:
>>
>>> printf("%c",ch), the ch must <0xFF, <255
>>
>> Not quite.
>> 1) ch /must/ represent an integer value.
>
> More specifically, it must have a type that is or promotes
> to int, or a type that is or promotes to unsigned int, with
> a value that is in the common range of int and unsigned int.

Not quite.  "If no l length modifier is present, the int argument is
converted to an unsigned char, and the resulting character is written."
For example printf("%c", -193) is equivalent to printf("%c", 63), which
assuming an ASCII-based character set will print '?'.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

#380541

From	Tim Rentsch <tr.17687@z991.linuxsc.com>
Date	2024-01-20 09:33 -0800
Message-ID	<8634urkiyx.fsf@linuxsc.com>
In reply to	#379639

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>
>> Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:
>>
>>> On Wed, 13 Dec 2023 11:05:45 +0800, spender wrote:
>>>
>>>> printf("%c",ch), the ch must <0xFF, <255
>>>
>>> Not quite.
>>> 1) ch /must/ represent an integer value.
>>
>> More specifically, it must have a type that is or promotes
>> to int, or a type that is or promotes to unsigned int, with
>> a value that is in the common range of int and unsigned int.
>
> Not quite.  "If no l length modifier is present, the int argument
> is converted to an unsigned char, and the resulting character is
> written."  For example printf("%c", -193) is equivalent to
> printf("%c", 63), which assuming an ASCII-based character set will
> print '?'.

The rule for arguments to printf() is the same as the rule for
accessing variadic arguments using va_arg().  That has always
been true, although not expressed clearly in early versions of
the C standard.  Fortunately that shortcoming is addressed in
the upcoming C23 (is it still not yet ratified?):  in N3096,
paragraph 9 in section 7.23.6.1 says in part

    fprintf shall behave as if it uses va_arg with a type
    argument naming the type resulting from applying the
    default argument promotions to the type corresponding
    to the conversion specification [...]

and the rule for va_arg (in 7.16.1.1 p2) says in part 

    one type is a signed integer type, the other type is
    the corresponding unsigned integer type, and the value
    is representable in both types

So supplying an unsigned int argument is okay, provided of
course the value is in the range of values of signed int.
