

Groups > comp.lang.c > #394598 > unrolled thread

u8"" c11 c23

Started by: Thiago Adams <thiago.adams@gmail.com>
First post: 2025-10-20 15:35 -0300
Last post: 2025-12-16 14:59 -0600
Articles: 16, 5 participants



Contents

  u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-20 15:35 -0300
    Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-10-20 15:19 -0700
      Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-21 07:57 -0300
        Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-10-21 10:26 -0700
          Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-21 15:04 -0300
            Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-10-21 11:51 -0700
              Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-21 16:17 -0300
      Re: u8"" c11 c23 Tim Rentsch <tr.17687@z991.linuxsc.com> - 2025-12-15 11:13 -0800
        Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-15 14:27 -0800
          Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-12-16 07:57 -0300
            Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-16 04:17 -0800
          Re: u8"" c11 c23 Tim Rentsch <tr.17687@z991.linuxsc.com> - 2025-12-21 22:37 -0800
    Re: u8"" c11 c23 Bonita Montero <Bonita.Montero@gmail.com> - 2025-10-21 10:35 +0200
      Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-21 07:07 -0300
        Re: u8"" c11 c23 Bonita Montero <Bonita.Montero@gmail.com> - 2025-10-21 12:09 +0200
    Re: u8"" c11 c23 BGB <cr88192@gmail.com> - 2025-12-16 14:59 -0600

#394598 — u8"" c11 c23

From: Thiago Adams <thiago.adams@gmail.com>
Date: 2025-10-20 15:35 -0300
Subject: u8"" c11 c23
Message-ID: <10d5vck$3kufd$1@dont-email.me>

Speaking of signed vs. unsigned:

In C11, u8"a" had the type char [N]. Normally char is signed.

In C23 it is char8_t [N], where char8_t is an unsigned type.

When converting code from C11 to C23, we get an error here:
const char* s = u8"";

I generally cast "char*" to "unsigned char*" when handling something
with UTF-8. I don't use u8""; I just use " " with UTF-8 encoded source
code and assume const char* is UTF-8.
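
A minimal sketch of the cast described above, assuming the input is a NUL-terminated string that is already valid UTF-8. The name utf8_count is hypothetical and only illustrates why the bytes are read through unsigned char: where plain char is signed, the bytes 0x80..0xFF of a multi-byte sequence would otherwise show up as negative values.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string.
   Assumes the input is valid UTF-8; no validation is done here. */
size_t utf8_count(const char *s)
{
    /* Read the bytes as unsigned char so values 0x80..0xFF stay positive
       even where plain char is signed. */
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    for (; *p != 0; ++p) {
        /* Continuation bytes look like 10xxxxxx; every other byte
           starts a new code point. */
        if ((*p & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For example, utf8_count("a\xC3\xA9") sees three bytes but counts two code points.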



#394603

From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date: 2025-10-20 15:19 -0700
Message-ID: <875xc9p674.fsf@example.invalid>
In reply to: #394598

Thiago Adams <thiago.adams@gmail.com> writes:
> speaking on signed x unsigned,
>
> u8"a"  in C11 had the type char [N]. Normally char is signed

I would have said "commonly" rather than "normally".  Not an
important point.

> in C23 it is unsigned char8_t  [N].
>
> when converting code from c11 to c23 we have a error here
> const char* s = u8""
>
>
> I generally "cast char* " to "unsigned char*" when handling something
> with utf8. I am not u8"" , I use just " " with utf8 encoded source
> code and I just assume const char* is utf8.

That raises another issue.

The <uchar.h> header was introduced in C11.  In C11 and C17,
that header defines char16_t and char32_t.  C23 introduces char8_t.

There doesn't seem to be any way, other than checking the value of
__STDC_VERSION__, to determine whether char8_t is defined or not.  There
are no *_MIN or *_MAX macros for these types, either in <uchar.h> or in
<limits.h>.  A test program I just wrote would have been a little
simpler if I could have used `#ifdef CHAR8_MAX`.

Here's the test program :

#include <stdio.h>
#include <uchar.h>

#define TYPEOF(x) \
    (_Generic(x, \
        char: "char", \
        signed char: "signed char", \
        unsigned char: "unsigned char", \
        short: "short", \
        unsigned short: "unsigned short", \
        int: "int", \
        unsigned int: "unsigned int", \
        long: "long", \
        unsigned long: "unsigned long", \
        long long: "long long", \
        unsigned long long: "unsigned long long"))

int main(void) {
    printf("__STDC_VERSION__ = %ldL\n", __STDC_VERSION__);
    printf("u8\"a\"[0] is of type %s\n",
           TYPEOF(u8"a"[0]));
#if __STDC_VERSION__ >= 202311L
    printf("char8_t is %s\n", TYPEOF((char8_t)0));
#endif
    printf("char16_t is %s\n", TYPEOF((char16_t)0));
    printf("char32_t is %s\n", TYPEOF((char32_t)0));
}

Its output with `gcc -std=c17` :

__STDC_VERSION__ = 201710L
u8"a"[0] is of type char
char16_t is unsigned short
char32_t is unsigned int

Its output with `gcc -std=c23` :

__STDC_VERSION__ = 202311L
u8"a"[0] is of type unsigned char
char8_t is unsigned char
char16_t is unsigned short
char32_t is unsigned int

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */



#394630

From: Thiago Adams <thiago.adams@gmail.com>
Date: 2025-10-21 07:57 -0300
Message-ID: <10d7ouh$3rq3g$1@dont-email.me>
In reply to: #394603

On 10/20/2025 7:19 PM, Keith Thompson wrote:
> Thiago Adams <thiago.adams@gmail.com> writes:
>> speaking on signed x unsigned,
>>
>> u8"a"  in C11 had the type char [N]. Normally char is signed
> 
> I would have said "commonly" rather than "normally".  Not an
> important point.
> 
>> in C23 it is unsigned char8_t  [N].
>>
>> when converting code from c11 to c23 we have a error here
>> const char* s = u8""
>>
>>
>> I generally "cast char* " to "unsigned char*" when handling something
>> with utf8. I am not u8"" , I use just " " with utf8 encoded source
>> code and I just assume const char* is utf8.
> 
> That raises another issue.
> 
> The <uchar.h> header was introduced in C99.  In C99, C11, and C17,
> that header defines char16_t and char32_t.  C23 introduces char8_t.
> 

I think all these typedefs related to language concepts, like
size_t (related to sizeof), char8_t (related to u8""),
char16_t (u""), char32_t (U""), etc., should be built-in typedefs.

And even others that do not have an association with language features,
like int16_t.





#394633

From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date: 2025-10-21 10:26 -0700
Message-ID: <87o6q0np3b.fsf@example.invalid>
In reply to: #394630

Thiago Adams <thiago.adams@gmail.com> writes:
> On 10/20/2025 7:19 PM, Keith Thompson wrote:
[...]
>> That raises another issue.
>> The <uchar.h> header was introduced in C99.  In C99, C11, and C17,
>> that header defines char16_t and char32_t.  C23 introduces char8_t.
>
> I think for all these typedefs related with language concepts, like
> size_t which is related with sizeof, char8_t which is related with
> u8"" char16_t u"", char32_t  U""... etc.. should be built-in typedefs.
>
> And even others that does not have a association with language
> features like int16_t.

By "built-in typedefs", do you mean typedefs that are visible without
a #include?

That would be unprecedented, but I suppose it could work.  But I'm not
sure it would be all that advantageous.  The type of the result of
sizeof is some implementation-defined unsigned integer type.  The
<stddef.h> header merely provides a consistent name for that type.

I can see that having language features depend (indirectly) on types
defined in library headers is a bit messy, but I don't think it causes
any real problems.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */



#394634

From: Thiago Adams <thiago.adams@gmail.com>
Date: 2025-10-21 15:04 -0300
Message-ID: <10d8hv0$3rg4$1@dont-email.me>
In reply to: #394633

On 10/21/2025 2:26 PM, Keith Thompson wrote:
> Thiago Adams <thiago.adams@gmail.com> writes:
>> On 10/20/2025 7:19 PM, Keith Thompson wrote:
> [...]
>>> That raises another issue.
>>> The <uchar.h> header was introduced in C99.  In C99, C11, and C17,
>>> that header defines char16_t and char32_t.  C23 introduces char8_t.
>>
>> I think for all these typedefs related with language concepts, like
>> size_t which is related with sizeof, char8_t which is related with
>> u8"" char16_t u"", char32_t  U""... etc.. should be built-in typedefs.
>>
>> And even others that does not have a association with language
>> features like int16_t.
> 
> By "built-in typedefs", do you mean typedefs that are visible without
> a #include?
>

yes.

> That would be unprecedented, but I suppose it could work.  But I'm not
> sure it would be all that advantageous.  The type of the result of
> sizeof is some implementation-defined unsigned integer type.  The
> <stddef.h> header merely provides a consistent name for that type.
> 
> I can see that having language features depend (indirectly) on types
> defined in library headers is a bit messy, but I don't think it causes
> any real problems.
> 


It's not really a problem, but it depends on the includes, which in turn 
depend on the preprocessor.

It seems like the language is partially configured through macros and 
typedefs in includes.


Some types that have a direct relation to the language:

     typedef typeof_unqual(sizeof(0)) size_t;
     typedef typeof_unqual(((char*)1)-((char*)0)) ptrdiff_t;
     typedef typeof_unqual(u8' ') char8_t;
     typedef typeof_unqual(u' ') char16_t;
     typedef typeof_unqual(U' ') char32_t;
     typedef typeof_unqual(L' ') wchar_t;
     typedef typeof_unqual(nullptr) nullptr_t;



I think it does not make sense to have to include a file to describe 
size_t because we can use sizeof without having to include anything.




#394640

From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date: 2025-10-21 11:51 -0700
Message-ID: <87jz0onl4z.fsf@example.invalid>
In reply to: #394634

Thiago Adams <thiago.adams@gmail.com> writes:
> On 10/21/2025 2:26 PM, Keith Thompson wrote:
>> Thiago Adams <thiago.adams@gmail.com> writes:
>>> On 10/20/2025 7:19 PM, Keith Thompson wrote:
>> [...]
>>>> That raises another issue.
>>>> The <uchar.h> header was introduced in C99.  In C99, C11, and C17,
>>>> that header defines char16_t and char32_t.  C23 introduces char8_t.
>>>
>>> I think for all these typedefs related with language concepts, like
>>> size_t which is related with sizeof, char8_t which is related with
>>> u8"" char16_t u"", char32_t  U""... etc.. should be built-in typedefs.
>>>
>>> And even others that does not have a association with language
>>> features like int16_t.
>> By "built-in typedefs", do you mean typedefs that are visible
>> without
>> a #include?
>>
>
> yes.
>
>> That would be unprecedented, but I suppose it could work.  But I'm not
>> sure it would be all that advantageous.  The type of the result of
>> sizeof is some implementation-defined unsigned integer type.  The
>> <stddef.h> header merely provides a consistent name for that type.
>> I can see that having language features depend (indirectly) on types
>> defined in library headers is a bit messy, but I don't think it causes
>> any real problems.
>> 
>
>
> It's not really a problem, but it depends on the includes, which in
> turn depend on the preprocessor.
>
> It seems like the language is partially configured through macros and
> typedefs in includes.

The way I'd describe it is that the type of a sizeof expression is
chosen by the compiler, and the definition of size_t in <stddef.h>
documents that choice and makes it visible to programmers.
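
A small sketch of that relationship, assuming a C11 or later compiler: _Generic can confirm at compile time that the type the compiler chose for a sizeof expression is the one <stddef.h> names size_t. The function name size_t_matches_sizeof is hypothetical.

```c
#include <stddef.h>

/* The controlling expression of _Generic is not evaluated; only its
   type matters.  Both assertions fail to compile if the names in
   <stddef.h> did not match the compiler's own choices. */
_Static_assert(_Generic(sizeof 0, size_t: 1, default: 0),
               "size_t is not the type of sizeof");
_Static_assert(_Generic((char *)0 - (char *)0, ptrdiff_t: 1, default: 0),
               "ptrdiff_t is not the type of a pointer difference");

/* Hypothetical helper exposing the same check at run time. */
int size_t_matches_sizeof(void)
{
    return _Generic(sizeof 0, size_t: 1, default: 0);
}
```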

> Some types that have direct relation with the language:
>
>     typedef typeof_unqual(sizeof(0)) size_t;
>     typedef typeof_unqual(((char*)1)-((char*)0)) ptrdiff_t;
>     typedef typeof_unqual(u8' ') char8_t;
>     typedef typeof_unqual(u' ') char16_t;
>     typedef typeof_unqual(U' ') char32_t;
>     typedef typeof_unqual(L' ') wchar_t;
>     typedef typeof_unqual(nullptr) nullptr_t;
>
> I think it does not make sense to have to include a file to describe
> size_t because we can use sizeof without having to include anything.

I suppose if I were defining a new language from scratch, I probably
wouldn't have those types defined in library headers.  I might have
made size_t a keyword, for example.

One data point: C++ has wchar_t as a keyword, while C defines it as
a typedef in <stddef.h>.  C++'s wchar_t has the same representation
as one of the other integral types, called its underlying type.
That could have been a nice approach for C, but I'd say it's too
late to fix it, and the benefits aren't worth the cost.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */



#394641

From: Thiago Adams <thiago.adams@gmail.com>
Date: 2025-10-21 16:17 -0300
Message-ID: <10d8m83$5a9a$1@dont-email.me>
In reply to: #394640

On 10/21/2025 3:51 PM, Keith Thompson wrote:
> Thiago Adams <thiago.adams@gmail.com> writes:
>> On 10/21/2025 2:26 PM, Keith Thompson wrote:
>>> Thiago Adams <thiago.adams@gmail.com> writes:
>>>> On 10/20/2025 7:19 PM, Keith Thompson wrote:
>>> [...]
>>>>> That raises another issue.
>>>>> The <uchar.h> header was introduced in C99.  In C99, C11, and C17,
>>>>> that header defines char16_t and char32_t.  C23 introduces char8_t.
>>>>
>>>> I think for all these typedefs related with language concepts, like
>>>> size_t which is related with sizeof, char8_t which is related with
>>>> u8"" char16_t u"", char32_t  U""... etc.. should be built-in typedefs.
>>>>
>>>> And even others that does not have a association with language
>>>> features like int16_t.
>>> By "built-in typedefs", do you mean typedefs that are visible
>>> without
>>> a #include?
>>>
>>
>> yes.
>>
>>> That would be unprecedented, but I suppose it could work.  But I'm not
>>> sure it would be all that advantageous.  The type of the result of
>>> sizeof is some implementation-defined unsigned integer type.  The
>>> <stddef.h> header merely provides a consistent name for that type.
>>> I can see that having language features depend (indirectly) on types
>>> defined in library headers is a bit messy, but I don't think it causes
>>> any real problems.
>>>
>>
>>
>> It's not really a problem, but it depends on the includes, which in
>> turn depend on the preprocessor.
>>
>> It seems like the language is partially configured through macros and
>> typedefs in includes.
> 
> The way I'd describe it is that the type of a sizeof expression is
> chosen by the compiler, and the definition of size_t in <stddef.h>
> documents that choice and makes it visible to programmers.
> 
>> Some types that have direct relation with the language:
>>
>>      typedef typeof_unqual(sizeof(0)) size_t;
>>      typedef typeof_unqual(((char*)1)-((char*)0)) ptrdiff_t;
>>      typedef typeof_unqual(u8' ') char8_t;
>>      typedef typeof_unqual(u' ') char16_t;
>>      typedef typeof_unqual(U' ') char32_t;
>>      typedef typeof_unqual(L' ') wchar_t;
>>      typedef typeof_unqual(nullptr) nullptr_t;
>>
>> I think it does not make sense to have to include a file to describe
>> size_t because we can use sizeof without having to include anything.
> 
> I suppose if I were defining a new language from scratch, I probably
> wouldn't have those types defined in library headers.  I might have
> made size_t a keyword, for example.
> 
> One data point: C++ has wchar_t as a keyword, while C defines it as
> a typedef in <sddef.h>.  C++'s wchar_t has the same representation
> as one of the other integral types, called its underlying type.
> That could have been a nice approach for C, but I'd say it's too
> late to fix it, and the benefits aren't worth the cost.
> 

Yes, I think keywords make sense.  In some ways, all C types are
typedefs for the "real" types.




#395822

From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Date: 2025-12-15 11:13 -0800
Message-ID: <86h5trtv72.fsf@linuxsc.com>
In reply to: #394603

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

> Thiago Adams <thiago.adams@gmail.com> writes:
>
>> speaking on signed x unsigned,
>>
>> u8"a"  in C11 had the type char [N]. Normally char is signed
>
> I would have said "commonly" rather than "normally".  Not an
> important point.
>
>> in C23 it is unsigned char8_t  [N].
>>
>> when converting code from c11 to c23 we have a error here
>> const char* s = u8""
>>
>>
>> I generally "cast char* " to "unsigned char*" when handling
>> something with utf8.  I am not u8"" , I use just " " with utf8
>> encoded source code and I just assume const char* is utf8.
>
> That raises another issue.
>
> The <uchar.h> header was introduced in C99.  In C99, C11, and C17,
> that header defines char16_t and char32_t.  C23 introduces char8_t.
>
> There doesn't seem to be any way, other than checking the value of
> __STDC_VERSION__ to determine whether char8_t is defined or not.
> There are not *_MIN or *_MAX macros for these types, either in
> <uchar.h> or in <limits.h>.  A test program I just wrote would have
> been a little simpler if I could have used `#ifdef CHAR8_MAX`.
>
> Here's the test program :
>
> #include <stdio.h>
> #include <uchar.h>
>
> #define TYPEOF(x) \
>     (_Generic(x, \
>         char:  "char", \
>         signed char:  "signed char", \
>         unsigned char:  "unsigned char", \
>         short:  "short", \
>         unsigned short:  "unsigned short", \
>         int:  "int", \
>         unsigned int:  "unsigned int", \
>         long:  "long", \
>         unsigned long:  "unsigned long", \
>         long long:  "long long", \
>         unsigned long long:  "unsigned long long"))
>
> int main(void) {
>     printf("__STDC_VERSION__ = %ldL\n", __STDC_VERSION__);
>     printf("u8\"a\"[0] is of type %s\n",
>            TYPEOF(u8"a"[0]));
> #if __STDC_VERSION__ >= 202311L
>     printf("char8_t is %s\n", TYPEOF((char8_t)0));
> #endif
>     printf("char16_t is %s\n", TYPEOF((char16_t)0));
>     printf("char32_t is %s\n", TYPEOF((char32_t)0));
> }
>
> Its output with `gcc -std=c17` :
>
> __STDC_VERSION__ = 201710L
> u8"a"[0] is of type char
> char16_t is unsigned short
> char32_t is unsigned int
>
> Its output with `gcc -std=c23` :
>
> __STDC_VERSION__ = 202311L
> u8"a"[0] is of type unsigned char
> char8_t is unsigned char
> char16_t is unsigned short
> char32_t is unsigned int

Since C23 defines char8_t to be the same type as unsigned char,
it seems better to just define it when it isn't there:

    #include <limits.h>

    #if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
    typedef unsigned char char8_t;
    #endif



#395827

From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date: 2025-12-15 14:27 -0800
Message-ID: <87ldj3tm7l.fsf@example.invalid>
In reply to: #395822

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
[...]
>> The <uchar.h> header was introduced in C99.  In C99, C11, and C17,
>> that header defines char16_t and char32_t.  C23 introduces char8_t.
>>
>> There doesn't seem to be any way, other than checking the value of
>> __STDC_VERSION__ to determine whether char8_t is defined or not.
>> There are not *_MIN or *_MAX macros for these types, either in
>> <uchar.h> or in <limits.h>.  A test program I just wrote would have
>> been a little simpler if I could have used `#ifdef CHAR8_MAX`.

[...]

> Since C23 defines char8_t to be the same type as unsigned char,
> it seems better to just define it when it isn't there:
>
>     #include <limits.h>
>
>     #if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
>     typedef unsigned char char8_t;
>     #endif

Yes.  And the test for CHAR_BIT may not be necessary, depending on the
programmer's intent.  char8_t is the same type as unsigned char even if
CHAR_BIT > 8.  Similarly, char16_t and char32_t are the same type as
uint_least16_t and uint_least32_t, respectively.
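
A compile-time sketch of those two identities, assuming a C11 or later compiler that ships <uchar.h>. SAME_TYPE is a hypothetical helper macro, not something the standard provides.

```c
#include <stdint.h>
#include <uchar.h>

/* Hypothetical helper: 1 if T and U are the same type, else 0.
   Relies on _Generic selecting by the type of (T)0. */
#define SAME_TYPE(T, U) _Generic((T)0, U: 1, default: 0)

/* Compilation fails here if <uchar.h> and <stdint.h> disagree. */
_Static_assert(SAME_TYPE(char16_t, uint_least16_t),
               "char16_t is not uint_least16_t");
_Static_assert(SAME_TYPE(char32_t, uint_least32_t),
               "char32_t is not uint_least32_t");
```

Because these are typedefs rather than distinct types, listing both char16_t and uint_least16_t as associations in a single _Generic would be rejected as a duplicate.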

But before C23, u8"a" is a syntax error.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */



#395829

From: Thiago Adams <thiago.adams@gmail.com>
Date: 2025-12-16 07:57 -0300
Message-ID: <10hrdun$ejvt$1@dont-email.me>
In reply to: #395827

On 12/15/2025 7:27 PM, Keith Thompson wrote:
...
> But before C23, u8"a" is a syntax error.
> 

u8"a" was introduced in C11.
u8'a' was introduced in C23.





#395830

From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date: 2025-12-16 04:17 -0800
Message-ID: <87h5tqtycm.fsf@example.invalid>
In reply to: #395829

Thiago Adams <thiago.adams@gmail.com> writes:
> On 12/15/2025 7:27 PM, Keith Thompson wrote:
> ...
>> But before C23, u8"a" is a syntax error.
>
> u8"a" was introduced in C11.
> u8'a' was introduced in C23.

Thank you, I stand corrected.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */



#395875

From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Date: 2025-12-21 22:37 -0800
Message-ID: <86pl87rpic.fsf@linuxsc.com>
In reply to: #395827

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>
>> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
>
> [...]
>
>>> The <uchar.h> header was introduced in C99.  In C99, C11, and C17,
>>> that header defines char16_t and char32_t.  C23 introduces char8_t.
>>>
>>> There doesn't seem to be any way, other than checking the value of
>>> __STDC_VERSION__ to determine whether char8_t is defined or not.
>>> There are not *_MIN or *_MAX macros for these types, either in
>>> <uchar.h> or in <limits.h>.  A test program I just wrote would have
>>> been a little simpler if I could have used `#ifdef CHAR8_MAX`.
>
> [...]
>
>> Since C23 defines char8_t to be the same type as unsigned char,
>> it seems better to just define it when it isn't there:
>>
>>     #include <limits.h>
>>
>>     #if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
>>     typedef unsigned char char8_t;
>>     #endif
>
> Yes.  And the test for CHAR_BIT may not be necessary, depending on
> the programmer's intent.  char8_t is the same type as unsigned char
> even if CHAR_BIT > 8.

That's humorous.  It's like a name designed to be confusing or
misleading.  But thank you for the information, I wouldn't have
guessed it.

> Similarly, char16_t and char32_t are the same type as
> uint_least16_t and uint_least32_t, respectively.

Kind of weird, but at least it's consistent, and it explains why
char8_t is the same as unsigned char.  Then again, why not
uint_least8_t?  Has C23 changed to the point where unsigned char
and uint_least8_t have to be the same type?  My recollection is
that in earlier editions of the C standard it is possible, even
if unlikely, for these types to be distinct.
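
Whether the two types coincide on a given implementation can at least be probed. A minimal sketch, assuming C11 or later; the function name least8_is_unsigned_char is hypothetical, and the result describes only the implementation it is compiled on, not what the standard requires.

```c
#include <stdint.h>

/* Hypothetical probe: returns 1 if this implementation defines
   uint_least8_t as unsigned char, 0 if it is some other type. */
int least8_is_unsigned_char(void)
{
    return _Generic((uint_least8_t)0, unsigned char: 1, default: 0);
}
```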



#394624

From: Bonita Montero <Bonita.Montero@gmail.com>
Date: 2025-10-21 10:35 +0200
Message-ID: <10d7gkt$3phdk$1@raubtier-asyl.eternal-september.org>
In reply to: #394598

On 20.10.2025 at 20:35, Thiago Adams wrote:
> speaking on signed x unsigned,
> 
> u8"a"  in C11 had the type char [N]. Normally char is signed
> 
> in C23 it is unsigned char8_t  [N].
> 
> when converting code from c11 to c23 we have a error here
> const char* s = u8""
> 
> 
> 
> 
> 
> 
> I generally "cast char* " to "unsigned char*" when handling something 
> with utf8. I am not u8"" , I use just " " with utf8 encoded source code
> and I just assume const char*  is utf8.
> 
> 

What is there to discuss? Just cast and that's it.
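
One way the cast could be packaged, as a sketch only: U8 is a hypothetical macro (not part of any standard) that pastes a u8 prefix onto a literal and converts the result to const char*. In C11 the cast is a no-op; in C23 it converts from a pointer to char8_t. The function greeting_length is likewise only for illustration.

```c
#include <string.h>

/* Hypothetical helper macro: a UTF-8 literal usable as const char*
   under both C11 and C23.  Adjacent-literal concatenation gives the
   whole literal the u8 prefix. */
#define U8(s) ((const char *)u8"" s)

/* "ol\u00E1" is "ola" with an acute accent on the a; U+00E1 takes
   two bytes in UTF-8, so the string is four bytes long. */
size_t greeting_length(void)
{
    const char *s = U8("ol\u00E1");
    return strlen(s);
}
```

The cast keeps the initialization compiling, but it also hides the signedness change, so code that compares bytes against values above 0x7F still has to go through unsigned char.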



#394626

From: Thiago Adams <thiago.adams@gmail.com>
Date: 2025-10-21 07:07 -0300
Message-ID: <10d7m1v$3qtfe$1@dont-email.me>
In reply to: #394624

On 21/10/2025 at 05:35, Bonita Montero wrote:
> On 20.10.2025 at 20:35, Thiago Adams wrote:
>> speaking on signed x unsigned,
>>
>> u8"a"  in C11 had the type char [N]. Normally char is signed
>>
>> in C23 it is unsigned char8_t  [N].
>>
>> when converting code from c11 to c23 we have a error here
>> const char* s = u8""
>>
>>
>>
>>
>>
>>
>> I generally "cast char* " to "unsigned char*" when handling something 
>> with utf8. I am not u8"" , I use just " " with utf8 encoded source code
>> and I just assume const char*  is utf8.
>>
>>
> 
> What is there to discuss ? Just cast and that's it.

When converting code from C11 to C23, we get an error here:
const char* s = u8"";

I think it is a big change, of the kind C does not normally make.




#394627

From: Bonita Montero <Bonita.Montero@gmail.com>
Date: 2025-10-21 12:09 +0200
Message-ID: <10d7m44$3quev$1@raubtier-asyl.eternal-september.org>
In reply to: #394626

On 21.10.2025 at 12:07, Thiago Adams wrote:
> On 21/10/2025 at 05:35, Bonita Montero wrote:
>> On 20.10.2025 at 20:35, Thiago Adams wrote:
>>> speaking on signed x unsigned,
>>>
>>> u8"a"  in C11 had the type char [N]. Normally char is signed
>>>
>>> in C23 it is unsigned char8_t  [N].
>>>
>>> when converting code from c11 to c23 we have a error here
>>> const char* s = u8""
>>>
>>>
>>>
>>>
>>>
>>>
>>> I generally "cast char* " to "unsigned char*" when handling something 
>>> with utf8. I am not u8"" , I use just " " with utf8 encoded source code
>>> and I just assume const char*  is utf8.
>>>
>>>
>>
>> What is there to discuss ? Just cast and that's it.
> 
> When converting code from c11 to c23 we have a error here
> const char* s = u8""

No, because the null-terminator doesn't become negative with that.
;-)

> 
> I think it is a big change..the ones C does not normally do.
> 
> 
> 



#395832

From: BGB <cr88192@gmail.com>
Date: 2025-12-16 14:59 -0600
Message-ID: <10hsh6l$30btn$1@dont-email.me>
In reply to: #394598

On 10/20/2025 1:35 PM, Thiago Adams wrote:
> speaking on signed x unsigned,
> 
> u8"a"  in C11 had the type char [N]. Normally char is signed
> 
> in C23 it is unsigned char8_t  [N].
> 
> when converting code from c11 to c23 we have a error here
> const char* s = u8""
> 
> 
> 
> 
> 
> 
> I generally "cast char* " to "unsigned char*" when handling something 
> with utf8. I am not u8"" , I use just " " with utf8 encoded source code
> and I just assume const char*  is utf8.
> 
It may not be so simple, as source-code bytes don't necessarily map 1:1 
with string literal bytes (and are more likely to be translated than 
passed through as-is).

Implicitly, it may depend on the default locale and similar assumed by 
the C compiler.

If the source-code is UTF-8, and the default locale is UTF-8, then OK.

More conservative though is to assume that the default locale's 
character encoding is potentially something like 8859-1 or 1252, which 
will not preserve UTF-8 codepoints if not mapped into an area supported 
by the relevant encoding (so, things may get remapped).

So, you need a UTF-8 string literal or similar to specify that the
string does in fact encode text as UTF-8.
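
A minimal sketch of that guarantee, assuming C11 or later: a u8 literal written with a universal character name is stored as UTF-8 regardless of the source or execution character set. The function name u8_literal_is_utf8 is hypothetical.

```c
#include <string.h>

/* Hypothetical check: returns 1 if the u8 literal for U+00E9 is stored
   as the UTF-8 bytes C3 A9 followed by the terminating NUL. */
int u8_literal_is_utf8(void)
{
    static const unsigned char expected[] = { 0xC3, 0xA9, 0x00 };

    /* const void* accepts the literal in both C11 (array of char)
       and C23 (array of char8_t) without a cast. */
    const void *lit = u8"\u00E9";

    return memcmp(lit, expected, sizeof expected) == 0;
}
```

The same character in an unprefixed "..." literal is encoded in the execution character set, which is the case the paragraphs above warn about.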



In a compiler, one may need to try to detect and deal with text 
encoding, say:
   ASCII text:
     No BOM, limited range of characters
       (0x20..0x7E, 0x09, 0x0D, 0x0A, etc).
   UTF-8:
     Also Includes 80..EF
     Only allow valid codepoint sequences
     May include a BOM
   8859-1 or 1252:
     Includes 80..FF, excludes text which is also valid as UTF-8.
     No BOM.
     Other encodings possible,
       Like 437 / KOI-8 / JIS / etc,
       but far less common than 1252.
       No good way to distinguish them reliably.
   UTF-16 (*1):
     Even number of bytes
     Strongly hinted if even or odd bytes are frequently NUL;
       Frequent even NUL: UTF-16, likely big-endian;
       Frequent odd NUL: UTF-16, likely little-endian;
     Excluded if matching the pattern for one of the above;
       If text is valid ASCII or UTF-8, assume these instead.
     May include a BOM.
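
The first three cases of the list above can be sketched roughly as follows. This is a simplified illustration, not how any particular compiler does it: the names text_kind and classify_text are hypothetical, every byte below 0x80 is accepted as ASCII, overlong forms and surrogates are not rejected, and UTF-16 or specific single-byte code pages are not detected at all.

```c
#include <stddef.h>

enum text_kind { TEXT_ASCII, TEXT_UTF8, TEXT_OTHER };

/* Hypothetical classifier: ASCII if no byte is >= 0x80, UTF-8 if all
   high bytes form structurally valid sequences, otherwise "other"
   (for example 8859-1 or 1252). */
enum text_kind classify_text(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    int saw_high = 0;

    /* A leading EF BB BF is the UTF-8 encoded BOM. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        i = 3;
        saw_high = 1;
    }

    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;   /* number of continuation bytes expected */

        if (b < 0x80) {
            i++;
            continue;
        }
        saw_high = 1;

        if (b >= 0xC2 && b <= 0xDF)
            extra = 1;
        else if (b >= 0xE0 && b <= 0xEF)
            extra = 2;
        else if (b >= 0xF0 && b <= 0xF4)
            extra = 3;
        else
            return TEXT_OTHER;      /* stray continuation or invalid lead */

        if (len - i <= extra)
            return TEXT_OTHER;      /* sequence truncated at end of input */

        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return TEXT_OTHER;  /* not a 10xxxxxx continuation byte */
        }
        i += extra + 1;
    }
    return saw_high ? TEXT_UTF8 : TEXT_ASCII;
}
```

A lone 0xE9 byte (e-acute in 8859-1) followed by ASCII fails the continuation check, which is what lets single-byte text be told apart from UTF-8 in most practical cases.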

*1: More commonly produced by older versions of Visual Studio or 
Notepad, if a non-ASCII codepoint was present. Newer versions tend to 
default to UTF-8 instead.

A compiler may normalize on UTF-8 or similar internally, but this again
doesn't mean it can be assumed for string literals (which are more
likely to be mashed into 1252 or something, such as with a compiler like
MSVC).


Though, that said, it does seem that GCC defaults to assuming UTF-8 if
nothing else is specified. So, UTF-8 => UTF-8 with default string
literals may be workable if one also assumes that the code is always
compiled with GCC or similar.

Though, curiously, it seems newer MSVC will still use UTF-8 with a 
default string literal if the character is given as "\uXXXX", but will 
use a single-byte encoding in other cases.

Checking, newer versions of MSVC are also aware of u8 literals.

...


