Groups > comp.lang.c > #394598 > unrolled thread
| Started by | Thiago Adams <thiago.adams@gmail.com> |
|---|---|
| First post | 2025-10-20 15:35 -0300 |
| Last post | 2025-12-16 14:59 -0600 |
| Articles | 16 — 5 participants |
u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-20 15:35 -0300
Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-10-20 15:19 -0700
Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-21 07:57 -0300
Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-10-21 10:26 -0700
Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-21 15:04 -0300
Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-10-21 11:51 -0700
Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-21 16:17 -0300
Re: u8"" c11 c23 Tim Rentsch <tr.17687@z991.linuxsc.com> - 2025-12-15 11:13 -0800
Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-15 14:27 -0800
Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-12-16 07:57 -0300
Re: u8"" c11 c23 Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2025-12-16 04:17 -0800
Re: u8"" c11 c23 Tim Rentsch <tr.17687@z991.linuxsc.com> - 2025-12-21 22:37 -0800
Re: u8"" c11 c23 Bonita Montero <Bonita.Montero@gmail.com> - 2025-10-21 10:35 +0200
Re: u8"" c11 c23 Thiago Adams <thiago.adams@gmail.com> - 2025-10-21 07:07 -0300
Re: u8"" c11 c23 Bonita Montero <Bonita.Montero@gmail.com> - 2025-10-21 12:09 +0200
Re: u8"" c11 c23 BGB <cr88192@gmail.com> - 2025-12-16 14:59 -0600
| From | Thiago Adams <thiago.adams@gmail.com> |
|---|---|
| Date | 2025-10-20 15:35 -0300 |
| Subject | u8"" c11 c23 |
| Message-ID | <10d5vck$3kufd$1@dont-email.me> |
speaking on signed x unsigned,
u8"a" in C11 had the type char [N]. Normally char is signed
in C23 it is unsigned char8_t [N].
when converting code from c11 to c23 we have a error here
const char* s = u8""
I generally "cast char* " to "unsigned char*" when handling something
with utf8. I am not u8"" , I use just " " with utf8 encoded source
code and I just assume const char* is utf8.
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2025-10-20 15:19 -0700 |
| Message-ID | <875xc9p674.fsf@example.invalid> |
| In reply to | #394598 |
Thiago Adams <thiago.adams@gmail.com> writes:
> speaking on signed x unsigned,
>
> u8"a" in C11 had the type char [N]. Normally char is signed
I would have said "commonly" rather than "normally". Not an
important point.
> in C23 it is unsigned char8_t [N].
>
> when converting code from c11 to c23 we have a error here
> const char* s = u8""
>
>
> I generally "cast char* " to "unsigned char*" when handling something
> with utf8. I am not u8"" , I use just " " with utf8 encoded source
> code and I just assume const char* is utf8.
That raises another issue.
The <uchar.h> header was introduced in C11. In C11 and C17,
that header defines char16_t and char32_t. C23 introduces char8_t.
There doesn't seem to be any way, other than checking the value of
__STDC_VERSION__ to determine whether char8_t is defined or not. There
are not *_MIN or *_MAX macros for these types, either in <uchar.h> or in
<limits.h>. A test program I just wrote would have been a little
simpler if I could have used `#ifdef CHAR8_MAX`.
Here's the test program :
#include <stdio.h>
#include <uchar.h>
#define TYPEOF(x) \
    (_Generic(x, \
        char: "char", \
        signed char: "signed char", \
        unsigned char: "unsigned char", \
        short: "short", \
        unsigned short: "unsigned short", \
        int: "int", \
        unsigned int: "unsigned int", \
        long: "long", \
        unsigned long: "unsigned long", \
        long long: "long long", \
        unsigned long long: "unsigned long long"))
int main(void) {
    printf("__STDC_VERSION__ = %ldL\n", __STDC_VERSION__);
    printf("u8\"a\"[0] is of type %s\n",
           TYPEOF(u8"a"[0]));
#if __STDC_VERSION__ >= 202311L
    printf("char8_t is %s\n", TYPEOF((char8_t)0));
#endif
    printf("char16_t is %s\n", TYPEOF((char16_t)0));
    printf("char32_t is %s\n", TYPEOF((char32_t)0));
}
Its output with `gcc -std=c17` :
__STDC_VERSION__ = 201710L
u8"a"[0] is of type char
char16_t is unsigned short
char32_t is unsigned int
Its output with `gcc -std=c23` :
__STDC_VERSION__ = 202311L
u8"a"[0] is of type unsigned char
char8_t is unsigned char
char16_t is unsigned short
char32_t is unsigned int
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
| From | Thiago Adams <thiago.adams@gmail.com> |
|---|---|
| Date | 2025-10-21 07:57 -0300 |
| Message-ID | <10d7ouh$3rq3g$1@dont-email.me> |
| In reply to | #394603 |
On 10/20/2025 7:19 PM, Keith Thompson wrote:
> Thiago Adams <thiago.adams@gmail.com> writes:
>> speaking on signed x unsigned,
>>
>> u8"a" in C11 had the type char [N]. Normally char is signed
>
> I would have said "commonly" rather than "normally". Not an
> important point.
>
>> in C23 it is unsigned char8_t [N].
>>
>> when converting code from c11 to c23 we have a error here
>> const char* s = u8""
>>
>>
>> I generally "cast char* " to "unsigned char*" when handling something
>> with utf8. I am not u8"" , I use just " " with utf8 encoded source
>> code and I just assume const char* is utf8.
>
> That raises another issue.
>
> The <uchar.h> header was introduced in C11. In C11 and C17,
> that header defines char16_t and char32_t. C23 introduces char8_t.
>
I think for all these typedefs related with language concepts, like
size_t which is related with sizeof, char8_t which is related with
u8"" char16_t u"", char32_t U""... etc.. should be built-in typedefs.
And even others that does not have a association with language
features like int16_t.
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2025-10-21 10:26 -0700 |
| Message-ID | <87o6q0np3b.fsf@example.invalid> |
| In reply to | #394630 |
Thiago Adams <thiago.adams@gmail.com> writes:
> On 10/20/2025 7:19 PM, Keith Thompson wrote:
[...]
>> That raises another issue.
>> The <uchar.h> header was introduced in C11. In C11 and C17,
>> that header defines char16_t and char32_t. C23 introduces char8_t.
>
> I think for all these typedefs related with language concepts, like
> size_t which is related with sizeof, char8_t which is related with
> u8"" char16_t u"", char32_t U""... etc.. should be built-in typedefs.
>
> And even others that does not have a association with language
> features like int16_t.
By "built-in typedefs", do you mean typedefs that are visible without
a #include?
That would be unprecedented, but I suppose it could work. But I'm not
sure it would be all that advantageous. The type of the result of
sizeof is some implementation-defined unsigned integer type. The
<stddef.h> header merely provides a consistent name for that type.
I can see that having language features depend (indirectly) on types
defined in library headers is a bit messy, but I don't think it causes
any real problems.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
| From | Thiago Adams <thiago.adams@gmail.com> |
|---|---|
| Date | 2025-10-21 15:04 -0300 |
| Message-ID | <10d8hv0$3rg4$1@dont-email.me> |
| In reply to | #394633 |
On 10/21/2025 2:26 PM, Keith Thompson wrote:
> Thiago Adams <thiago.adams@gmail.com> writes:
>> On 10/20/2025 7:19 PM, Keith Thompson wrote:
> [...]
>>> That raises another issue.
>>> The <uchar.h> header was introduced in C11. In C11 and C17,
>>> that header defines char16_t and char32_t. C23 introduces char8_t.
>>
>> I think for all these typedefs related with language concepts, like
>> size_t which is related with sizeof, char8_t which is related with
>> u8"" char16_t u"", char32_t U""... etc.. should be built-in typedefs.
>>
>> And even others that does not have a association with language
>> features like int16_t.
>
> By "built-in typedefs", do you mean typedefs that are visible without
> a #include?
>
yes.
> That would be unprecedented, but I suppose it could work. But I'm not
> sure it would be all that advantageous. The type of the result of
> sizeof is some implementation-defined unsigned integer type. The
> <stddef.h> header merely provides a consistent name for that type.
>
> I can see that having language features depend (indirectly) on types
> defined in library headers is a bit messy, but I don't think it causes
> any real problems.
>
It's not really a problem, but it depends on the includes, which in turn
depend on the preprocessor.
It seems like the language is partially configured through macros and
typedefs in includes.
Some types that have direct relation with the language:
typedef typeof_unqual(sizeof(0)) size_t;
typedef typeof_unqual(((char*)1)-((char*)0)) ptrdiff_t;
typedef typeof_unqual(u8' ') char8_t;
typedef typeof_unqual(u' ') char16_t;
typedef typeof_unqual(U' ') char32_t;
typedef typeof_unqual(L' ') wchar_t;
typedef typeof_unqual(nullptr) nullptr_t;
I think it does not make sense to have to include a file to describe
size_t because we can use sizeof without having to include anything.
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2025-10-21 11:51 -0700 |
| Message-ID | <87jz0onl4z.fsf@example.invalid> |
| In reply to | #394634 |
Thiago Adams <thiago.adams@gmail.com> writes:
> On 10/21/2025 2:26 PM, Keith Thompson wrote:
>> Thiago Adams <thiago.adams@gmail.com> writes:
>>> On 10/20/2025 7:19 PM, Keith Thompson wrote:
>> [...]
>>>> That raises another issue.
>>>> The <uchar.h> header was introduced in C11. In C11 and C17,
>>>> that header defines char16_t and char32_t. C23 introduces char8_t.
>>>
>>> I think for all these typedefs related with language concepts, like
>>> size_t which is related with sizeof, char8_t which is related with
>>> u8"" char16_t u"", char32_t U""... etc.. should be built-in typedefs.
>>>
>>> And even others that does not have a association with language
>>> features like int16_t.
>> By "built-in typedefs", do you mean typedefs that are visible
>> without
>> a #include?
>>
>
> yes.
>
>> That would be unprecedented, but I suppose it could work. But I'm not
>> sure it would be all that advantageous. The type of the result of
>> sizeof is some implementation-defined unsigned integer type. The
>> <stddef.h> header merely provides a consistent name for that type.
>> I can see that having language features depend (indirectly) on types
>> defined in library headers is a bit messy, but I don't think it causes
>> any real problems.
>>
>
>
> It's not really a problem, but it depends on the includes, which in
> turn depend on the preprocessor.
>
> It seems like the language is partially configured through macros and
> typedefs in includes.
The way I'd describe it is that the type of a sizeof expression is
chosen by the compiler, and the definition of size_t in <stddef.h>
documents that choice and makes it visible to programmers.
> Some types that have direct relation with the language:
>
> typedef typeof_unqual(sizeof(0)) size_t;
> typedef typeof_unqual(((char*)1)-((char*)0)) ptrdiff_t;
> typedef typeof_unqual(u8' ') char8_t;
> typedef typeof_unqual(u' ') char16_t;
> typedef typeof_unqual(U' ') char32_t;
> typedef typeof_unqual(L' ') wchar_t;
> typedef typeof_unqual(nullptr) nullptr_t;
>
> I think it does not make sense to have to include a file to describe
> size_t because we can use sizeof without having to include anything.
I suppose if I were defining a new language from scratch, I probably
wouldn't have those types defined in library headers. I might have
made size_t a keyword, for example.
One data point: C++ has wchar_t as a keyword, while C defines it as
a typedef in <stddef.h>. C++'s wchar_t has the same representation
as one of the other integral types, called its underlying type.
That could have been a nice approach for C, but I'd say it's too
late to fix it, and the benefits aren't worth the cost.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
| From | Thiago Adams <thiago.adams@gmail.com> |
|---|---|
| Date | 2025-10-21 16:17 -0300 |
| Message-ID | <10d8m83$5a9a$1@dont-email.me> |
| In reply to | #394640 |
On 10/21/2025 3:51 PM, Keith Thompson wrote:
> Thiago Adams <thiago.adams@gmail.com> writes:
>> On 10/21/2025 2:26 PM, Keith Thompson wrote:
>>> Thiago Adams <thiago.adams@gmail.com> writes:
>>>> On 10/20/2025 7:19 PM, Keith Thompson wrote:
>>> [...]
>>>>> That raises another issue.
>>>>> The <uchar.h> header was introduced in C11. In C11 and C17,
>>>>> that header defines char16_t and char32_t. C23 introduces char8_t.
>>>>
>>>> I think for all these typedefs related with language concepts, like
>>>> size_t which is related with sizeof, char8_t which is related with
>>>> u8"" char16_t u"", char32_t U""... etc.. should be built-in typedefs.
>>>>
>>>> And even others that does not have a association with language
>>>> features like int16_t.
>>> By "built-in typedefs", do you mean typedefs that are visible
>>> without
>>> a #include?
>>>
>>
>> yes.
>>
>>> That would be unprecedented, but I suppose it could work. But I'm not
>>> sure it would be all that advantageous. The type of the result of
>>> sizeof is some implementation-defined unsigned integer type. The
>>> <stddef.h> header merely provides a consistent name for that type.
>>> I can see that having language features depend (indirectly) on types
>>> defined in library headers is a bit messy, but I don't think it causes
>>> any real problems.
>>>
>>
>>
>> It's not really a problem, but it depends on the includes, which in
>> turn depend on the preprocessor.
>>
>> It seems like the language is partially configured through macros and
>> typedefs in includes.
>
> The way I'd describe it is that the type of a sizeof expression is
> chosen by the compiler, and the definition of size_t in <stddef.h>
> documents that choice and makes it visible to programmers.
>
>> Some types that have direct relation with the language:
>>
>> typedef typeof_unqual(sizeof(0)) size_t;
>> typedef typeof_unqual(((char*)1)-((char*)0)) ptrdiff_t;
>> typedef typeof_unqual(u8' ') char8_t;
>> typedef typeof_unqual(u' ') char16_t;
>> typedef typeof_unqual(U' ') char32_t;
>> typedef typeof_unqual(L' ') wchar_t;
>> typedef typeof_unqual(nullptr) nullptr_t;
>>
>> I think it does not make sense to have to include a file to describe
>> size_t because we can use sizeof without having to include anything.
>
> I suppose if I were defining a new language from scratch, I probably
> wouldn't have those types defined in library headers. I might have
> made size_t a keyword, for example.
>
> One data point: C++ has wchar_t as a keyword, while C defines it as
> a typedef in <stddef.h>. C++'s wchar_t has the same representation
> as one of the other integral types, called its underlying type.
> That could have been a nice approach for C, but I'd say it's too
> late to fix it, and the benefits aren't worth the cost.
>
yes I think keywords make sense.
In some ways, all C types are typedefs for the "real" types.
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2025-12-15 11:13 -0800 |
| Message-ID | <86h5trtv72.fsf@linuxsc.com> |
| In reply to | #394603 |
Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
> Thiago Adams <thiago.adams@gmail.com> writes:
>
>> speaking on signed x unsigned,
>>
>> u8"a" in C11 had the type char [N]. Normally char is signed
>
> I would have said "commonly" rather than "normally". Not an
> important point.
>
>> in C23 it is unsigned char8_t [N].
>>
>> when converting code from c11 to c23 we have a error here
>> const char* s = u8""
>>
>>
>> I generally "cast char* " to "unsigned char*" when handling
>> something with utf8. I am not u8"" , I use just " " with utf8
>> encoded source code and I just assume const char* is utf8.
>
> That raises another issue.
>
> The <uchar.h> header was introduced in C11. In C11 and C17,
> that header defines char16_t and char32_t. C23 introduces char8_t.
>
> There doesn't seem to be any way, other than checking the value of
> __STDC_VERSION__ to determine whether char8_t is defined or not.
> There are not *_MIN or *_MAX macros for these types, either in
> <uchar.h> or in <limits.h>. A test program I just wrote would have
> been a little simpler if I could have used `#ifdef CHAR8_MAX`.
>
> Here's the test program :
>
> #include <stdio.h>
> #include <uchar.h>
>
> #define TYPEOF(x) \
> (_Generic(x, \
> char: "char", \
> signed char: "signed char", \
> unsigned char: "unsigned char", \
> short: "short", \
> unsigned short: "unsigned short", \
> int: "int", \
> unsigned int: "unsigned int", \
> long: "long", \
> unsigned long: "unsigned long", \
> long long: "long long", \
> unsigned long long: "unsigned long long"))
>
> int main(void) {
> printf("__STDC_VERSION__ = %ldL\n", __STDC_VERSION__);
> printf("u8\"a\"[0] is of type %s\n",
> TYPEOF(u8"a"[0]));
> #if __STDC_VERSION__ >= 202311L
> printf("char8_t is %s\n", TYPEOF((char8_t)0));
> #endif
> printf("char16_t is %s\n", TYPEOF((char16_t)0));
> printf("char32_t is %s\n", TYPEOF((char32_t)0));
> }
>
> Its output with `gcc -std=c17` :
>
> __STDC_VERSION__ = 201710L
> u8"a"[0] is of type char
> char16_t is unsigned short
> char32_t is unsigned int
>
> Its output with `gcc -std=c23` :
>
> __STDC_VERSION__ = 202311L
> u8"a"[0] is of type unsigned char
> char8_t is unsigned char
> char16_t is unsigned short
> char32_t is unsigned int
Since C23 defines char8_t to be the same type as unsigned char,
it seems better to just define it when it isn't there:
#include <limits.h>
#if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
typedef unsigned char char8_t;
#endif
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2025-12-15 14:27 -0800 |
| Message-ID | <87ldj3tm7l.fsf@example.invalid> |
| In reply to | #395822 |
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
[...]
>> The <uchar.h> header was introduced in C11. In C11 and C17,
>> that header defines char16_t and char32_t. C23 introduces char8_t.
>>
>> There doesn't seem to be any way, other than checking the value of
>> __STDC_VERSION__ to determine whether char8_t is defined or not.
>> There are not *_MIN or *_MAX macros for these types, either in
>> <uchar.h> or in <limits.h>. A test program I just wrote would have
>> been a little simpler if I could have used `#ifdef CHAR8_MAX`.
[...]
> Since C23 defines char8_t to be the same type as unsigned char,
> it seems better to just define it when it isn't there:
>
> #include <limits.h>
>
> #if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
> typedef unsigned char char8_t;
> #endif
Yes. And the test for CHAR_BIT may not be necessary, depending on the
programmer's intent. char8_t is the same type as unsigned char even if
CHAR_BIT > 8. Similarly, char16_t and char32_t are the same type as
uint_least16_t and uint_least32_t, respectively.
But before C23, u8"a" is a syntax error.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
| From | Thiago Adams <thiago.adams@gmail.com> |
|---|---|
| Date | 2025-12-16 07:57 -0300 |
| Message-ID | <10hrdun$ejvt$1@dont-email.me> |
| In reply to | #395827 |
On 12/15/2025 7:27 PM, Keith Thompson wrote:
...
> But before C23, u8"a" is a syntax error.
>
u8"a" was introduced in C11.
u8'a' was introduced in C23.
| From | Keith Thompson <Keith.S.Thompson+u@gmail.com> |
|---|---|
| Date | 2025-12-16 04:17 -0800 |
| Message-ID | <87h5tqtycm.fsf@example.invalid> |
| In reply to | #395829 |
Thiago Adams <thiago.adams@gmail.com> writes:
> On 12/15/2025 7:27 PM, Keith Thompson wrote:
> ...
>> But before C23, u8"a" is a syntax error.
>
> u8"a" was introduced in C11.
> u8'a' was introduced in C23.
Thank you, I stand corrected.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
| From | Tim Rentsch <tr.17687@z991.linuxsc.com> |
|---|---|
| Date | 2025-12-21 22:37 -0800 |
| Message-ID | <86pl87rpic.fsf@linuxsc.com> |
| In reply to | #395827 |
Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>
>> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
>
> [...]
>
>>> The <uchar.h> header was introduced in C11. In C11 and C17,
>>> that header defines char16_t and char32_t. C23 introduces char8_t.
>>>
>>> There doesn't seem to be any way, other than checking the value of
>>> __STDC_VERSION__ to determine whether char8_t is defined or not.
>>> There are not *_MIN or *_MAX macros for these types, either in
>>> <uchar.h> or in <limits.h>. A test program I just wrote would have
>>> been a little simpler if I could have used `#ifdef CHAR8_MAX`.
>
> [...]
>
>> Since C23 defines char8_t to be the same type as unsigned char,
>> it seems better to just define it when it isn't there:
>>
>> #include <limits.h>
>>
>> #if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
>> typedef unsigned char char8_t;
>> #endif
>
> Yes. And the test for CHAR_BIT may not be necessary, depending on
> the programmer's intent. char8_t is the same type as unsigned char
> even if CHAR_BIT > 8.
That's humorous. It's like a name designed to be confusing or
misleading. But thank you for the information, I wouldn't have
guessed it.
> Similarly, char16_t and char32_t are the same type as
> uint_least16_t and uint_least32_t, respectively.
Kind of weird, but at least it's consistent, and it explains why
char8_t is the same as unsigned char. Then again, why not
uint_least8_t? Has C23 changed to the point where unsigned char
and uint_least8_t have to be the same type? My recollection is
that in earlier editions of the C standard it is possible, even
if unlikely, for these types to be distinct.
| From | Bonita Montero <Bonita.Montero@gmail.com> |
|---|---|
| Date | 2025-10-21 10:35 +0200 |
| Message-ID | <10d7gkt$3phdk$1@raubtier-asyl.eternal-september.org> |
| In reply to | #394598 |
On 20.10.2025 at 20:35, Thiago Adams wrote:
> speaking on signed x unsigned,
>
> u8"a" in C11 had the type char [N]. Normally char is signed
>
> in C23 it is unsigned char8_t [N].
>
> when converting code from c11 to c23 we have a error here
> const char* s = u8""
>
>
> I generally "cast char* " to "unsigned char*" when handling something
> with utf8. I am not u8"" , I use just " " with utf8 encoded source code
> and I just assume const char* is utf8.
>
What is there to discuss ? Just cast and that's it.
| From | Thiago Adams <thiago.adams@gmail.com> |
|---|---|
| Date | 2025-10-21 07:07 -0300 |
| Message-ID | <10d7m1v$3qtfe$1@dont-email.me> |
| In reply to | #394624 |
On 21/10/2025 05:35, Bonita Montero wrote:
> On 20.10.2025 at 20:35, Thiago Adams wrote:
>> speaking on signed x unsigned,
>>
>> u8"a" in C11 had the type char [N]. Normally char is signed
>>
>> in C23 it is unsigned char8_t [N].
>>
>> when converting code from c11 to c23 we have a error here
>> const char* s = u8""
>>
>>
>> I generally "cast char* " to "unsigned char*" when handling something
>> with utf8. I am not u8"" , I use just " " with utf8 encoded source code
>> and I just assume const char* is utf8.
>>
>
> What is there to discuss ? Just cast and that's it.
When converting code from c11 to c23 we have a error here
const char* s = u8""
I think it is a big change..the ones C does not normally do.
| From | Bonita Montero <Bonita.Montero@gmail.com> |
|---|---|
| Date | 2025-10-21 12:09 +0200 |
| Message-ID | <10d7m44$3quev$1@raubtier-asyl.eternal-september.org> |
| In reply to | #394626 |
On 21.10.2025 at 12:07, Thiago Adams wrote:
> On 21/10/2025 05:35, Bonita Montero wrote:
>> On 20.10.2025 at 20:35, Thiago Adams wrote:
>>> speaking on signed x unsigned,
>>>
>>> u8"a" in C11 had the type char [N]. Normally char is signed
>>>
>>> in C23 it is unsigned char8_t [N].
>>>
>>> when converting code from c11 to c23 we have a error here
>>> const char* s = u8""
>>>
>>>
>>> I generally "cast char* " to "unsigned char*" when handling something
>>> with utf8. I am not u8"" , I use just " " with utf8 encoded source code
>>> and I just assume const char* is utf8.
>>>
>>
>> What is there to discuss ? Just cast and that's it.
>
> When converting code from c11 to c23 we have a error here
> const char* s = u8""
No, because the null-terminator doesn't become negative with that. ;-)
>
> I think it is a big change..the ones C does not normally do.
| From | BGB <cr88192@gmail.com> |
|---|---|
| Date | 2025-12-16 14:59 -0600 |
| Message-ID | <10hsh6l$30btn$1@dont-email.me> |
| In reply to | #394598 |
On 10/20/2025 1:35 PM, Thiago Adams wrote:
> speaking on signed x unsigned,
>
> u8"a" in C11 had the type char [N]. Normally char is signed
>
> in C23 it is unsigned char8_t [N].
>
> when converting code from c11 to c23 we have a error here
> const char* s = u8""
>
>
>
>
>
>
> I generally "cast char* " to "unsigned char*" when handling something
> with utf8. I am not u8"" , I use just " " with utf8 encoded source code
> and I just assume const char* is utf8.
>
It may not be so simple, as source-code bytes don't necessarily map 1:1
with string literal bytes (and are more likely to be translated than
passed through as-is).
Implicitly, it may depend on the default locale and similar assumed by
the C compiler.
If the source-code is UTF-8, and the default locale is UTF-8, then OK.
More conservative though is to assume that the default locale's
character encoding is potentially something like 8859-1 or 1252, which
will not preserve UTF-8 codepoints if not mapped into an area supported
by the relevant encoding (so, things may get remapped).
So, you need a UTF-8 string literal or similar to specify that the
string does in-fact encode text as UTF-8.
In a compiler, one may need to try to detect and deal with text
encoding, say:
ASCII text:
  No BOM, limited range of characters
  (0x20..0x7E, 0x09, 0x0D, 0x0A, etc).
UTF-8:
  Also includes 80..F4
  Only allow valid codepoint sequences
  May include a BOM
8859-1 or 1252:
  Includes 80..FF, excludes text which is also valid as UTF-8.
  No BOM.
  Other encodings possible,
  like 437 / KOI-8 / JIS / etc,
  but far less common than 1252.
  No good way to distinguish them reliably.
UTF-16 (*1):
  Even number of bytes
  Strongly hinted if even or odd bytes are frequently NUL;
    Frequent even NUL: UTF-16, likely big-endian;
    Frequent odd NUL: UTF-16, likely little-endian;
  Excluded if matching the pattern for one of the above;
    If text is valid ASCII or UTF-8, assume these instead.
  May include a BOM.
*1: More commonly produced by older versions of Visual Studio or
Notepad, if a non-ASCII codepoint was present. Newer versions tend to
default to UTF-8 instead.
Compiler may normalize on UTF-8 or similar internally, but this again
doesn't mean it can be assumed for string literals (which are more
likely to be mashed into 1252 or something, such as with a compiler like
MSVC).
Though, that said, it does seem that GCC defaults to assuming UTF-8 if
nothing else is specified. So, UTF-8 => UTF-8 with default string
literals may be workable if one also assumes that the code is always
compiled with GCC or similar.
Though, curiously, it seems newer MSVC will still use UTF-8 with a
default string literal if the character is given as "\uXXXX", but will
use a single-byte encoding in other cases.
Checking, newer versions of MSVC are also aware of u8 literals.
...