
multi bytes character - how to make it defined behavior?

Started by	Thiago Adams <thiago.adams@gmail.com>
First post	2024-08-13 11:45 -0300
Last post	2024-08-13 23:44 -0400
Articles	19 — 6 participants


  multi bytes character - how to make it defined behavior? Thiago Adams <thiago.adams@gmail.com> - 2024-08-13 11:45 -0300
    Re: multi bytes character - how to make it defined behavior? Bart <bc@freeuk.com> - 2024-08-14 00:52 +0100
      Re: multi bytes character - how to make it defined behavior? Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2024-08-13 17:33 -0700
        Re: multi bytes character - how to make it defined behavior? Thiago Adams <thiago.adams@gmail.com> - 2024-08-14 08:41 -0300
          Re: multi bytes character - how to make it defined behavior? Bart <bc@freeuk.com> - 2024-08-14 14:05 +0100
            Re: multi bytes character - how to make it defined behavior? Thiago Adams <thiago.adams@gmail.com> - 2024-08-14 10:31 -0300
              Re: multi bytes character - how to make it defined behavior? Bart <bc@freeuk.com> - 2024-08-14 16:34 +0100
                Re: multi bytes character - how to make it defined behavior? Thiago Adams <thiago.adams@gmail.com> - 2024-08-14 13:10 -0300
                  Re: multi bytes character - how to make it defined behavior? Thiago Adams <thiago.adams@gmail.com> - 2024-08-14 13:27 -0300
                  Re: multi bytes character - how to make it defined behavior? Bart <bc@freeuk.com> - 2024-08-14 18:07 +0100
                    Re: multi bytes character - how to make it defined behavior? Thiago Adams <thiago.adams@gmail.com> - 2024-08-14 14:40 -0300
                      Re: multi bytes character - how to make it defined behavior? Bart <bc@freeuk.com> - 2024-08-14 19:12 +0100
                        Re: multi bytes character - how to make it defined behavior? Thiago Adams <thiago.adams@gmail.com> - 2024-08-14 15:28 -0300
                          Re: multi bytes character - how to make it defined behavior? Bart <bc@freeuk.com> - 2024-08-14 20:32 +0100
                  Re: multi bytes character - how to make it defined behavior? Lawrence D'Oliveiro <ldo@nz.invalid> - 2024-08-15 02:43 +0000
              Re: multi bytes character - how to make it defined behavior? Lawrence D'Oliveiro <ldo@nz.invalid> - 2024-08-15 02:41 +0000
            Re: multi bytes character - how to make it defined behavior? Lawrence D'Oliveiro <ldo@nz.invalid> - 2024-08-15 01:39 +0000
    Re: multi bytes character - how to make it defined behavior? Ben Bacarisse <ben@bsb.me.uk> - 2024-08-14 01:32 +0100
    Re: multi bytes character - how to make it defined behavior? Richard Damon <richard@damon-family.org> - 2024-08-13 23:44 -0400

#387544 — multi bytes character - how to make it defined behavior?

From	Thiago Adams <thiago.adams@gmail.com>
Date	2024-08-13 11:45 -0300
Subject	multi bytes character - how to make it defined behavior?
Message-ID	<v9frim$3u7qi$1@dont-email.me>

static_assert('×' == 50071);

GCC -  warning multi byte
CLANG - error character too large

I think instead of "multi bytes" we need "multi characters" - not bytes.

We decode the UTF-8 first; then we have the characters and can decide 
whether it is multi char or not.

Decoding '×' would consume bytes 195 and 151; the result is the decoded 
Unicode value, 215.

It is not the multi-byte value 256*195 + 151 = 50071.

On the other hand, 'ab' is "multi character", resulting in

256 * 'a' + 'b' = 256*97+98= 24930

One consequence is that

'ab' == '𤤰'

But I don't think this is a problem. At least everything is defined.
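
To make the two numbers concrete, here is a minimal sketch of both computations for a two-byte sequence. The function names are made up for illustration, and the decoder assumes a well-formed two-byte UTF-8 sequence (no validation):

```c
#include <stdint.h>

/* Pack two raw bytes base-256; this is the arithmetic behind 50071. */
uint32_t pack_two_bytes(uint8_t hi, uint8_t lo)
{
    return (uint32_t)hi * 256u + lo;
}

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) to its code point.
   Assumption: the caller passes a well-formed sequence. */
uint32_t decode_utf8_two_bytes(uint8_t lead, uint8_t cont)
{
    return ((uint32_t)(lead & 0x1Fu) << 6) | (uint32_t)(cont & 0x3Fu);
}
```

With bytes 195 and 151, the first function gives 50071 and the second gives 215.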


#387550

From	Bart <bc@freeuk.com>
Date	2024-08-14 00:52 +0100
Message-ID	<v9grjd$4cjd$1@dont-email.me>
In reply to	#387544

On 13/08/2024 15:45, Thiago Adams wrote:
> static_assert('×' == 50071);
> 
> GCC -  warning multi byte
> CLANG - error character too large
> 
> I think instead of "multi bytes" we need "multi characters" - not bytes.
> 
> We decode utf8 then we have the character to decide if it is multi char 
> or not.
> 
> decoding '×' would consume bytes 195 and 151 the result is the decoded 
> Unicode value of 215.
> 
> It is not multi byte : 256*195 + 151 = 50071
> 
> O the other hand 'ab' is "multi character" resulting
> 
> 256 * 'a' + 'b' = 256*97+98= 24930
> 
> One consequence is that
> 
> 'ab' == '𤤰'
> 
> But I don't think this is a problem. At least everything is defined.

What exactly do you mean by multi-byte characters? Is it a literal such 
as 'ABCD'?

I've no idea what C makes of that, so you will first have to specify 
what it might represent:

* Is it a single character represented by multiple bytes?

* If so, do those multiple bytes specify a Unicode number (2-3 bytes), 
or a UTF8 sequence (up to 4 bytes, maybe more)?

* If those multiple sequences are allowed, could you have more than one 
mixed ASCII/Unicode/UTF8 character?

One problem with UTF8 in C character literals is that I believe those 
are limited to an 'int' type, so 32 bits. You can't fit much in there. 
And once you have such a value, how do you print it?

Some of this you can take care of in your 'cake' product, and 
superimpose a particular spec on top of C (maybe they can be extended to 
64 bits) but you probably can't do much about 'printf'.

(In my language, I overhauled this part of it earlier this year. There 
it works like this:

* Character literals can be 64 bits

* They can represent up to 8 ASCII characters: 'ABCDEFGH'

* They can include escape codes for both Unicode and UTF8, and multiple
   such characters can be specified:

    'A\u20ACB'            # All represent A€B; this is Unicode
    'A\h E2 82 AC\B'      # This is UTF8
    'A\xE2\x82\xACB'      # C-style escape

   Internally they are stored as UTF8, so the 20AC is converted to UTF8

* The ordering of the characters matches that of the equivalent
   "A\e20ACB" string when stored in memory; but this applies only to
   little-endian

* Print routines have options to print the first character (which can be
   a Unicode one), or the whole sequence)
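
For readers following along in C, the conversion from a Unicode number such as 20AC to UTF-8 bytes could look like the sketch below. This is not code from the language described above; `utf8_encode` is a hypothetical name and the function is only a straightforward rendering of the standard UTF-8 bit layout:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[0..3].
   Returns the number of bytes written, or 0 for surrogates and
   for values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling it with 0x20AC writes the three bytes E2 82 AC; calling it with 0xD7 writes C3 97.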

Another aspect is when typing Unicode text directly via your text editor 
instead of using escape codes; will the C source be UTF8, or some other 
encoding? This will affect how the text is represented, and how much you 
can fit into one 32/64-bit literal.


#387553

From	Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date	2024-08-13 17:33 -0700
Message-ID	<87sev8eydx.fsf@nosuchdomain.example.com>
In reply to	#387550

Bart <bc@freeuk.com> writes:
[...]
> What exactly do you mean by multi-byte characters? Is it a literal
> such as 'ABCD'?
>
> I've no idea what C makes of that,

It's a character constant of type int with an implementation-defined
value.  Read the section on "Character constants" in the C standard
(6.4.4.4 in C17).

(With gcc, its value is 0x41424344, but other compilers can and do
behave differently.)
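
As an illustration only, the gcc value can be reproduced without using a multi-character constant at all. The sketch below assumes 8-bit chars and an ASCII execution character set; `pack_chars` is a hypothetical helper, not anything the standard or gcc provides:

```c
#include <stddef.h>
#include <stdint.h>

/* Combine n characters base-256, first character most significant.
   Characters beyond the fourth push the earliest ones out of the
   32-bit result. */
uint32_t pack_chars(const char *s, size_t n)
{
    uint32_t value = 0;
    for (size_t i = 0; i < n; i++)
        value = (value << 8) | (unsigned char)s[i];
    return value;
}
```

`pack_chars("ABCD", 4)` yields 0x41424344, the same value gcc reports for 'ABCD'.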

We discussed this at some length several years ago.

[...]

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */


#387564

From	Thiago Adams <thiago.adams@gmail.com>
Date	2024-08-14 08:41 -0300
Message-ID	<v9i54d$e3c6$1@dont-email.me>
In reply to	#387553

On 13/08/2024 21:33, Keith Thompson wrote:
> Bart<bc@freeuk.com>  writes:
> [...]
>> What exactly do you mean by multi-byte characters? Is it a literal
>> such as 'ABCD'?
>>
>> I've no idea what C makes of that,
> It's a character constant of type int with an implementation-defined
> value.  Read the section on "Character constants" in the C standard
> (6.4.4.4 in C17).
> 
> (With gcc, its value is 0x41424344, but other compilers can and do
> behave differently.)
> 
> We discussed this at some length several years ago.
> 
> [...]

"An integer character constant has type int. The value of an integer 
character constant containing a single character that maps to a single 
value in the literal encoding (6.2.9) is the numerical value of the 
representation of the mapped character in the literal encoding 
interpreted as an integer. The value of an integer character constant 
containing more than one character (e.g. 'ab'), or containing a 
character or escape sequence that does not map to a single value in the 
literal encoding, is implementation-defined. If an integer character 
constant contains a single character or escape sequence, its value is 
the one that results when an object with type char whose value is that 
of the single character or escape sequence is converted to type int."

I am suggesting we define this:

"The value of an integer character constant containing more than one 
character (e.g. 'ab'), or containing a character or escape sequence that 
does not map to a single value in the literal encoding, is 
implementation-defined."

How?

First, all source code should be utf8.

Then I am suggesting we first decode the bytes.

For instance, '×' is encoded with 195 and 151. We consume these 2 bytes 
and the utf8 decoded value is 215.

Then this is the defined behavior

static_assert('×' == 215)

In case we have 'ab' for instance:
First we decode 'a' 97 then 'b' 98. We consume one byte each.
Then we have two characters. In this case we do

256 * 'a' + 'b' = 256*97+98= 24930

static_assert('ab' == 24930)

I believe this static_assert('ab' == 24930) matches the way it is used 
today.

In case the value is bigger than INT_MAX I think it should be unsigned int.
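
The decode-first rule can be sketched roughly as follows. This is only one reading of the proposal: the names `utf8_decode` and `proposed_value` are made up, overlong forms and surrogates are not rejected, and combining decoded code points above 255 with the same base-256 step is an assumption (the post only spells out the ASCII case):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available).
   Stores the code point in *cp and returns the bytes consumed,
   or 0 if the sequence is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else
        return 0;                   /* stray continuation or invalid byte */
    if (len > n)
        return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* expected a 10xxxxxx byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Value of the text between the quotes under the decode-first rule:
   decode each character, then combine the characters base-256.
   Returns 0 on success, -1 on malformed input. */
int proposed_value(const unsigned char *s, size_t n, uint32_t *value)
{
    uint32_t v = 0;
    size_t pos = 0;
    while (pos < n) {
        uint32_t cp;
        size_t used = utf8_decode(s + pos, n - pos, &cp);
        if (used == 0)
            return -1;
        v = v * 256u + cp;          /* unsigned arithmetic, wraps on overflow */
        pos += used;
    }
    *value = v;
    return 0;
}
```

Fed the bytes C3 97 it produces 215, and fed "ab" it produces 24930, matching the two static_asserts above.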

Why?

Adding fixes on top of fixes makes the language bigger and more complex,
like adding U'' L'' u8'' etc.

In my source code I use only utf8, everything just works without any 
u8"" etc.


#387565

From	Bart <bc@freeuk.com>
Date	2024-08-14 14:05 +0100
Message-ID	<v9ia2i$f3p2$1@dont-email.me>
In reply to	#387564

On 14/08/2024 12:41, Thiago Adams wrote:
> On 13/08/2024 21:33, Keith Thompson wrote:
>> Bart<bc@freeuk.com>  writes:
>> [...]
>>> What exactly do you mean by multi-byte characters? Is it a literal
>>> such as 'ABCD'?
>>>
>>> I've no idea what C makes of that,
>> It's a character constant of type int with an implementation-defined
>> value.  Read the section on "Character constants" in the C standard
>> (6.4.4.4 in C17).
>>
>> (With gcc, its value is 0x41424344, but other compilers can and do
>> behave differently.)
>>
>> We discussed this at some length several years ago.
>>
>> [...]
> 
> 
> "An integer character constant has type int. The value of an integer 
> character constant containing
> a single character that maps to a single value in the literal encoding 
> (6.2.9) is the numerical value
> of the representation of the mapped character in the literal encoding 
> interpreted as an integer.
> The value of an integer character constant containing more than one 
> character (e.g. ’ab’), or
> containing a character or escape sequence that does not map to a single 
> value in the literal encoding,
> is implementation-defined. If an integer character constant contains a 
> single character or escape
> sequence, its value is the one that results when an object with type 
> char whose value is that of the
> single character or escape sequence is converted to type int."
> 
> 
> I am suggesting the define this:
> 
> "The value of an integer character constant containing more than one 
> character (e.g. ’ab’), or containing a character or escape sequence that 
> does not map to a single value in the literal encoding, is 
> implementation-defined."
> 
> How?
> 
> First, all source code should be utf8.
> 
> Then I am suggesting we first decode the bytes.
> 
> For instance, '×' is encoded with 195 and 151. We consume these 2 bytes 
> and the utf8 decoded value is 215.

By that you mean the Unicode index. But you say elsewhere that 
everything in your source code is UTF8.

Where then does the 215 appear? Do your char* strings use 215 for ×, or 
do they use 195 and 215?

I think this is why C requires those prefixes like u8'...'.

> 
> Then this is the defined behavior
> 
> static_assert('×' == 215)

This is where you need to decide whether the integer value within '...', 
AT RUNTIME, represents the Unicode index or the UTF8 sequence.

(In my language, though I do very little with Unicode ATM, I decided 
that everything is UTF8 both at compile time and runtime. Unless I 
explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either 
will work), which contains 21-bit Unicode index values.)

I get the impression that C's wide characters are intended for those 
Unicode indices, but that's not going to work well on Windows with its 
16-bit wide character type.


#387566

From	Thiago Adams <thiago.adams@gmail.com>
Date	2024-08-14 10:31 -0300
Message-ID	<v9ibkf$e3c6$2@dont-email.me>
In reply to	#387565

On 14/08/2024 10:05, Bart wrote:
> On 14/08/2024 12:41, Thiago Adams wrote:
>> On 13/08/2024 21:33, Keith Thompson wrote:
>>> Bart<bc@freeuk.com>  writes:
>>> [...]
>>>> What exactly do you mean by multi-byte characters? Is it a literal
>>>> such as 'ABCD'?
>>>>
>>>> I've no idea what C makes of that,
>>> It's a character constant of type int with an implementation-defined
>>> value.  Read the section on "Character constants" in the C standard
>>> (6.4.4.4 in C17).
>>>
>>> (With gcc, its value is 0x41424344, but other compilers can and do
>>> behave differently.)
>>>
>>> We discussed this at some length several years ago.
>>>
>>> [...]
>>
>>
>> "An integer character constant has type int. The value of an integer 
>> character constant containing
>> a single character that maps to a single value in the literal encoding 
>> (6.2.9) is the numerical value
>> of the representation of the mapped character in the literal encoding 
>> interpreted as an integer.
>> The value of an integer character constant containing more than one 
>> character (e.g. ’ab’), or
>> containing a character or escape sequence that does not map to a 
>> single value in the literal encoding,
>> is implementation-defined. If an integer character constant contains a 
>> single character or escape
>> sequence, its value is the one that results when an object with type 
>> char whose value is that of the
>> single character or escape sequence is converted to type int."
>>
>>
>> I am suggesting the define this:
>>
>> "The value of an integer character constant containing more than one 
>> character (e.g. ’ab’), or containing a character or escape sequence 
>> that does not map to a single value in the literal encoding, is 
>> implementation-defined."
>>
>> How?
>>
>> First, all source code should be utf8.
>>
>> Then I am suggesting we first decode the bytes.
>>
>> For instance, '×' is encoded with 195 and 151. We consume these 2 
>> bytes and the utf8 decoded value is 215.
> 
> By that you mean the Unicode index. But you say elsewhere that 
> everything in your source code is UTF8.


215 is the unicode number of the character '×'.

> Where then does the 215 appear? Do your char* strings use 215 for ×, or 
> do they use 195 and 215?

215 is the result of decoding two utf8 encoded bytes. (195 and 151)

> I think this is why C requires those prefixes like u8'...'.

>>
>> Then this is the defined behavior
>>
>> static_assert('×' == 215)
> 
> This is where you need to decide whether the integer value within '...', 
> AT RUNTIME, represents the Unicode index or the UTF8 sequence.

why runtime? It is compile time. This is why source code must be 
universally encoded (utf8)


> (In my language, though I do very little with Unicode ATM, I decided 
> that everything is UTF8 both at compile time and runtime. Unless I 
> explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either 
> will work), which contains 21-bit Unicode index values.)
> 
> I get the impression that C's wide characters are intended for those 
> Unicode indices, but that's not going to work well on Windows with its 
> 16-bit wide character type.
> 

nowadays wide is just for windows API compatibility.


#387568

From	Bart <bc@freeuk.com>
Date	2024-08-14 16:34 +0100
Message-ID	<v9iipe$gl5i$1@dont-email.me>
In reply to	#387566

On 14/08/2024 14:31, Thiago Adams wrote:
> On 14/08/2024 10:05, Bart wrote:
>> On 14/08/2024 12:41, Thiago Adams wrote:
>>> On 13/08/2024 21:33, Keith Thompson wrote:
>>>> Bart<bc@freeuk.com>  writes:
>>>> [...]
>>>>> What exactly do you mean by multi-byte characters? Is it a literal
>>>>> such as 'ABCD'?
>>>>>
>>>>> I've no idea what C makes of that,
>>>> It's a character constant of type int with an implementation-defined
>>>> value.  Read the section on "Character constants" in the C standard
>>>> (6.4.4.4 in C17).
>>>>
>>>> (With gcc, its value is 0x41424344, but other compilers can and do
>>>> behave differently.)
>>>>
>>>> We discussed this at some length several years ago.
>>>>
>>>> [...]
>>>
>>>
>>> "An integer character constant has type int. The value of an integer 
>>> character constant containing
>>> a single character that maps to a single value in the literal 
>>> encoding (6.2.9) is the numerical value
>>> of the representation of the mapped character in the literal encoding 
>>> interpreted as an integer.
>>> The value of an integer character constant containing more than one 
>>> character (e.g. ’ab’), or
>>> containing a character or escape sequence that does not map to a 
>>> single value in the literal encoding,
>>> is implementation-defined. If an integer character constant contains 
>>> a single character or escape
>>> sequence, its value is the one that results when an object with type 
>>> char whose value is that of the
>>> single character or escape sequence is converted to type int."
>>>
>>>
>>> I am suggesting the define this:
>>>
>>> "The value of an integer character constant containing more than one 
>>> character (e.g. ’ab’), or containing a character or escape sequence 
>>> that does not map to a single value in the literal encoding, is 
>>> implementation-defined."
>>>
>>> How?
>>>
>>> First, all source code should be utf8.
>>>
>>> Then I am suggesting we first decode the bytes.
>>>
>>> For instance, '×' is encoded with 195 and 151. We consume these 2 
>>> bytes and the utf8 decoded value is 215.
>>
>> By that you mean the Unicode index. But you say elsewhere that 
>> everything in your source code is UTF8.
> 
> 
> 215 is the unicode number of the character '×'.
> 
>> Where then does the 215 appear? Do your char* strings use 215 for ×, 
>> or do they use 195 and 215?
> 
> 215 is the result of decoding two utf8 encoded bytes. (195 and 151)
> 
>> I think this is why C requires those prefixes like u8'...'.
> 
>>>
>>> Then this is the defined behavior
>>>
>>> static_assert('×' == 215)
>>
>> This is where you need to decide whether the integer value within 
>> '...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
> 
> why runtime? It is compile time. This is why source code must be 
> universally encoded (utf8)


In that case I don't understand what you are testing for here. Is it an 
error for '×' to be 215, or an error for it not to be?

And what is the test for, to ensure encoding is UTF8 in this ... source 
file? ... compiler?

Where would the 'decoded 215' come into it?


#387569

From	Thiago Adams <thiago.adams@gmail.com>
Date	2024-08-14 13:10 -0300
Message-ID	<v9iksr$gvc8$1@dont-email.me>
In reply to	#387568

On 14/08/2024 12:34, Bart wrote:
> On 14/08/2024 14:31, Thiago Adams wrote:
>> On 14/08/2024 10:05, Bart wrote:
>>> On 14/08/2024 12:41, Thiago Adams wrote:
>>>> On 13/08/2024 21:33, Keith Thompson wrote:
>>>>> Bart<bc@freeuk.com>  writes:
>>>>> [...]
>>>>>> What exactly do you mean by multi-byte characters? Is it a literal
>>>>>> such as 'ABCD'?
>>>>>>
>>>>>> I've no idea what C makes of that,
>>>>> It's a character constant of type int with an implementation-defined
>>>>> value.  Read the section on "Character constants" in the C standard
>>>>> (6.4.4.4 in C17).
>>>>>
>>>>> (With gcc, its value is 0x41424344, but other compilers can and do
>>>>> behave differently.)
>>>>>
>>>>> We discussed this at some length several years ago.
>>>>>
>>>>> [...]
>>>>
>>>>
>>>> "An integer character constant has type int. The value of an integer 
>>>> character constant containing
>>>> a single character that maps to a single value in the literal 
>>>> encoding (6.2.9) is the numerical value
>>>> of the representation of the mapped character in the literal 
>>>> encoding interpreted as an integer.
>>>> The value of an integer character constant containing more than one 
>>>> character (e.g. ’ab’), or
>>>> containing a character or escape sequence that does not map to a 
>>>> single value in the literal encoding,
>>>> is implementation-defined. If an integer character constant contains 
>>>> a single character or escape
>>>> sequence, its value is the one that results when an object with type 
>>>> char whose value is that of the
>>>> single character or escape sequence is converted to type int."
>>>>
>>>>
>>>> I am suggesting the define this:
>>>>
>>>> "The value of an integer character constant containing more than one 
>>>> character (e.g. ’ab’), or containing a character or escape sequence 
>>>> that does not map to a single value in the literal encoding, is 
>>>> implementation-defined."
>>>>
>>>> How?
>>>>
>>>> First, all source code should be utf8.
>>>>
>>>> Then I am suggesting we first decode the bytes.
>>>>
>>>> For instance, '×' is encoded with 195 and 151. We consume these 2 
>>>> bytes and the utf8 decoded value is 215.
>>>
>>> By that you mean the Unicode index. But you say elsewhere that 
>>> everything in your source code is UTF8.
>>
>>
>> 215 is the unicode number of the character '×'.
>>
>>> Where then does the 215 appear? Do your char* strings use 215 for ×, 
>>> or do they use 195 and 215?
>>
>> 215 is the result of decoding two utf8 encoded bytes. (195 and 151)
>>
>>> I think this is why C requires those prefixes like u8'...'.
>>
>>>>
>>>> Then this is the defined behavior
>>>>
>>>> static_assert('×' == 215)
>>>
>>> This is where you need to decide whether the integer value within 
>>> '...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
>>
>> why runtime? It is compile time. This is why source code must be 
>> universally encoded (utf8)
> 
> 
> In that case I don't understand what you are testing for here. Is it an 
> error for '×' to be 215, or an error for it not to be?


GCC handles this as multibyte. Without decoding.

The result of GCC is 50071
static_assert('×' == 50071);

The explanation is that GCC is doing:

256*195 + 151 = 50071

(Remember the utf8 bytes were 195 151)

The way 'ab' is handled is the same as '×' on GCC. Clang gives an error 
for that. The standard just says the value is implementation-defined.

> And what is the test for, to ensure encoding is UTF8 in this ... source 
> file? ... compiler?

MSVC has some checks; I don't know what the logic is.


> Where would the 'decoded 215' come into it?

215 is the value after decoding utf8 and producing the unicode value.

So my suggestion is decode first.

The bad part of my suggestion is that we may have two different ways of 
producing the same value.

For instance, the number generated by 'ab' is the same as

'ab' == '𤤰'

The advantage is to converge to utf8 unicode and make it specified.


#387570

From	Thiago Adams <thiago.adams@gmail.com>
Date	2024-08-14 13:27 -0300
Message-ID	<v9ilte$gvc8$2@dont-email.me>
In reply to	#387569

On 14/08/2024 13:10, Thiago Adams wrote:
> On 14/08/2024 12:34, Bart wrote:
>> On 14/08/2024 14:31, Thiago Adams wrote:
>>> On 14/08/2024 10:05, Bart wrote:
>>>> On 14/08/2024 12:41, Thiago Adams wrote:
>>>>> On 13/08/2024 21:33, Keith Thompson wrote:
>>>>>> Bart<bc@freeuk.com>  writes:
>>>>>> [...]
>>>>>>> What exactly do you mean by multi-byte characters? Is it a literal
>>>>>>> such as 'ABCD'?
>>>>>>>
>>>>>>> I've no idea what C makes of that,
>>>>>> It's a character constant of type int with an implementation-defined
>>>>>> value.  Read the section on "Character constants" in the C standard
>>>>>> (6.4.4.4 in C17).
>>>>>>
>>>>>> (With gcc, its value is 0x41424344, but other compilers can and do
>>>>>> behave differently.)
>>>>>>
>>>>>> We discussed this at some length several years ago.
>>>>>>
>>>>>> [...]
>>>>>
>>>>>
>>>>> "An integer character constant has type int. The value of an 
>>>>> integer character constant containing
>>>>> a single character that maps to a single value in the literal 
>>>>> encoding (6.2.9) is the numerical value
>>>>> of the representation of the mapped character in the literal 
>>>>> encoding interpreted as an integer.
>>>>> The value of an integer character constant containing more than one 
>>>>> character (e.g. ’ab’), or
>>>>> containing a character or escape sequence that does not map to a 
>>>>> single value in the literal encoding,
>>>>> is implementation-defined. If an integer character constant 
>>>>> contains a single character or escape
>>>>> sequence, its value is the one that results when an object with 
>>>>> type char whose value is that of the
>>>>> single character or escape sequence is converted to type int."
>>>>>
>>>>>
>>>>> I am suggesting the define this:
>>>>>
>>>>> "The value of an integer character constant containing more than 
>>>>> one character (e.g. ’ab’), or containing a character or escape 
>>>>> sequence that does not map to a single value in the literal 
>>>>> encoding, is implementation-defined."
>>>>>
>>>>> How?
>>>>>
>>>>> First, all source code should be utf8.
>>>>>
>>>>> Then I am suggesting we first decode the bytes.
>>>>>
>>>>> For instance, '×' is encoded with 195 and 151. We consume these 2 
>>>>> bytes and the utf8 decoded value is 215.
>>>>
>>>> By that you mean the Unicode index. But you say elsewhere that 
>>>> everything in your source code is UTF8.
>>>
>>>
>>> 215 is the unicode number of the character '×'.
>>>
>>>> Where then does the 215 appear? Do your char* strings use 215 for ×, 
>>>> or do they use 195 and 215?
>>>
>>> 215 is the result of decoding two utf8 encoded bytes. (195 and 151)
>>>
>>>> I think this is why C requires those prefixes like u8'...'.
>>>
>>>>>
>>>>> Then this is the defined behavior
>>>>>
>>>>> static_assert('×' == 215)
>>>>
>>>> This is where you need to decide whether the integer value within 
>>>> '...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
>>>
>>> why runtime? It is compile time. This is why source code must be 
>>> universally encoded (utf8)
>>
>>
>> In that case I don't understand what you are testing for here. Is it 
>> an error for '×' to be 215, or an error for it not to be?
> 
> 
> GCC handles this as multibyte. Without decoding.
> 
> The result of GCC is 50071
> static_assert('×' == 50071);
> 
> The explanation is that GCC is doing:
> 
> 256*195 + 151 = 50071
> 
> (Remember the utf8 bytes were 195 151)
> 
> The way 'ab' is handled is the same of '×' on GCC. Clang have a error 
> for that. The standard just says the value is implementation defined.
> 
>> And what is the test for, to ensure encoding is UTF8 in this ... 
>> source file? ... compiler?
> 
> MSVC has some checks, I don't know that is the logic.
> 
> 
>> Where would the 'decoded 215' come into it?
> 
> 215 is the value after decoding utf8 and producing the unicode value.
> 
> So my suggestion is decode first.
> 
> The bad part of my suggestion we may have two different ways of 
> producing the same value.
> 
> For instance the number generated by ab is the same of
> 
> 'ab' == '𤤰'
> 
> The advantage is to converge to utf8 unicode and make it specified.
> 
> 
> 

I use multibyte chars in my code.

For instance:
enum token {TK_EQUAL = '=='};

I prefer to write and read token.type == '==' rather than
token.type == TK_EQUAL.

An alternative for me also could be a macro.

if (token.type == MC('=', '=')) {...}

but then it's worse than type == TK_EQUAL
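
For reference, a macro along these lines could be sketched as below. The definition of MC is not shown in this thread, so this one is a guess; it assumes 8-bit chars, and the enumerator names are only examples:

```c
/* Hypothetical definition: combine two characters base-256 so that the
   result is an integer constant expression with a well-defined value. */
#define MC(a, b) (((unsigned char)(a) << 8) | (unsigned char)(b))

enum token_type {
    TK_EQUAL     = MC('=', '='),   /* 61*256 + 61 in ASCII */
    TK_NOT_EQUAL = MC('!', '='),   /* 33*256 + 61 in ASCII */
};
```

Because the macro expands to a constant expression, it also works in case labels, e.g. `case MC('=', '='):`, and gives the same number on every compiler that uses ASCII, without the multi-character warning.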


#387571

From	Bart <bc@freeuk.com>
Date	2024-08-14 18:07 +0100
Message-ID	<v9io8c$h8v8$1@dont-email.me>
In reply to	#387569

On 14/08/2024 17:10, Thiago Adams wrote:
> On 14/08/2024 12:34, Bart wrote:

>> In that case I don't understand what you are testing for here. Is it 
>> an error for '×' to be 215, or an error for it not to be?
> 
> 
> GCC handles this as multibyte. Without decoding.
> 
> The result of GCC is 50071
> static_assert('×' == 50071);
> 
> The explanation is that GCC is doing:
> 
> 256*195 + 151 = 50071

So the 50071 is the 2-byte UTF8 sequence.

> (Remember the utf8 bytes were 195 151)
> 
> The way 'ab' is handled is the same of '×' on GCC.

I don't understand. 'a' and 'b' each occupy one byte. Together they need 
two bytes.

Where's the problem? Are you perhaps confused as to what UTF8 is?

The 50071 above is much better expressed as hex: C397, which is two 
bytes. Since both values are in 128..255, they are UTF8 codes, here 
expressing a single Unicode character.

Given any two bytes in UTF8, it is easy to see whether they are two 
ASCII characters, or one (or part of one) Unicode character, or one ASCII 
character followed by the first byte of a UTF8 sequence, or if they are 
malformed (eg. the middle of a UTF8 sequence).

There is no confusion.
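
A minimal sketch of that per-byte test, using the standard UTF-8 bit patterns (the enum and function names are made up for illustration):

```c
enum utf8_byte_kind {
    UTF8_ASCII,         /* 0xxxxxxx: a complete one-byte character */
    UTF8_LEAD,          /* first byte of a 2-, 3- or 4-byte sequence */
    UTF8_CONTINUATION,  /* 10xxxxxx: middle or end of a sequence */
    UTF8_INVALID        /* byte values that never occur in valid UTF-8 */
};

enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if (b < 0x80)
        return UTF8_ASCII;
    if ((b & 0xC0) == 0x80)
        return UTF8_CONTINUATION;
    if (b >= 0xC2 && b <= 0xF4)
        return UTF8_LEAD;
    return UTF8_INVALID;    /* 0xC0, 0xC1 and 0xF5..0xFF */
}
```

For the bytes C3 97 this gives a lead byte followed by a continuation byte, whereas 61 62 gives two ASCII bytes.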

>> And what is the test for, to ensure encoding is UTF8 in this ... 
>> source file? ... compiler?
> 
> MSVC has some checks, I don't know that is the logic.
> 
> 
>> Where would the 'decoded 215' come into it?
> 
> 215 is the value after decoding utf8 and producing the unicode value.

Who or what does that, and for what purpose? From what I've seen, only 
you have introduced it.

> So my suggestion is decode first.

Why? What are you comparing? Both sides of == must use UTF8 or Unicode, 
but why introduce Unicode at all if apparently everything in source code 
and at compile time, as you yourself have stated, is UTF8?

> The bad part of my suggestion we may have two different ways of 
> producing the same value.
> 
> For instance the number generated by ab is the same of
> 
> 'ab' == '𤤰'

I don't think so. If I run this program:

  #include <stdio.h>
  #include <string.h>

  int main() {
    printf("%u\n", '×');
    printf("%04X\n", '×');
    printf("%u\n", 'ab');
    printf("%04X\n", 'ab');
    printf("%u\n", '𤤰');
    printf("%04X\n", '𤤰');
  }

I get this output (I've left out the decimal versions for clarity):

C397                ×

6162                ab

F0A4A4B0            𤤰

That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to 
clash with some other Unicode character.


#387572

From	Thiago Adams <thiago.adams@gmail.com>
Date	2024-08-14 14:40 -0300
Message-ID	<v9iq5k$hhhs$1@dont-email.me>
In reply to	#387571

On 14/08/2024 14:07, Bart wrote:
> On 14/08/2024 17:10, Thiago Adams wrote:
>> On 14/08/2024 12:34, Bart wrote:
> 
>>> In that case I don't understand what you are testing for here. Is it 
>>> an error for '×' to be 215, or an error for it not to be?
>>
>>
>> GCC handles this as multibyte. Without decoding.
>>
>> The result of GCC is 50071
>> static_assert('×' == 50071);
>>
>> The explanation is that GCC is doing:
>>
>> 256*195 + 151 = 50071
> 
> So the 50071 is the 2-byte UTF8 sequence.

50071 is the result of multiplying the first byte 195*256 and adding the 
second byte 151. (This is NOT UTF8 related; this is the way C compilers 
generate the value.)

On the other hand, DECODING bytes 195 and 151 as UTF8 gives us 215, 
which is the Unicode value.


> 
> 
>> (Remember the utf8 bytes were 195 151)
>>
>> The way 'ab' is handled is the same of '×' on GCC.
> 
> I don't understand. 'a' and 'b' each occupy one byte. Together they need 
> two bytes.
> Where's the problem? Are you perhaps confused as to what UTF8 is?

I am not confused.

The problem is that the value of 'ab' is not defined by the C standard 
(it is implementation-defined). I want to use it, but it produces a warning.


> 
> The 50071 above is much better expressed as hex: C397, which is two 
> bytes. Since both values are in 128..255, they are UTF8 codes, here 
> expressing a single Unicode character.


I am using '==' etc.. to represent token numbers.


> Given any two bytes in UTF8, it is easy to see whether they are two 
> ASCII character, or one (or part of) a Unicode characters, or one ASCII 
> character followed by the first byte of a UTF8 sequence, or if they are 
> malformed (eg. the middle of a UTF8 sequence).
> 
> There is no confusion.
> 
> 
> 
>>> And what is the test for, to ensure encoding is UTF8 in this ... 
>>> source file? ... compiler?
>>
>> MSVC has some checks, I don't know that is the logic.
>>
>>
>>> Where would the 'decoded 215' come into it?
>>
>> 215 is the value after decoding utf8 and producing the unicode value.
> 
> Who or what does that, and for what purpose? From what I've seen, only 
> you have introduced it.

?
Any modern language gives '×' the value 215 (the Unicode value). But these 
languages don't allow multi-character literals like 'ab'.
In newer languages a character literal behaves like U'×' in C.

>> So my suggestion is decode first.
> 
> Why? What are you comparing? Both sides of == must use UTF8 or Unicode, 
> but why introduce Unicode at all if apparently everything in source code 
> and at compile time, as you yourself have stated, is UTF8?
> 
>> The bad part of my suggestion we may have two different ways of 
>> producing the same value.
>>
>> For instance the number generated by ab is the same of
>>
>> 'ab' == '𤤰'
> 
> I don't think so. If I run this program:
> 
>   #include <stdio.h>
>   #include <string.h>
> 
>   int main() {
>     printf("%u\n", '×');
>     printf("%04X\n", '×');
>     printf("%u\n", 'ab');
>     printf("%04X\n", 'ab');
>     printf("%u\n", '𤤰');
>     printf("%04X\n", '𤤰');
>   }

This is not running the algorithm I am suggesting! This 'ab' == '𤤰' 
happens only in the way I am suggesting. No compiler is doing that today.
(I never imagined this would cause such confusion.)



> 
> I get this output (I've left out the decimal versions for clarity):
> 
> C397                ×
> 
> 6162                ab
> 
> F0A4A4B0            𤤰
> 
> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to 
> clash with some other Unicode character.
> 
> 

My suggestion again. I am using a string here, but imagine this working 
with bytes from a file.


#include <stdio.h>
#include <assert.h>

const unsigned char* utf8_decode(const unsigned char* s, int* c)
{
     if (s[0] == '\0')
     {
         *c = 0;
         return NULL; /*end*/
     }

     const unsigned char*  next = NULL;
     if (s[0] < 0x80)
     {
         *c = s[0];
         assert(*c >= 0x0000 && *c <= 0x007F);
         next = s + 1;
     }
     else if ((s[0] & 0xe0) == 0xc0)
     {
         *c = ((int)(s[0] & 0x1f) << 6) |
             ((int)(s[1] & 0x3f) << 0);
         assert(*c >= 0x0080 && *c <= 0x07FF);
         next = s + 2;
     }
     else if ((s[0] & 0xf0) == 0xe0)
     {
         *c = ((int)(s[0] & 0x0f) << 12) |
             ((int)(s[1] & 0x3f) << 6) |
             ((int)(s[2] & 0x3f) << 0);
         assert(*c >= 0x0800 && *c <= 0xFFFF);
         next = s + 3;
     }
     else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4))
     {
         *c = ((int)(s[0] & 0x07) << 18) |
             ((int)(s[1] & 0x3f) << 12) |
             ((int)(s[2] & 0x3f) << 6) |
             ((int)(s[3] & 0x3f) << 0);
         assert(*c >= 0x10000 && *c <= 0x10FFFF);
         next = s + 4;
     }
     else
     {
         *c = -1;      // invalid
         next = s + 1; // skip this byte
     }

     if (*c >= 0xd800 && *c <= 0xdfff)
     {
         *c = -1; // surrogate half
     }

     return next;
}

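int get_value(const char* s0)
{
    /* utf8_decode works on unsigned bytes, so convert the pointer once */
    const unsigned char * s = (const unsigned char *)s0;
    int value = 0;
    int  uc;
    s = utf8_decode(s, &uc);
    while (s)
    {
      if (uc < 0)
      {
         return -1; //invalid UTF-8, do not fold it into the value
      }
      if (uc <= 0x007F)
      {
         //multichar formula (ASCII characters only)
         value = value*256+uc;
      }
      else
      {
         //single char: use the decoded Unicode value
         value = uc;
         break; //check if there is more then error..
      }
      s = utf8_decode(s, &uc);
    }
    return value;
}

int main(void){
   /* the casts keep this valid where u8"" has an unsigned element type */
   printf("%d\n", get_value((const char*)u8"×"));  // 215
   printf("%d\n", get_value((const char*)u8"ab")); // 24930
}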


#387573

From	Bart <bc@freeuk.com>
Date	2024-08-14 19:12 +0100
Message-ID	<v9is2h$i0sd$1@dont-email.me>
In reply to	#387572

On 14/08/2024 18:40, Thiago Adams wrote:
> On 14/08/2024 14:07, Bart wrote:

>> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to 
>> clash with some other Unicode character.
>>
>>
> 
> My suggestion again. I am using string but imagine this working with 
> bytes from file.
> 
> 
> #include <stdio.h>
> #include <assert.h>

...
> int get_value(const char* s0)
> {
>     const char * s = s0;
>     int value = 0;
>     int  uc;
>     s = utf8_decode(s, &uc);
>     while (s)
>     {
>       if (uc < 0x007F)
>       {
>          //multichar formula
>          value = value*256+uc;
>       }
>       else
>       {
>          //single char
>          value = uc;
>          break; //check if there is more then error..
>       }
>       s = utf8_decode(s, &uc);
>     }
>     return value;
> }
> 
> int main(){
>    printf("%d\n", get_value(u8"×"));
>    printf("%d\n", get_value(u8"ab"));
> }

I see your problem. You're mixing things up.

gcc will combine BYTE values together (by shifting by 8 bits or 
multiplying by 256), including the individual bytes that represent UTF8.

You are combining ONLY ASCII bytes, and comparing the results with 
21-bit Unicode values.

That is meaningless. I'm not surprised you get a clash between A*256+B, 
and some arbitrary Unicode index.
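
To make the clash concrete, here is a sketch assuming the proposed rule (pack ASCII characters, decode everything else). The helper names are made up for illustration:

```c
#include <assert.h>

/* Pack two ASCII characters as base-256 digits, as in 'ab'. */
long pack_ascii2(char a, char b)
{
    assert((unsigned char)a < 0x80 && (unsigned char)b < 0x80);
    return (long)a * 256 + b;
}

/* Decode a 3-byte UTF-8 sequence (1110xxxx 10yyyyyy 10zzzzzz). */
long decode3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    assert((b0 & 0xF0) == 0xE0);
    assert((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80);
    return ((long)(b0 & 0x0F) << 12) | ((long)(b1 & 0x3F) << 6) | (b2 & 0x3F);
}
```

pack_ascii2('a', 'b') is 0x6162 (24930 decimal), and decode3(0xE6, 0x85, 0xA2) is also 0x6162, so under that rule 'ab' and the character U+6162 would compare equal. Note that F0 A4 A4 B0 decodes to 0x24930, which is 149808 decimal, not 24930.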


#387574

From	Thiago Adams <thiago.adams@gmail.com>
Date	2024-08-14 15:28 -0300
Message-ID	<v9isvq$i0fs$1@dont-email.me>
In reply to	#387573

On 14/08/2024 15:12, Bart wrote:
> On 14/08/2024 18:40, Thiago Adams wrote:
>> On 14/08/2024 14:07, Bart wrote:
> 
>>> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to 
>>> clash with some other Unicode character.
>>>
>>>
>>
>> My suggestion again. I am using string but imagine this working with 
>> bytes from file.
>>
>>
>> #include <stdio.h>
>> #include <assert.h>
> 
> ...
>> int get_value(const char* s0)
>> {
>>     const char * s = s0;
>>     int value = 0;
>>     int  uc;
>>     s = utf8_decode(s, &uc);
>>     while (s)
>>     {
>>       if (uc < 0x007F)
>>       {
>>          //multichar formula
>>          value = value*256+uc;
>>       }
>>       else
>>       {
>>          //single char
>>          value = uc;
>>          break; //check if there is more then error..
>>       }
>>       s = utf8_decode(s, &uc);
>>     }
>>     return value;
>> }
>>
>> int main(){
>>    printf("%d\n", get_value(u8"×"));
>>    printf("%d\n", get_value(u8"ab"));
>> }
> 
> I see your problem. You're mixing things up.


The objective is:
  - make single characters have the Unicode value without having to use U''
  - allow more than one character, like 'ab', in cases where each 
character is less than 0x007F. This can break existing code, for 
instance '¼¼', but I suspect (and hope) people are not using it this way.

> gcc will combine BYTE values together (by shifting by 8 bits or 
> multiplying by 256), including the individual bytes that represent UTF8.
> 
> You are combining ONLY ASCII bytes, and comparing the results with 
> 21-bit Unicode values.
> 
> That is meaningless. I'm not surprised you get a clash between A*256+B, 
> and some arbitrary Unicode index.
> 

In any case, my suggestion looks dangerous. But meanwhile this is not 
well specified in the standard.


#387575

From	Bart <bc@freeuk.com>
Date	2024-08-14 20:32 +0100
Message-ID	<v9j0oe$in82$1@dont-email.me>
In reply to	#387574

On 14/08/2024 19:28, Thiago Adams wrote:
> On 14/08/2024 15:12, Bart wrote:
>> On 14/08/2024 18:40, Thiago Adams wrote:
>>> On 14/08/2024 14:07, Bart wrote:
>>
>>>> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to 
>>>> clash with some other Unicode character.
>>>>
>>>>
>>>
>>> My suggestion again. I am using string but imagine this working with 
>>> bytes from file.
>>>
>>>
>>> #include <stdio.h>
>>> #include <assert.h>
>>
>> ...
>>> int get_value(const char* s0)
>>> {
>>>     const char * s = s0;
>>>     int value = 0;
>>>     int  uc;
>>>     s = utf8_decode(s, &uc);
>>>     while (s)
>>>     {
>>>       if (uc < 0x007F)
>>>       {
>>>          //multichar formula
>>>          value = value*256+uc;
>>>       }
>>>       else
>>>       {
>>>          //single char
>>>          value = uc;
>>>          break; //check if there is more then error..
>>>       }
>>>       s = utf8_decode(s, &uc);
>>>     }
>>>     return value;
>>> }
>>>
>>> int main(){
>>>    printf("%d\n", get_value(u8"×"));
>>>    printf("%d\n", get_value(u8"ab"));
>>> }
>>
>> I see your problem. You're mixing things up.
> 
> 
> The objective is :
>   - make single characters have the Unicode value without  having to use 
> U''
>   - allow more than one chars like 'ab' in some cases where each 
> character is less than 0x007F. This can break code for instance '¼¼'.
> but I am suspecting people are not using in this way (I hope)

Obviously that can't work, for example because two printable ASCII 
characters with codes 32 to 96 will have values from 8224 (32*256+32) to 
24672 (96*256+96) when combined in a character literal. Those are going 
to clash with Unicode characters with those values.

It won't work either at compile-time or runtime.

You need to choose between Unicode representation and UTF8. Either that 
or use some prefix to disambiguate in source code, but you still need to 
decide whether '€' in source code is represented as the Unicode bytes 20 
AC (or maybe 00 20 AC) or the UTF8 sequence E2 82 AC, and further decide 
which end of those sequences will be the least significant byte.
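
As a reference for the two representations, here is a minimal sketch of a UTF-8 encoder. utf8_encode is a name made up for illustration, not a library function:

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogate halves are not valid code points */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond the Unicode range */
}
```

For the code point 0x20AC this writes the three bytes E2 82 AC, and for 0xD7 it writes C3 97.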

> In any case..my suggestion looks dangerous. But meanwhile this is not 
> well specified in the standard.

It wasn't well-specified even when dealing with 100% ASCII. For example, 
'AB' might have the hex value 0x4142 on one compiler, 0x4241 on another, 
maybe just 0x41 or 0x42 on a third, or even 0x41410000.
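
If the byte order matters, one way around it is to build the value explicitly instead of relying on 'AB'. A sketch with made-up helper names, showing the two most common orders:

```c
#include <assert.h>

/* First character most significant: "AB" -> 0x4142. */
unsigned tag2_first_high(const char *s)
{
    assert(s[0] != '\0' && s[1] != '\0' && s[2] == '\0');
    return ((unsigned)(unsigned char)s[0] << 8) | (unsigned char)s[1];
}

/* First character least significant: "AB" -> 0x4241. */
unsigned tag2_first_low(const char *s)
{
    assert(s[0] != '\0' && s[1] != '\0' && s[2] == '\0');
    return ((unsigned)(unsigned char)s[1] << 8) | (unsigned char)s[0];
}
```

Here the code, not the compiler, chooses the order, so the result is the same on every implementation that uses ASCII.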


#387578

From	Lawrence D'Oliveiro <ldo@nz.invalid>
Date	2024-08-15 02:43 +0000
Message-ID	<v9jpvn$q8q3$3@dont-email.me>
In reply to	#387569

On Wed, 14 Aug 2024 13:10:01 -0300, Thiago Adams wrote:

> The result of GCC is 50071 static_assert('×' == 50071);
> 
> The explanation is that GCC is doing:
> 
> 256*195 + 151 = 50071
> 
> (Remember the utf8 bytes were 195 151)

That would be an endian-dependent interpretation.


#387577

From	Lawrence D'Oliveiro <ldo@nz.invalid>
Date	2024-08-15 02:41 +0000
Message-ID	<v9jptc$q8q3$2@dont-email.me>
In reply to	#387566

On Wed, 14 Aug 2024 10:31:59 -0300, Thiago Adams wrote:

> 215 is the unicode number of the character '×'.

Be careful about the use of the term “character” in Unicode.

Unicode defines “code points”. A “grapheme” (which I think is their term 
for “character”) can be made up of one or more “code points”, with no 
upper limit on their number.
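
A small illustration of the difference, as a sketch that assumes valid UTF-8 input (count_code_points is a made-up name):

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by counting
   every byte that is not a continuation byte (10xxxxxx).
   Assumes the input is already valid UTF-8. */
size_t count_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

The string "e\xCC\x81" (the letter e followed by the combining acute accent U+0301) counts as 2 code points although it displays as a single accented letter, while the precomposed form "\xC3\xA9" (U+00E9) counts as 1.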


#387576

From	Lawrence D'Oliveiro <ldo@nz.invalid>
Date	2024-08-15 01:39 +0000
Message-ID	<v9jm8f$m2ot$1@dont-email.me>
In reply to	#387565

On Wed, 14 Aug 2024 14:05:22 +0100, Bart wrote:

> I get the impression that C's wide characters are intended for those
> Unicode indices, but that's not going to work well on Windows with its
> 16-bit wide character type.

Unfortunately, Windows (like Java) is shackled to the UTF-16 Albatross, 
owing to embracing Unicode at exactly the wrong time.


#387552

From	Ben Bacarisse <ben@bsb.me.uk>
Date	2024-08-14 01:32 +0100
Message-ID	<874j7ot04x.fsf@bsb.me.uk>
In reply to	#387544

Thiago Adams <thiago.adams@gmail.com> writes:

> static_assert('×' == 50071);

static_assert(U'×' == 215);

works, but then I don't know what you were trying to do.

> GCC -  warning multi byte
> CLANG - error character too large
>
> I think instead of "multi bytes" we need "multi characters" - not
> bytes.
>
> We decode utf8 then we have the character to decide if it is multi char or
> not.

These terms can be confusing and I don't know exactly how you are using
them.  Basically I simply don't know what that second sentence is
saying.

> decoding '×' would consume bytes 195 and 151 the result is the decoded
> Unicode value of 215.

Yes, Unicode 215 is UTF-8 encoded as two bytes with values 195 and 151.

> It is not multi byte : 256*195 + 151 = 50071

If that × is UTF-8 encoded then it might look, to the compiler, just
like an old-fashioned multi-character character constant, just as 'ab'
does.  Then again, it might not.  gcc and clang take different views on
the matter.

You can get clang to take the same view as gcc by writing

  static_assert('\xC3\x97' == 50071);

instead.  Now both gcc and clang see it for what it is: an old-fashioned
multi-character character constant.

> O the other hand 'ab' is "multi character" resulting

The term for these things used to be "multi-byte character constant" and
they were highly non-portable.  The trouble is that the term "multi-byte
character" now refers to highly portable encodings like UTF-8.  Maybe
that's why gcc seems to have changed its warning from what you gave to:

  warning: multi-character character constant [-Wmultichar]

> 256 * 'a' + 'b' = 256*97+98= 24930
>
> One consequence is that
>
> 'ab' == '𤤰'
>
> But I don't think this is a problem. At least everything is defined.
>

-- 
Ben.


#387561

From	Richard Damon <richard@damon-family.org>
Date	2024-08-13 23:44 -0400
Message-ID	<1ffb2244967a28423c968f4b4a9fec5a2553f356@i2pn2.org>
In reply to	#387544

On 8/13/24 10:45 AM, Thiago Adams wrote:
> static_assert('×' == 50071);
> 
> GCC -  warning multi byte
> CLANG - error character too large
> 
> I think instead of "multi bytes" we need "multi characters" - not bytes.
> 
> We decode utf8 then we have the character to decide if it is multi char 
> or not.
> 
> decoding '×' would consume bytes 195 and 151 the result is the decoded 
> Unicode value of 215.
> 
> It is not multi byte : 256*195 + 151 = 50071
> 
> O the other hand 'ab' is "multi character" resulting
> 
> 256 * 'a' + 'b' = 256*97+98= 24930
> 
> One consequence is that
> 
> 'ab' == '𤤰'
> 
> But I don't think this is a problem. At least everything is defined.

When you use the single quotes by themselves ('), you are specifying 
characters in the narrow character set, which is typically ASCII but 
might be some other 8-bit character encoding. This form cannot specify 
extended characters beyond those.

You can (if the implementation allows it) place multiple characters in 
the constant to get an integer value with those characters packed.

When you use the double quotes by themselves ("), you are specifying a 
string of these narrow characters, although this form might allow for 
multi-byte encodings of some characters, like is done with UTF-8.

You can specify wide character constants by the syntax of L'x', u'x', 
or U'x'.

L'x' will give you whatever the implementation calls its "wide 
character set". This MIGHT be UCS-2/UTF-16 or UCS-4/UTF-32 encoded, but 
doesn't need to be.

The u'x' form will always be UCS-2/UTF-16, and U'x' will always be 
UCS-4/UTF-32.

Like the plain 'x' form, the result from a single character cannot be 
a multi-unit value, so u'x' can't generate a surrogate pair for a 
single source character.

Change the ' to a " and you get wide strings, just like the characters, 
but now u"xx" and L"xx" can generate characters that use surrogate pairs 
(or other multi-part encodings for L"xxx").
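
For a character outside the 16-bit range, the surrogate pair is computed as in this sketch (utf16_surrogates is a made-up name, not a library function):

```c
#include <assert.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair. */
void utf16_surrogates(unsigned long cp, unsigned *high, unsigned *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                           /* 20 bits remain */
    *high = 0xD800 + (unsigned)(cp >> 10);   /* top 10 bits */
    *low  = 0xDC00 + (unsigned)(cp & 0x3FF); /* bottom 10 bits */
}
```

For the code point 0x24930 (the ideogram discussed in this thread) this yields the two units D852 and DD30, which is why it fits in a u"..." string but not in a single u'x' constant.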
