From: Tim Rentsch
Newsgroups: comp.lang.c
Subject: Re: signed/unsigned - what will fail
Date: Wed, 16 Aug 2023 20:12:19 -0700
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <86bkf65pm4.fsf@linuxsc.com>
References: <21d1ef97-8620-4115-b412-7279e0ef4d6bn@googlegroups.com> <7ffec8c7-1b4c-4c3c-9342-daed7af19dabn@googlegroups.com> <89f530cb-dd82-46f9-9567-a1f81e55d239n@googlegroups.com> <6fcfcd10-82e7-4ecb-8bec-e6292ff73322n@googlegroups.com> <87zg2qfyy8.fsf@bsb.me.uk>
Ben Bacarisse writes:
> David Brown writes:
>
>> On 16/08/2023 09:02, Malcolm McLean wrote:
>>
>>> On Wednesday, 16 August 2023 at 07:53:32 UTC+1, David Brown wrote:
>>>
>>>> It is, of course, a terrible idea to do any kind of arithmetic or
>>>> hold numbers in plain char - use them for 7-bit ASCII characters
>>>> only.
>>>
>>> I pass about UTF-8 as char *s in Baby X. But of course it is
>>> converted to unsigned char for the actual UTF-8 manipulations.
>
> I think too much is often made of this. What is it that you are
> worried about? A lot of UTF-8 fiddling is masking values that will
> have been promoted to int. The masking can be more portable than
> the conversion.
I'm not sure the question is so clear-cut. First, the pointer
conversion (from char * to unsigned char *) is absolutely
guaranteed to work, and accesses through the unsigned char * are
guaranteed to work. The only question is what bits you get.
Of course, if the implementation is two's complement then it
doesn't matter. But if it isn't, where did the bits come from?
That matters because values that go through functions
may have been converted to -- or re-interpreted as, it isn't
clear which -- unsigned char. If a file is being read that was
produced under a different implementation, reading the bytes as
char rather than unsigned char could result in incorrect
values. Or, unfortunately, vice versa.
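
As a rough sketch of the two ways of getting at the bits, plus one
place where a conversion to unsigned char happens behind the scenes:
the function names below are made up for illustration, CHAR_BIT is
assumed to be 8, and the choice of fputc/fgetc is only an assumption
about which library functions are relevant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: read element i of a char string through an
   unsigned char pointer.  Both the pointer conversion and the access
   are well defined; the result is the stored bit pattern. */
static unsigned byte_via_pointer(const char *s, size_t i)
{
    assert(s != NULL);
    const unsigned char *u = (const unsigned char *)s;
    return u[i];
}

/* Hypothetical helper: read the same element as a plain char value
   (promoted to int), then mask down to the low eight bits. */
static unsigned byte_via_mask(const char *s, size_t i)
{
    assert(s != NULL);
    return (unsigned)(s[i] & 0xFF);
}

/* Hypothetical helper: write one byte to an empty update stream and
   read it back.  fputc converts its argument to unsigned char before
   writing; fgetc returns the byte as an unsigned char converted to
   int.  Returns EOF on failure. */
static int round_trip_byte(FILE *fp, int value)
{
    assert(fp != NULL);
    if (fputc(value, fp) == EOF)
        return EOF;
    rewind(fp);            /* required between output and input */
    return fgetc(fp);
}
```

On a two's complement implementation the two byte_* functions agree
for every element. The stdio pair hands back 0xC3 for 0xC3 whether
plain char is signed or not; it is only when that int is stored into
a plain char that the signedness of char starts to matter.
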
Speaking for myself, normally I would prefer to do UTF-8-style
processing through unsigned char pointers. My reasoning is that it's
easier to think about and (probably) more likely to work in the
unusual cases. Also, now that I think of it, it's safer, because
unsigned char cannot have trap representations. Also, if there is
some sort of encoding problem, I have more confidence in solving
the problem working directly on unsigned char values than in
reasoning through what will happen when working with the signed
values. Conversely, if I were reading code that was doing UTF-8
processing and using plain char, I think I would need to work
harder to understand how it works. So FWIW there is a personal
view.
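
To make the preference concrete, here is a minimal sketch of decoding
one UTF-8 sequence through an unsigned char pointer. The name
utf8_decode and its interface are invented for this example, and
CHAR_BIT is assumed to be 8; it is one way to write such a routine,
not a reference implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoder: decode one UTF-8 sequence starting at s, with
   at most n bytes available.  On success, store the code point in
   *out and return the number of bytes consumed (1 to 4).  Return 0
   for a malformed, truncated, overlong, or out-of-range sequence. */
static size_t utf8_decode(const char *s, size_t n, unsigned long *out)
{
    assert(s != NULL && out != NULL);
    /* All comparisons and masks below act on values 0..255, so the
       signedness of plain char never enters the picture. */
    const unsigned char *p = (const unsigned char *)s;
    unsigned long cp, min;
    size_t len, i;

    if (n == 0)
        return 0;
    if (p[0] < 0x80)                { cp = p[0];        len = 1; min = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; min = 0x10000; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = cp;
    return len;
}
```

The caller can keep passing char * around, as in the Baby X case
quoted above; the conversion to unsigned char * happens once, at the
top of the function, and everything after it works on unsigned values.
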