From: Tim Rentsch
Newsgroups: comp.lang.c
Subject: Re: signed/unsigned - what will fail
Date: Wed, 16 Aug 2023 20:12:19 -0700
Message-ID: <86bkf65pm4.fsf@linuxsc.com>

Ben Bacarisse writes:

> David Brown writes:
>
>> On 16/08/2023 09:02, Malcolm McLean wrote:
>>
>>> On Wednesday, 16 August 2023 at 07:53:32 UTC+1, David Brown wrote:
>>>
>>>> It is, of course, a terrible idea to do any kind of arithmetic or
>>>> hold numbers in plain char - use them for 7-bit ASCII characters
>>>> only.
>>>
>>> I pass about UTF-8 as char *s in Baby X. But of course it is
>>> converted to unsigned char for the actual UTF-8 manipulations.
>
> I think too much is often made of this. What is it that you are
> worried about? A lot of UTF-8 fiddling is masking values that will
> have been promoted to int. The masking can be more portable than
> the conversion.

I'm not sure the question is so clear cut. First the pointer
conversion (from char * to unsigned char *) is absolutely guaranteed
to work, and accesses through the unsigned char * are guaranteed to
work. The only question is what bits do you get.
Of course if the implementation is two's complement then it doesn't
matter. But if it isn't, where did the bits come from? That matters
because values that go through functions may have been converted to
-- or re-interpreted as, it isn't clear which -- unsigned char. If a
file is being read that was produced under a different implementation,
reading the bytes as char's rather than unsigned char's could result
in incorrect values. Or, unfortunately, vice versa.

Speaking for myself normally I would prefer to do UTF8-style
processing through unsigned char pointers. My reasoning is it's
easier to think about and (probably) more likely to work in the
unusual cases. Also, now that I think of it, safer, because unsigned
char cannot have trap representations. Also if there is some sort of
encoding problem I have more confidence in solving the problem working
directly on unsigned char values than in reasoning through what will
happen when working with the signed values.

Conversely if I were reading code that was doing UTF8 processing and
using plain char, I think I would need to work harder to understand
how it works.

So FWIW there is a personal view.
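
To make the distinction concrete, here is a minimal sketch in C of the
two ways of getting at a byte that the thread is comparing, plus the
kind of masking that UTF-8 code does once it has an unsigned char
value. The function names (byte_views_agree, utf8_seq_len,
utf8_decode2) are hypothetical and chosen only for illustration. The
classification looks at the lead-byte bit pattern only, and does not
reject overlong or otherwise invalid sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Compare the two ways of viewing the first byte of s as unsigned char.
   On a two's complement implementation they give the same result; the
   question in the thread is what happens when they might not. */
int byte_views_agree(const char *s)
{
    assert(s != NULL);
    /* Value conversion: the char value is reduced modulo UCHAR_MAX + 1. */
    unsigned char by_value = (unsigned char)s[0];
    /* Pointer conversion: the stored byte is read as it sits in memory. */
    unsigned char by_pointer = *(const unsigned char *)s;
    return by_value == by_pointer;
}

/* Number of bytes announced by the lead byte at s, or 0 if the byte is
   a continuation byte or not a valid lead byte pattern. */
size_t utf8_seq_len(const char *s)
{
    assert(s != NULL);
    const unsigned char *u = (const unsigned char *)s;
    unsigned char b = u[0];

    if (b < 0x80)           return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                           /* 10xxxxxx or 11111xxx */
}

/* Decode a two-byte sequence, assuming the caller has already checked
   that s points at a 110xxxxx lead byte followed by a 10xxxxxx byte. */
unsigned long utf8_decode2(const char *s)
{
    assert(s != NULL);
    const unsigned char *u = (const unsigned char *)s;
    return ((unsigned long)(u[0] & 0x1F) << 6) | (unsigned long)(u[1] & 0x3F);
}
```

With the two bytes C3 A9 (the UTF-8 encoding of U+00E9) as input,
utf8_seq_len gives 2 and utf8_decode2 gives 0xE9. Because every access
goes through an unsigned char pointer, no step depends on whether
plain char is signed.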