From: Tim Rentsch
Newsgroups: comp.lang.c
Subject: Re: Programming exercise/challenge
Date: Tue, 29 Dec 2020 19:18:44 -0800
Organization: A noiseless patient Spider
Lines: 133
Message-ID: <86v9ckugkr.fsf@linuxsc.com>
References: <86wnxwkyol.fsf@linuxsc.com> <871rg2rffu.fsf@bsb.me.uk> <86v9dehts2.fsf@linuxsc.com> <87360hq0si.fsf@bsb.me.uk> <1bpzH.151400$zz79.48736@fx17.ams4> <86lfe779k6.fsf@linuxsc.com> <865z4ryphr.fsf@linuxsc.com> <877dp79cl9.fsf@nosuchdomain.example.com> <8635zrxx30.fsf@linuxsc.com>

Richard Damon writes:
> On 12/27/20 7:17 AM, Tim Rentsch wrote:
>
>> Keith Thompson writes:
>>
>>> Tim Rentsch writes:
>>>
>>>> Richard Damon writes:
>>>>
>>>>> On 12/9/20 1:55 AM, Tim Rentsch wrote:
>>>>>
>>>>>> Bart writes:
>>>>>> [...]
>>>>>>
>>>>>>> The spec did say to make your own decisions on corner cases.
>>>>>>
>>>>>> Corner cases are meant to be only for input that the C
>>>>>> standard specifies as undefined behavior.
>>>>>
>>>>> or implementation defined or unspecified behavior, like the
>>>>> \<space><newline> case.
>>>>
>>>> No, I meant what I said. Furthermore any compiler that
>>>> accepts \<space><newline> as a line continuation is not
>>>> conforming, as I have explained else-thread.
>>>
>>> I thought I had seen (and perhaps even made) an argument that phase 1:
>>>
>>> Physical source file multibyte characters are mapped, in an
>>> implementation-defined manner, to the source character set
>>> (introducing new-line characters for end-of-line indicators) if
>>> necessary.
>>>
>>> could include removing trailing spaces. I admit it's a bit of a
>>> stretch of the meaning of "mapped".
>>
>> There is no way to make that work. Let me call the two kinds
>> of spaces <space>[PSF] and <space>[SCS]. If we have a physical
>> source file with a line
>>
>> int<space>[PSF]x;<space>[PSF]<space>[PSF]<newline>[PSF]
>>
>> presumably you would want that to map to
>>
>> int<space>[SCS]x;<newline>[SCS]
>>
>> which means <space>[PSF] would be a single-byte character and
>> also part of a non-single-byte multi-byte character. It can't
>> be both.
>
> Who says that is prohibited at a system level? Note that this sort of
> thing DOES happen in some character sets, where you need a bit of
> lookahead to decode what a character means.

Yes, it wouldn't be surprising to see such a thing in cases where
for example the unadorned character is an 'e' and the adorned
character is an 'e' with an accent.

On the other hand, it's a safe bet that no existing character set
has a multi-byte character that is simply a redundant representation
of a single-byte character, or has unbounded lookahead. The point
of mapping physical source multi-byte characters is to conform to
an externally chosen representation, not to let a C implementation
transmogrify the input according to its whims.

Short summary: probably technically within the letter of the C
standard, but surely not the intended meaning.
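
To make the lookahead point concrete, here is a rough sketch in C of
the kind of phase 1 mapping being proposed, one that drops spaces in
front of a newline. The name map_strip_trailing is made up for
illustration and is not taken from any compiler. The thing to notice
is that a space cannot be emitted until the next non-space character
is seen, so the number of characters held back has no fixed bound.

```c
#include <stddef.h>

/* Hypothetical phase-1 "mapping" that drops spaces before a newline.
   Not taken from any compiler; it only illustrates the lookahead.
   Writes the mapped text to out (which must hold at least
   strlen(in) + 1 bytes) and returns the largest number of spaces
   that had to be held back at any one time. */
size_t map_strip_trailing(const char *in, char *out)
{
    size_t pending = 0, max_pending = 0;

    for (; *in != '\0'; in++) {
        if (*in == ' ') {
            /* Can't emit yet: the meaning depends on what follows. */
            if (++pending > max_pending)
                max_pending = pending;
        } else {
            if (*in != '\n')
                while (pending > 0) { *out++ = ' '; pending--; }
            pending = 0;        /* in front of '\n' the spaces vanish */
            *out++ = *in;
        }
    }
    while (pending > 0) { *out++ = ' '; pending--; }  /* no final newline */
    *out = '\0';
    return max_pending;
}
```

Given "int x;   \n" this produces "int x;\n" after holding back three
spaces; a longer run of spaces would need a correspondingly longer
wait.
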
> Note also, that the issue does NOT exist at the level of the Source
> Character Set, as by the point we get to that, the spaces before the
> newline have been removed, so the source character set does not have
> that issue.

The source /character set/ certainly does have the problem. A
bizarro mapping can eliminate the possibility of a particular input
during phase 1, but the source character set still has the ability
to represent the undesired inputs.
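
For reference, a minimal sketch of phase 2 line splicing, assuming
the text has already come through phase 1 with any trailing spaces
intact. The name splice_lines is hypothetical. Only a backslash
immediately followed by a new-line is removed, so a backslash, space,
new-line sequence is left alone.

```c
#include <stddef.h>

/* Sketch of translation phase 2: delete each backslash that is
   immediately followed by a new-line, together with that new-line.
   Splices in place and returns the new length. */
size_t splice_lines(char *s)
{
    char *dst = s;
    const char *src = s;

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n')
            src += 2;           /* strict match: nothing in between */
        else
            *dst++ = *src++;
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

With this, "#define X \\\n 1\n" becomes one logical line, while the
same text with a space after the backslash comes out unchanged.
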
>>> Aside from what the standard actually says, I certainly think it's
>>> more convenient to be able to ignore spaces at the end of a line
>>> following a \ character. Treating backslash+space at the end of a
>>> line differently than backslash at the end of a line is awkward,
>>> even if it's conforming. I'd like to see a modification to phase 1
>>> saying that white space between \ and the end of a line is
>>> discarded.
>>
>> I oppose such a change. It's a needless complication and an
>> unnecessary incompatibility. I have compiled probably tens of
>> millions of lines, if not hundreds of millions of lines, of C
>> code, and I don't remember ever seeing this problem except in
>> cases constructed specifically to test this rule. Moreover if
>> someone wants to guard against end-of-line spaces it's almost
>> trivial to do that with a single grep or sed command. Or just
>> compile with gcc, which will give a diagnostic in the particular
>> case of spaces/tabs between \ and newline. Any change to the C
>> standard, and especially any change that causes incompatibility
>> between different versions, should yield substantial benefits to
>> justify its introduction. The case we're talking about here
>> occurs so rarely that it is nowhere close to reaching the bar.
>
> Maybe I am just 'luckier' than you, as I HAVE seen cases in the wild
> where it made a difference. Code was copied and pasted from an article
> and cleaned up. The result was that much of the code had trailing
> spaces, and the lines that had multi-line macros just didn't compile
> right.
>
> The person who did this was TOTALLY puzzled by the error messages,
> because in the editor it was clear that this was a continuation line,
> but only with careful operation of the editor could the space after the
> \ be detected.

Even so, the ROI is very close to zero. It isn't like this happens
every day, or even every year; you get puzzled once, and after it
happens the makefiles are fixed so it doesn't happen again. Done.

> As far as I know, the only way valid code would be broken with this
> would be a line oriented comment that ends with a \ followed by spaces
> then the next line is not a comment. (the most likely cause would be a
> comment ending with a Linux path name to a directory, especially to root

    #define BACKSLANT \ (with spaces after the \)
    #define SOMETHING ...

> I am not sure that such a line would be considered good practice anyway,
> as if you ever DID one of the cleanups you propose, it would break that
> line.

Oh nonsense. Using grep would flag the line but not change
anything. An intentional white-space-after-backslant could
be done using a TAB character rather than a space, assuming
one wanted to do that. Or a sed command could change spaces
after a backslant into '\/*!*/' and then a subsequent grep
could look for that pattern. etc...
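
If grep or sed isn't handy, much the same check can be sketched in a
few lines of C. This is only an illustration under my own assumptions
(the function name is made up, and the whole file is assumed to be in
memory as a string); it reports the first line where spaces or tabs
sit between a backslant and the newline and, like grep, changes
nothing.

```c
#include <stddef.h>

/* Returns the 1-based number of the first line in text that ends in
   a backslash followed by one or more spaces or tabs, or 0 if there
   is no such line.  The text itself is never modified. */
size_t first_backslash_blank_eol(const char *text)
{
    size_t line = 1;
    const char *p = text;

    while (*p != '\0') {
        const char *eol = p;
        while (*eol != '\0' && *eol != '\n')
            eol++;                      /* find the end of this line */

        const char *q = eol;
        while (q > p && (q[-1] == ' ' || q[-1] == '\t'))
            q--;                        /* back up over trailing blanks */

        if (q < eol && q > p && q[-1] == '\\')
            return line;                /* blanks directly after a backslash */

        if (*eol == '\0')
            break;
        p = eol + 1;
        line++;
    }
    return 0;
}
```

A line ending in a bare backslash, or one with trailing blanks but no
backslash, is not reported.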