From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.lang.c
Subject: Re: Programming exercise/challenge
Date: Thu, 24 Dec 2020 11:04:51 -0800
Organization: A noiseless patient Spider
Lines: 196
Message-ID: <86eejfyqi4.fsf@linuxsc.com>
References: <86wnxwkyol.fsf@linuxsc.com> <86r1o1hxtn.fsf@linuxsc.com> <3aBzH.1938$8f2.1340@fx16.iad> <865z5b5jk2.fsf@linuxsc.com> <zV3AH.35492$7K1.14383@fx46.iad>

Richard Damon <Richard@Damon-Family.org> writes:

> On 12/9/20 6:02 AM, Tim Rentsch wrote:
>
>> Richard Damon <Richard@Damon-Family.org> writes:
>>
>>> On 12/7/20 8:48 PM, Tim Rentsch wrote:
>>>
>>>> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>>>>
>>>>> Prompted by some recent discussion regarding 'goto' statements
>>>>> and state machines, I would like to propose a programming
>>>>> exercise.  (It is perhaps a bit too large to be called an
>>>>> exercise, but not so difficult that it deserves the label of
>>>>> challenge.  On the other hand there are some constraints so
>>>>> maybe challenge is apropos.  In any case somewhere in between
>>>>> those two bounds.)
>>>>>
>>>>> Short problem statement:  a C program to remove comments from C
>>>>> source input.
>>>>>
>>>>> Specifics:  Remove both /*...*/ and //... style comments.  Don't
>>>>> worry about trigraphs.  Read from stdin, write to stdout, and
>>>>> diagnostics (if any) go to stderr.  If EOF is seen inside a
>>>>> comment, do something sensible but it doesn't matter what as
>>>>> long as it's sensible.  Use no 'goto' statements.  Limit
>>>>> function bodies to no more than 25 lines.
>>>>>
>>>>> Other:  feel free to handle corner cases as you see fit, as long
>>>>> as there is some description of what choice was made.
>>>>>
>>>>> Hopefully it will be a fun exercise.  It isn't trivial but it
>>>>> shouldn't take too long either.
>>>>
>>>> I see there has been a fair amount of activity.  Also some
>>>> questions and some confusions, so I am prompted to give some
>>>> clarifications.
>>>>
>>>> The program is to remove (see below) C comments from a C source
>>>> file input, and nothing else.  An input with no C comments in it
>>>> should be transmitted unchanged (provided its compile-time
>>>> behavior is defined, see below).  To give an obvious example, a
>>>> multi-line macro definition that uses \ at the end of lines to
>>>> continue the definition (but has no comments) should appear in
>>>> the output exactly as in the input.
>>>>
>>>> The remark about "something sensible" for EOF might be phrased as
>>>> anything that doesn't violate The Law of Least Astonishment.
>>>> Simply stopping output is okay, either with or without an error
>>>> return or diagnostic.
>>>>
>>>> The statement about corner cases is meant to apply to compile
>>>> time undefined behavior (meaning, in the input), and nothing
>>>> else.  Any input whose compile-time specification has defined
>>>> behavior, including implementation-defined behavior, should be
>>>> processed correctly.  If an input has compile-time undefined
>>>> behavior, do something reasonable (like in the previous
>>>> paragraph), but it doesn't matter so much what exactly as long as
>>>> the user is given some idea of what that will be.
>>>>
>>>> An important property is that the program should transform
>>>> working programs into the same working programs.  In other words,
>>>> compiling an input before it is transformed should give the same
>>>> semantics as compiling it after it is transformed.  There is an
>>>> exception to this rule in cases where the C pre-processor
>>>> stringize operator is used.  In some cases removing a comment and
>>>> doing nothing else cannot be done because doing so will change
>>>> the meaning of the program.  To deal with that problem, it is
>>>> allowed to put in a space for a removed comment.  Note, putting
>>>> in a space is allowed but not required, as long as the input
>>>> program semantics (not counting stringize) is unchanged.  Hence
>>>> it is permissible for programs that use the stringize operator to
>>>> exhibit different behavior before and after being transformed.
>>>> Except for that though the output semantics should match the
>>>> input semantics.
>>>>
>>>> Re: removing comments.  This might be done by removing comments
>>>> entirely (and putting in a single space in some cases), or it
>>>> might be done by replacing comments with some "filler" white
>>>> space to, for example, preserve line numbers.  Undoubtedly it is
>>>> easier to simply take the comments out and put in a single space
>>>> in their place;  more elaborate replacements are allowed as long
>>>> as the output has no comments and preserves the program semantics
>>>> of the original input.
>>>>
>>>> Part of my motivation in offering the exercise is to see how
>>>> people handle a non-trivial "state machiney" kind of program
>>>> without using goto statements and without using "big" functions.
>>>> The choice of 25 lines is meant to codify that, but it isn't
>>>> meant as an exact hard limit - maybe five lines over (to
>>>> accommodate style differences) is okay, ten lines over is pushing
>>>> it.
>>>>
>>>> Incidentally, the problem statement isn't something I just made
>>>> up for the newsgroup, but is a simplified version of a utility
>>>> program that is used as part of a larger toolkit.  Of course I
>>>> knew this when I first posted the problem, so I had a more fixed
>>>> idea than the original problem statement presented clearly,
>>>> giving rise to the resulting confusions etc.  I'm sorry the
>>>> original problem wasn't stated more clearly, and I hope the
>>>> statements here clear up any remaining uncertainties.  Please,
>>>> please, ask about something if there is any doubt about what is
>>>> meant.
>>>
>>> One piece of implementation-defined behavior that you need to
>>> allow not handling is the case of \<sp><nl> (where <sp> is a
>>> space character and <nl> is a new-line character).
>>>
>>> The issue is that whether the following line is a continuation of
>>> this line (replacing the \) or not is implementation-defined, but
>>> the program is going to need to make a firm decision on this.
>>>
>>> This case is a special case in the standard because of the
>>> flexibility of handling spaces at the end of lines in text files in
>>> C.  It is allowed for implementations to add or remove trailing
>>> spaces at the end of lines, due to how some file systems deal with
>>> lines.
>>>
>>> Basically the Standard allows the character sequence /\<sp><nl>*
>>> to either be the beginning of a block comment or not based on the
>>> definition of the implementation.
>>
>> It's true that an implementation has some freedom with regard to
>> allowing, adding, or removing spaces at the end of a line.
>> However that possibility is irrelevant as far as the comment
>> removing program is concerned.  Whatever input it gets is the
>> input it gets, and that input is what determines what output gets
>> produced.  The earlier remark about implementation-defined
>> behavior is limited to /what would be/ implementation-defined
>> behavior /if the input read were viewed as a C program/.  What
>> happens with the implementation on which the comment removing
>> program is running falls in a different category.
>>
>> To say this another way, as far as the problem statement is
>> concerned, comment removing programs are free to assume that
>> spaces are faithfully preserved even at the ends of lines,
>> both when the source input is read and when any program
>> output might be read later.
>
> But, my understanding of the standard is that the compiler is free
> to interpret the file:  (note using @ at the end of lines as a
> visible space character)
>
>
>   a /\@
> * foo
> ;
> // */
>
> as either a line continuation or not, and this behavior is
> NOT tied to how it treats blanks at the end of lines.  While it
> allows the implementation to ignore the blanks if the definition of
> text files might add them, it also allows the implementation to
> ignore them even if it doesn't, or more perversely, to consider
> them if the definition might insert them.  Leaving the solution of
> the perverse case just up to 'quality of implementation'.

Let me see if I can help untangle the verbal spider web here.

The C standard discusses Input/Output in section 7.21, with
section 7.21.2 being about streams, and paragraph 2 of 7.21.2
being specifically about text streams.  That paragraph says in
part:

    A text stream is an ordered sequence of characters composed
    into lines, each line consisting of zero or more characters
    plus a terminating new-line character.  [...]  Whether space
    characters that are written out immediately before a new-line
    character appear when read in is implementation-defined.

A compiler doesn't have to read its input using a text stream,
although certainly it could.  However, to do that, the compiler
must have been built using a previously existing implementation
(PEI).  It is the PEI that decides how text stream I/O will be
handled for the compiler, not the compiler itself.  Even if the
PEI was built using the very same
source files, it is a distinct implementation, and whatever
decision was made is a done deal.  A compiler (and associated
libraries, etc) does get to decide what happens with programs
that it compiles, but /not/ what happens for its own input:  that
decision was made already by the implementation used to build the
compiler.  Whatever input it gets is what it gets, and that must
be treated as what the file contains.
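To see what decision a given implementation has made, a program can
write a line with a trailing space to a text stream and read it back.
The following is only a sketch:  the function name and the idea of
passing a scratch file path are mine, not anything from the standard,
and the function deletes the scratch file when it is done.

```c
#include <stdio.h>
#include <string.h>

/* Writes the line "x " (note the trailing space) to a text stream,
   reads it back, and reports whether the space survived the trip.
   'path' is a hypothetical scratch file name chosen by the caller.
   Returns 1 if the space was preserved, 0 if not, -1 on I/O failure. */
int trailing_space_survives(const char *path)
{
    char buf[16];
    int result = -1;
    FILE *fp = fopen(path, "w");        /* text mode, not "wb" */

    if (fp == NULL)
        return -1;
    if (fputs("x \n", fp) == EOF) {
        fclose(fp);
        remove(path);
        return -1;
    }
    if (fclose(fp) == EOF) {
        remove(path);
        return -1;
    }

    fp = fopen(path, "r");              /* text mode again */
    if (fp == NULL) {
        remove(path);
        return -1;
    }
    if (fgets(buf, sizeof buf, fp) != NULL)
        result = strcmp(buf, "x \n") == 0;
    fclose(fp);
    remove(path);
    return result;
}
```

Whichever answer comes back, it describes the implementation that
compiled this program, which is the point of the paragraph above.
Note that tmpfile() is not suitable here, since it opens its file in
binary mode.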

A de-commenting program is in a sense a compiler-like program, in
that it reads input meant to be C source code.  Such programs
should therefore be built using the PEI that was used to build
the compiler that will later compile that C source code.  If they
are not, then all bets are off:  we might see 16-bit characters
rather than 8-bit characters, or have program source be read as
EBCDIC rather than ASCII.  Of course the de-commenting program
and the intended compiler should be made to match, as otherwise
nonsensical results may ensue.

A compiler that gets '\\', ' ', '\n', in its input stream, and
treats that input as a line continuation, is not conforming.
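
The reason is translation phase 2 (5.1.1.2), which deletes a
backslash only when it is /immediately/ followed by a new-line
character.  Here is a minimal sketch of that rule, assuming trigraphs
are ignored (as in the original problem statement) and the whole
input is held in a string;  the name splice_lines is mine.

```c
#include <stddef.h>

/* Copies src to dst, deleting each backslash that is immediately
   followed by a new-line, together with that new-line.  A backslash
   followed by a space and then a new-line is copied unchanged.
   dst must have room for strlen(src) + 1 bytes.
   Returns the number of splices performed. */
size_t splice_lines(const char *src, char *dst)
{
    size_t splices = 0;

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;               /* drop the backslash/new-line pair */
            splices++;
        } else {
            *dst++ = *src++;        /* everything else passes through */
        }
    }
    *dst = '\0';
    return splices;
}
```

Given "a /\\\n* foo\n" the function joins the lines, so that / and *
become adjacent and open a comment.  Given "a /\\ \n* foo\n", which is
the example upthread with the @ written as a real space, nothing is
spliced and no comment begins.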