From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.lang.c
Subject: Re: Programming exercise/challenge
Date: Sun, 27 Dec 2020 06:29:24 -0800
Organization: A noiseless patient Spider
Lines: 151
Message-ID: <86h7o7wce3.fsf@linuxsc.com>
References: <86wnxwkyol.fsf@linuxsc.com> <C7mdnQmy4fi1dn_CnZ2dnUU7-cPNnZ2d@giganews.com> <86sg7uygsg.fsf@linuxsc.com> <lIadneM_lbYiiHXCnZ2dnUU7-YnNnZ2d@giganews.com>

kegs@provalid.com (Kent Dickey) writes:

> In article <86sg7uygsg.fsf@linuxsc.com>,
> Tim Rentsch  <tr.17687@z991.linuxsc.com> wrote:
>
>> kegs@provalid.com (Kent Dickey) writes:
>>
>>> Here's my solution.  It works on my simple test cases.  It should be
>>> tested more.  [...]
>>
>> that is, the output should be identical to the input.  The source
>> has no comments, so nothing should change.
>>
>> If you have trouble finding my posting clarifying the problem
>> statement, let me know and I can repost it.
>>
>> I should also say your program is an interesting take on how to
>> address the problem, which is nice to see.  After seeing your
>> code here, I'm curious to see how you might solve the problem as
>> first intended.
>
> First, the description gave no limit on input size.  Allowing
> a 32-bit system to handle files over 4GB requires
> character-at-a-time handling.

Yes, AFAICS, it does.

> The state machine required is tricky,

It is.

> and there are still cases where it's unclear what the right
> output should be.

Here is, I think, an unambiguous specification:

    Any characters outside of comments should be kept exactly
    as they appear in the source.

    Any characters that are part of a // comment should be removed,
    remembering that the final newline is not part of the
    comment and so should be left in the output[*].  The
    characters removed should include any "\\\n" pairs that
    occur after the first / and before the final newline.

    Any characters that are part of a /*...*/ comment should be
    removed and replaced by a single space.  The characters removed
    should include any "\\\n" pairs that occur between the
    initial / and the final /.

    Things like unterminated strings or character literals
    can be treated as ending at the newline or EOF.

[*] Under the suggested rule, a line like this

    #define BACK_SLANT \// yes, a backslant!

would result in a line ending with <backslant><newline>.  Unfortunately
this might subsequently be treated as a line continuation, which
obviously is not what is intended.  Even so, to keep things simple,
just take out // comments and replace them with nothing.
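
For illustration only, here is a minimal sketch of the rules above.  It
is not anyone's submitted code.  The names lchar, next_lchar, put_raw and
strip_comments are made up for this sketch.  It assumes one character of
ungetc() pushback (which the C standard guarantees), ignores trigraphs
and raw strings, and chooses to emit the single space even for a /*
comment that is still open at EOF, a case the rules leave unspecified.

```c
#include <stdio.h>

/* A "logical" character: the character itself plus the number of
   backslash-newline pairs that immediately preceded it in the input. */
struct lchar {
    int c;                       /* the character, or EOF */
    unsigned long long splices;  /* count of preceding "\\\n" pairs */
};

/* Read one logical character; splices are counted, never stored. */
static struct lchar next_lchar(FILE *in)
{
    struct lchar lc = { EOF, 0 };
    for (;;) {
        int c = getc(in);
        if (c == '\\') {
            int d = getc(in);
            if (d == '\n') { lc.splices++; continue; }
            if (d != EOF) ungetc(d, in);
        }
        lc.c = c;
        return lc;
    }
}

/* Write a logical character back exactly as it appeared in the input. */
static void put_raw(struct lchar lc, FILE *out)
{
    for (; lc.splices > 0; lc.splices--) fputs("\\\n", out);
    if (lc.c != EOF) putc(lc.c, out);
}

void strip_comments(FILE *in, FILE *out)
{
    struct lchar lc = next_lchar(in);
    for (;;) {
        if (lc.c == EOF) { put_raw(lc, out); return; }
        if (lc.c == '"' || lc.c == '\'') {
            /* String or character literal: copy through the closing
               quote, or stop at an unescaped newline or EOF. */
            int quote = lc.c;
            put_raw(lc, out);
            for (;;) {
                lc = next_lchar(in);
                put_raw(lc, out);
                if (lc.c == '\\') {          /* escape: copy next char */
                    lc = next_lchar(in);
                    put_raw(lc, out);
                    if (lc.c == '\n' || lc.c == EOF) break;
                    continue;
                }
                if (lc.c == quote || lc.c == '\n' || lc.c == EOF) break;
            }
            if (lc.c == EOF) return;
            lc = next_lchar(in);
        } else if (lc.c == '/') {
            struct lchar nx;
            lc.c = EOF;
            put_raw(lc, out);     /* splices before the '/' are kept */
            nx = next_lchar(in);  /* splices after it ride along in nx */
            if (nx.c == '/') {
                /* Line comment: drop everything up to the newline. */
                do nx = next_lchar(in); while (nx.c != '\n' && nx.c != EOF);
                if (nx.c == EOF) return;
                putc('\n', out);
                lc = next_lchar(in);
            } else if (nx.c == '*') {
                /* Block comment: drop through the closing star-slash. */
                int prev_star = 0;
                for (;;) {
                    nx = next_lchar(in);
                    if (nx.c == EOF || (nx.c == '/' && prev_star)) break;
                    prev_star = (nx.c == '*');
                }
                putc(' ', out);
                if (nx.c == EOF) return;
                lc = next_lchar(in);
            } else {
                putc('/', out);   /* not a comment after all */
                lc = nx;          /* reprocess what followed the '/' */
            }
        } else {
            put_raw(lc, out);
            lc = next_lchar(in);
        }
    }
}
```

Called as strip_comments(stdin, stdout) from main() this acts as a
filter.  Memory use is constant no matter how long the input or how many
line continuations sit between the two slashes, since only a count of
them is kept.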

> Keeping escaped newlines around is a pain.

It looks like you are trying to keep line counts the same, which
is a nice property.  Unfortunately it complicates the problem,
so you should feel free to follow the simpler rules given above.

> The case I found hardest to handle is:
>
> /\
> \
> [Repeat escaped newlines for as long as you want]...
> \
> / This is a comment
> Last line
>
> And then replacing the final line with something NOT starting
> with a '/'.  It's also unclear what removing the comment should
> be in this case.  The way I wrote it, the above would become as
> output:
>  \
> \
> [ repeat....]
> \
>
> Last line
>
> Basically, I treat the escaped newlines between '/' and '/' as being
> BEFORE the comment, and so they make the output.  Removing them
> takes one line of code to reset tokptr->escaped_newlines=0 when a
> comment starts.

Okay, I may try making that change in your new source and see
what results.

> I don't know what the right answer here should be.

What you're doing is fine.  What I consider my current reference
implementation simply removes a // comment altogether, including
any embedded line continuations.  Either one is okay.

> To parse this a character-at-a-time seems to require counting the
> escaped newlines, and then when we determine what the next valid
> character is, outputting that count.  So that's what I did.  This
> is now a "tricky" algorithm.

Certainly the problem is not trivial.

> I also added simple error checking for:  quoted strings that seem to
> be open when a newline is seen;  comments still open when the file
> ends.

Some test cases you might want to try (remember to take out
leading indentation):

    "\\
    t"

    '\\
    t'

Both of these are valid input to the C pre-processor.  The first
is a string with one character (plus a final null character), and
the second is a one-character character literal.  The character
is \t, or TAB, in each of the two cases.
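
One way to check this claim is to hand the same two literals to a
compiler.  The sketch below is only an illustration, and the names
spliced_string and spliced_char are invented for it.  The
backslash-newline pair is deleted in translation phase 2, before the
literals are tokenized, so each one ends up holding the single escape
sequence \t.  Note there must be no trailing white space after the
backslash that ends each first line.

```c
/* Each literal is split across two lines by a backslash-newline pair.
   After line splicing the source text reads "\t" and '\t'. */
static const char spliced_string[] = "\\
t";

static const char spliced_char = '\\
t';

/* Size of the string including its terminating null character. */
enum { SPLICED_STRING_SIZE = sizeof spliced_string };
```

Here SPLICED_STRING_SIZE is 2 (one TAB plus the terminating null), and
spliced_char compares equal to '\t'.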

> As for handling strings, I tried to do what "cc -E" does:  quoted
> strings can START with an escaped quote character, but can only
> end with a non-escaped quote character.  So:
>
> \' /* This is not a comment */ '
>
> Is a valid line where there is no comment to be removed since it's
> inside a string.

In the C pre-processor, a \ by itself is a valid pre-processor
token.  So what I think is happening here is that the pre-processor
sees a one-character token (\), followed by an ordinary character
literal (with too many characters, but still a valid token in the
C pre-processor).

> Escaped quotes are handled by adding 0x100 to the
> character value, so "\\'" is not "'" and won't end a quoted
> string.

Did you mean "\"" or '\''?  If not then I am confused.

> It's gotten quite long.  [...]

I am adding it to the set of submissions.