From: Tim Rentsch
Newsgroups: comp.lang.c
Subject: Re: Programming exercise/challenge
Date: Sun, 27 Dec 2020 06:29:24 -0800
Organization: A noiseless patient Spider
Lines: 151
Message-ID: <86h7o7wce3.fsf@linuxsc.com>
References: <86wnxwkyol.fsf@linuxsc.com> <86sg7uygsg.fsf@linuxsc.com>
kegs@provalid.com (Kent Dickey) writes:
> In article <86sg7uygsg.fsf@linuxsc.com>,
> Tim Rentsch wrote:
>
>> kegs@provalid.com (Kent Dickey) writes:
>>
>>> Here's my solution. It works on my simple test cases. It should be
>>> tested more. [...]
>>
>> that is, the output should be identical to the input. The source
>> has no comments, so nothing should change.
>>
>> If you have trouble finding my posting clarifying the problem
>> statement, let me know and I can repost it.
>>
>> I should also say your program is an interesting take on how to
>> address the problem, which is nice to see. After seeing your
>> code here, I'm curious to see how you might solve the problem as
>> first intended.
>
> First, the description gave no limit on input size. Allowing
> a 32-bit system to handle files over 4GB requires
> character-at-a-time handling.
Yes, AFAICS, it does.
> The state machine required is tricky,
It is.
> and there are still cases where it's unclear what the right
> output should be.
Here is, I think, an unambiguous specification:

   Any characters outside of comments should be kept exactly
   as they appear in the source.

   Any characters that are part of a // comment should be removed,
   remembering that the final newline is not part of the
   comment and so should be left in the output[*]. The
   characters removed should include any "\\\n" pairs that
   occur after the first / and before the final newline.

   Any characters that are part of a /*...*/ comment should be
   removed and replaced by a single space. The characters removed
   should include any "\\\n" pairs that occur between the
   initial / and the final /.

   Things like unterminated strings or character literals
   can be treated as ending at the newline or EOF.

[*] Under the suggested rule, a line like this

   #define BACK_SLANT \// yes, a backslant!

would result in a line ending with a backslant. Unfortunately
this might subsequently be treated as a line continuation, which
obviously is not what is intended. Even so, to keep things simple,
simply take out // comments and replace them with nothing.
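
The rules above can be sketched as a character-at-a-time state machine.
What follows is only an illustration under stated assumptions, not the
reference implementation and not any submission from this thread. All
names (struct stripper, strip_feed, strip_finish, strip_string) are
made up for the example. The idea is to count "\\\n" pairs as they
arrive and decide what to do with them once the next ordinary
character is known: write them out in code and in literals, drop them
once a comment has started.

```c
#include <assert.h>
#include <stddef.h>

enum state { CODE, SLASH, LINE_COMMENT, BLOCK_COMMENT, BLOCK_STAR,
             STR, STR_ESC, CHR, CHR_ESC };

struct stripper {                 /* hypothetical type, for illustration */
    enum state st;
    int held_backslash;           /* saw '\\'; waiting to see if '\n' follows */
    unsigned long splices;        /* "\\\n" pairs since the last logical char */
    void (*emit)(int c, void *ctx);
    void *ctx;
};

static void put_splices(struct stripper *s, unsigned long n)
{
    while (n-- > 0) { s->emit('\\', s->ctx); s->emit('\n', s->ctx); }
}

/* Handle one logical character c that was preceded by n "\\\n" pairs. */
static void step(struct stripper *s, int c, unsigned long n)
{
    switch (s->st) {
    case SLASH:                   /* a '/' is being held back */
        if (c == '/') { s->st = LINE_COMMENT; return; }
        if (c == '*') { s->emit(' ', s->ctx); s->st = BLOCK_COMMENT; return; }
        s->emit('/', s->ctx);     /* the '/' was not a comment opener */
        s->st = CODE;
        /* fall through */
    case CODE:
        put_splices(s, n);
        if (c == '/') { s->st = SLASH; return; }
        if (c == '"') s->st = STR;
        else if (c == '\'') s->st = CHR;
        s->emit(c, s->ctx);
        return;
    case LINE_COMMENT:            /* the final newline is kept */
        if (c == '\n') { s->emit(c, s->ctx); s->st = CODE; }
        return;
    case BLOCK_COMMENT:
    case BLOCK_STAR:
        if (s->st == BLOCK_STAR && c == '/') s->st = CODE;
        else s->st = (c == '*') ? BLOCK_STAR : BLOCK_COMMENT;
        return;
    default: {                    /* inside a string or character literal */
        int quote = (s->st == STR || s->st == STR_ESC) ? '"' : '\'';
        int escaped = (s->st == STR_ESC || s->st == CHR_ESC);
        put_splices(s, n);
        s->emit(c, s->ctx);
        if (c == '\n' || (!escaped && c == quote)) s->st = CODE;
        else if (!escaped && c == '\\') s->st = (quote == '"') ? STR_ESC : CHR_ESC;
        else s->st = (quote == '"') ? STR : CHR;
        return;
    }
    }
}

/* Feed one raw input character. */
void strip_feed(struct stripper *s, int c)
{
    if (s->held_backslash) {
        s->held_backslash = 0;
        if (c == '\n') { s->splices++; return; }
        step(s, '\\', s->splices);          /* an ordinary backslash */
        s->splices = 0;
    }
    if (c == '\\') { s->held_backslash = 1; return; }
    step(s, c, s->splices);
    s->splices = 0;
}

/* Call once at EOF to flush anything still pending. */
void strip_finish(struct stripper *s)
{
    if (s->held_backslash) {
        s->held_backslash = 0;
        step(s, '\\', s->splices);
        s->splices = 0;
    }
    if (s->st == SLASH) { s->emit('/', s->ctx); s->st = CODE; }
    if (s->st != LINE_COMMENT && s->st != BLOCK_COMMENT && s->st != BLOCK_STAR)
        put_splices(s, s->splices);
    s->splices = 0;
}

/* Convenience wrapper for small in-memory inputs; output is truncated
   if it does not fit in cap bytes (including the terminating null). */
struct buf { char *p; size_t len, cap; };

static void buf_emit(int c, void *ctx)
{
    struct buf *b = ctx;
    if (b->len + 1 < b->cap) b->p[b->len++] = (char)c;
}

size_t strip_string(const char *in, char *out, size_t cap)
{
    struct buf b = { out, 0, cap };
    struct stripper s = { CODE, 0, 0, buf_emit, &b };
    assert(cap > 0);
    while (*in) strip_feed(&s, (unsigned char)*in++);
    strip_finish(&s);
    out[b.len] = '\0';
    return b.len;
}
```

For a filter over standard input, an emit callback that calls putchar
and a loop passing each getchar() result to strip_feed, followed by
strip_finish at EOF, should be enough. Only a counter and a few flags
are kept, so input size is not limited by memory.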
> Keeping escaped newlines around is a pain.
It looks like you are trying to keep line counts the same, which
is a nice property. Unfortunately it complicates the problem,
so you should feel free to follow the simpler rules given above.
> The case I found hardest to handle is:
>
> /\
> \
> [Repeat escaped newlines for as long as you want]...
> \
> / This is a comment
> Last line
>
> And then replacing the final line with something NOT starting
> with a '/'. It's also unclear what removing the comment should
> be in this case. The way I wrote it, the above would become as
> output:
> \
> \
> [ repeat.... ]
> \
>
> Last line
>
> Basically, I treat the escaped newlines between '/' and '/' as being
> BEFORE the comment, and so they make the output. Removing them
> takes one line of code to reset tokptr->escaped_newlines=0 when a
> comment starts.
Okay, I may try making that change in your new source and see
what results.
> I don't know what the right answer here should be.
What you're doing is fine. What I consider my current reference
implementation simply removes // comments altogether, including
any embedded line continuations. Either one is okay.
> To parse this a character-at-a-time seems to require counting the
> escaped newlines, and then when we determine what the next valid
> character is, outputting that count. So that's what I did. This
> is now a "tricky" algorithm.
Certainly the problem is not trivial.
> I also added simple error checking for: quotes that seem to
> be open when a newline is seen; comments still open when the file
> ends.
Some test cases you might want to try (remember to take out
leading indentation):

   "\\
   t"

   '\\
   t'

Both of these are valid input to the C pre-processor. The first
is a string with one character (plus a final null character), and
the second is a one-character character literal. The character
is \t, or TAB, in each of the two cases.
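
To see why, recall that the backslash-newline pairs are deleted before
tokens are formed. The short sketch below only illustrates that
splicing step on an in-memory string; splice_lines is a name made up
for this example, and the function is not part of any submission.

```c
#include <stddef.h>

/* Delete every backslash that is immediately followed by a newline,
   together with that newline.  out needs room for strlen(in) + 1
   bytes.  Returns the length of the result. */
size_t splice_lines(const char *in, char *out)
{
    size_t n = 0;
    while (*in) {
        if (in[0] == '\\' && in[1] == '\n') { in += 2; continue; }
        out[n++] = *in++;
    }
    out[n] = '\0';
    return n;
}
```

Applied to the first test case (quote, backslash, backslash, newline,
t, quote) the result is the four characters quote, backslash, t,
quote, which is the string literal "\t". A comment stripper that
treats the first backslash as escaping the second, and then sees a
bare newline inside the string, will get this case wrong.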
> As for handling strings, I tried to do what "cc -E" does: quoted
> strings can START with an escaped quote character, but can only
> end with a non-escaped quote character. So:
>
> \' /* This is not a comment */ '
>
> Is a valid line where there is no comment to be removed since it's
> inside a string.
In the C pre-processor, a \ by itself is a valid pre-processor
token. So what I think is happening here is that the pre-processor
sees a one-character token (\), followed by an ordinary character
literal (with too many characters, but still a valid token in the
C pre-processor).
> Escaped quotes are handled by adding 0x100 to the
> character value, so "\\'" is not "'" and won't end a quoted
> string.
Did you mean "\"" or '\''? If not then I am confused.
> It's gotten quite long. [...]
I am adding it to the set of submissions.