From: Tim Rentsch
Newsgroups: comp.lang.c
Subject: Re: Programming exercise/challenge
Date: Sun, 27 Dec 2020 06:29:24 -0800
Message-ID: <86h7o7wce3.fsf@linuxsc.com>
References: <86wnxwkyol.fsf@linuxsc.com> <86sg7uygsg.fsf@linuxsc.com>

kegs@provalid.com (Kent Dickey) writes:

> In article <86sg7uygsg.fsf@linuxsc.com>,
> Tim Rentsch wrote:
>
>> kegs@provalid.com (Kent Dickey) writes:
>>
>>> Here's my solution. It works on my simple test cases. It should be
>>> tested more. [...]
>>
>> that is, the output should be identical to the input. The source
>> has no comments, so nothing should change.
>>
>> If you have trouble finding my posting clarifying the problem
>> statement, let me know and I can repost it.
>>
>> I should also say your program is an interesting take on how to
>> address the problem, which is nice to see. After seeing your
>> code here, I'm curious to see how you might solve the problem as
>> first intended.
>
> First, the description gave no limit on input size. In order
> to allow a 32-bit system to handle files over 4GB requires
> character-at-a-time-handling.

Yes, AFAICS, it does.

> The state machine required is tricky,

It is.

> and there are still cases where it's unclear what the right
> output should be.

Here is I think an unambiguous specification:

Any characters outside of comments should be kept exactly as
they appear in the source.

Any characters part of a // comment should be removed, remembering
that the final newline is not part of the comment and so should be
left in the output[*]. The characters removed should include any
"\\\n" pairs that occur after the first / and before the final
newline.

Any characters part of a /*...*/ comment should be removed and
replaced by a single space. The characters removed should include
any "\\\n" pairs that occur between the initial / and the final /.

Things like unterminated strings or character literals can be
treated as ending at the newline or EOF.

[*] Under the suggested rule, a line like this

    #define BACK_SLANT \// yes, a backslant!

would result in a line with a trailing \. Unfortunately this might
subsequently be treated as a line continuation, which obviously is
not what is intended. Even so, to keep things simple, simply take
out // comments and replace them with nothing.

> Keeping escaped newlines around is a pain.

It looks like you are trying to keep line counts the same, which
is a nice property. Unfortunately it complicates the problem, so
you should feel free to follow the simpler rules given above.

> The case I found hardest to handle is:
>
> /\
> \
> [Repeat escaped newlines for as long as you want]...
> \
> / This is a comment
> Last line
>
> And then replacing the final line with something NOT starting
> with a '/'. It's also unclear what removing the comment should
> be in this case. The way I wrote it, the above would become as
> output:
> \
> \
> [ repeat....[
> \
>
> Last line
>
> Basically, I treat the escaped newlines between '/' and '/' as being
> BEFORE the comment, and so they make the output. Removing them
> takes one line of code to reset tokptr->escaped_newlines=0 when a
> comment starts.

Okay, I may try making that change in your new source and see
what results.

> I don't know what the right answer here should be.

What you're doing is fine.
What I consider my current reference implementation simply removes
a // comment altogether, including any embedded line continuations.
Either one is okay.

> To parse this a character-at-a-time seems to require counting the
> escaped newlines, and then when we determine what the next valid
> character is, outputting that count. So that's what I did. This
> is now a "tricky" algorithm.

Certainly the problem is not trivial.

> I also added simple error checking for: quotes lines that seem to
> be open when a newline is seen; comments still open when the file
> ends.

Some test cases you might want to try (remember to take out
leading indentation):

    "\\
    t"

    '\\
    t'

Both of these are valid input to the C pre-processor. The first
is a string with one character (plus a final null character), and
the second is a one-character character literal. The character
is \t, or TAB, in each of the two cases.

> As for handling strings, I tried to do what "cc -E" does: quoted
> strings can START with an escaped quote character, but can only
> end with a non-escaped quote character. So:
>
> \' /* This is not a comment */ '
>
> Is a valid line where there is no comment to be removed since it's
> inside a string.

In the C pre-processor, a \ by itself is a valid pre-processor
token. So what I think is happening here is that the pre-processor
sees a one-character token (\), followed by an ordinary character
literal (with too many characters, but still a valid token in the
C pre-processor).

> Handling escaped quotes is handled by adding 0x100 to the
> character value, so "\\'" is not "'" and won't end a quoted
> string.

Did you mean "\"" or '\''? If not then I am confused.

> It's gotten quite long. [...]

I am adding it to the set of submissions.
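
For illustration, the simpler rules given above can be sketched in
C roughly as follows. This is only a sketch under stated
assumptions, not either poster's actual program: the names
strip_comments, skip_splices, and emit_splices are made up here,
and it works on an in-memory string for brevity rather than a
character at a time from a file. A streaming version would have to
replace the pointer lookahead with a small amount of saved state,
such as the escaped-newline count described in the quoted text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Skip any run of backslash-newline pairs at *p; return how many. */
static size_t skip_splices(const char **p)
{
    size_t n = 0;
    while ((*p)[0] == '\\' && (*p)[1] == '\n') {
        *p += 2;
        n++;
    }
    return n;
}

/* Write n backslash-newline pairs to out; return the new end. */
static char *emit_splices(char *out, size_t n)
{
    while (n-- > 0) {
        *out++ = '\\';
        *out++ = '\n';
    }
    return out;
}

/* Hypothetical helper: copy src to dst with comments removed.
 * dst needs room for strlen(src) + 1 bytes; output never grows. */
void strip_comments(const char *src, char *dst)
{
    const char *p = src;
    char *out = dst;

    assert(src != NULL && dst != NULL);
    while (*p != '\0') {
        char c = *p++;
        if (c == '"' || c == '\'') {
            /* Literal: copy verbatim, keeping line splices. */
            char quote = c;
            *out++ = c;
            for (;;) {
                out = emit_splices(out, skip_splices(&p));
                if (*p == '\0' || *p == '\n')
                    break;          /* unterminated: ends at newline or EOF */
                c = *p++;
                *out++ = c;
                if (c == quote)
                    break;
                if (c == '\\') {    /* escape: copy the next logical char */
                    out = emit_splices(out, skip_splices(&p));
                    if (*p != '\0' && *p != '\n')
                        *out++ = *p++;
                }
            }
        } else if (c == '/') {
            size_t n = skip_splices(&p);  /* splices between '/' and next char */
            if (*p == '/') {
                /* Line comment: drop everything up to, not including, newline. */
                p++;
                for (;;) {
                    skip_splices(&p);
                    if (*p == '\0' || *p == '\n')
                        break;
                    p++;
                }
            } else if (*p == '*') {
                /* Block comment: drop it all and leave one space. */
                p++;
                while (*p != '\0') {
                    if (*p++ == '*') {
                        skip_splices(&p);
                        if (*p == '/') {
                            p++;
                            break;
                        }
                    }
                }
                *out++ = ' ';
            } else {
                /* Not a comment: the '/' and its splices are kept exactly. */
                *out++ = '/';
                out = emit_splices(out, n);
            }
        } else {
            *out++ = c;
        }
    }
    *out = '\0';
}
```

A caller would pass a buffer at least as large as the input, for
example char buf[128]; strip_comments("a /* b */ c\n", buf);.
In this sketch the escaped newlines between the two slashes of a
// comment are dropped along with the comment, which is one of the
two behaviors described above as acceptable.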