Path: csiph.com!news.mixmin.net!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.lang.c
Subject: Re: Programming exercise/challenge
Date: Wed, 27 Jan 2021 13:43:08 -0800
Organization: A noiseless patient Spider
Lines: 357
Message-ID: <86tur2jbtf.fsf@linuxsc.com>
References: <86wnxwkyol.fsf@linuxsc.com> <e138c342-f48a-4f08-97e9-7f74e185dd12n@googlegroups.com> <rt2qes$gpf$1@dont-email.me> <78fdc5df-a4e2-40f0-ae4e-48741d331faen@googlegroups.com> <2DaJH.72833$kZ1.12267@fx37.iad> <86d52273-f9eb-45a3-8a8d-5a106679d1c8n@googlegroups.com> <zeiJH.13295$5s2.12914@fx18.iad> <97ebd421-8f0f-4621-bdb9-96d655ba9d9an@googlegroups.com> <IknJH.13301$5s2.3361@fx18.iad> <de2dba77-e273-484f-aeef-e01ffd374b88n@googlegroups.com> <f74bd34d-0c8c-4d94-ba83-1cddf210084en@googlegroups.com> <f57a787a-4790-4e00-9662-bcf35ef69e05n@googlegroups.com> <b71fe02c-3211-433a-a9f7-c6a67686ff1dn@googlegroups.com> <86czybpljj.fsf@linuxsc.com> <1ebc19c3-093d-4bc0-b680-aedbb9576eb5n@googlegroups.com> <868s8uozxo.fsf@linuxsc.com> <6a2e3784-1cb3-4d6a-99c2-6ac650dd6fben@googlegroups.com> <864kjbo2ql.fsf@linuxsc.com> <44c73f50-33af-43bb-afa2-a3b0244c7160n@googlegroups.com> <865z3nm8f9.fsf@linuxsc.com> <5049cc01-a549-4022-8f4c-deda87ab20c9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: reader02.eternal-september.org; posting-host="a3a7a3546220eb2a4a7cfe4d4683198c"; logging-data="21198"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX182L8m26lYYwZVcXbPl04P4hBNTBpb7Q5k="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:8v/xLEoTnmL7KGp154xhzU8f4CA= sha1:ldvwuClNfX/1uOORufVuvLD2tD4=
Xref: csiph.com comp.lang.c:158668

Dave Dunfield <dave.dunfield@gmail.com> writes:

[..quoted text rearranged for convenience of responding..]

> Please do let me know of any other/continued problems you find.
> I'm kindof interested in making this work in the best possible
> way.

I have some comments below.  Probably some of them will be about
layout or other style issues, but most should be about program
structure.  As a warning, my comments might be rather harshly
worded in some cases, but please don't take that personally.
The comments are meant to be about the code, not about you.

> If you want, I can probably easily modify it to output spaces
> instead of characters in comment (and preserve all [\<newline>]
> thereby making the output exactly the same size/shape as the
> original.  I know this was mentioned in recent messages, please let
> me know if you would like me to look into this.

For now I think just stick to simply removing // comments (but not
removing the ending newlines for those), and replacing /*...*/
comments by a single space.

> /*
>  * whole line comments could easily be absent and should not
>  * count against to 25 lime function body limit.
>  *

I suggest putting any whole line comments either just before or
just after the function they relate to.  A line of code can be
marked with a //[n] indicator if needed to identify where in
the function body a particular whole line comments is meant to
apply.


> unsigned short      // at least 16 bits to insure '& 0xFF00' works
>     Cp,             // Pending input character
>     C0,             // Last character read
>     Enl;            // EscapeNewLine seen
> unsigned char
>     C1,             // Previous character read
>     Imode;          // Input mode

You have the germ of a good idea here (and carried forward in
rdchar1(), discussed below), but the names chosen and the types
used get in the way.  There is no reason to use either unsigned
short or unsigned char:  characters should be of type 'int', the
counter should be just 'unsigned', and Imode shouldn't be a
character at all, but some enum type.  The names (ignoring the
style of initial capital letter) used for the characters go
pretty strongly against the mental grain.  I suggest this (I am
using all lower case but please don't focus on that):

    int c0, c1, c2;    // { last, current, next } character
    int s_type;        // to remember '\'' or '"' for strings
    unsigned n_lcs;    // number of line continuations
    // state variable to be declared later


> // Line continuation sequences occurring within comments are
> // removed. But what about comments starting with [/\<newline>*]
> // [/<newline>/] ?

Presumably you meant [/\<newline>/] at the end there.

> // I decided to remove them,

That is the right thing to do

> // but if you prefer those [\<newline>]s to
> // remain, change all [/*1] in this file to [//1], [...]

It would be nicer if this sort of conditional behavior were done
using regular code and not using either comments or the C
pre-processor.


> // Read a single character, ignoring C sequence: \<nl>
> // Added to simulates line splicing performed by "official" CPP.
> unsigned rdchar1(void)
> {
>     C1 = C0;
>     for(;;) {
>         if(C0 = Cp)
>             Cp = 0;
>         else
>             C0 = getc(stdin);
>         if(C0 != '\\')          // Not an escape
>             return C0;
>         if((C0 = getc(stdin)) != '\n') {    // Not Escaped Newline
>             Cp = C0;
>             return C0 = '\\'; }
> /*1*/   ++Enl;              // So we can re-insert it in output
> //2*/   if(!Imode)
> //2*/       fputs("\\\n", stdout);
>         }
> }

First the code has a hidden brace (just before the /*1*/ line).
Please don't do that.

Looking at the semantics, I can't help the feeling that the code
is both working too hard and is too hard to understand.  Using
the declarations I wrote above, it can be written more concisely
thus (also I changed the name - 'rdchar1' is horrible):

    int
    read_next(){
        c0 = c1,   n_lcs = 0,   c1 = c2,   c2 = getchar();

        while(  c1 == '\\'  &&  c2 == '\n'  ){
            n_lcs++,  c1 = getchar(),  c2 = getchar();
        }

        return  c1;
    }

Notice the code is simpler because the invariant is easier to
write.  The values in c0, c1, and c2 are consistent at all times:
last, current, and next.  The number of LCs (in n_lcs) goes with
the character in c1 so it is initialized to zero at the start of
the routine.  Do you see how this writing corresponds (even if
not exactly) to your rdchar1() function?


> // Read one character, excluding content of comments
> unsigned char rdchar2(void)
> {
>     for(;;) {
> // Assumes 8-bit char set, not including 0 and EOF is >255
>         if(rdchar1() & 0xFF00)
>             return 0;
> // changing this to multiple 'if's could eliminate the 'switch' line
>         switch(Imode) {
>         case '*' :      // Inside '/*' comment
>             if((C1 == '*') && (C0 == '/')) {    // Ending '/*'
>                 Imode = Enl = 0;
>                 C0=' '; }   // Replace /*...*/ with single space
>             continue;
>         case '/' :      // Inside '//' comment
>             if(C1 == '\n') {
>                 Imode = Enl = 0;
>                 break; }
>             continue;
>         case 0  :       // No special mode
> // could be considered as two lines in one, although it's a construct I
> // sometimes use if a single 'switch' is the only thing within an 'if'
>             if(C1 == '/') switch(C0) {
>                 case '*':       // Starting '/*' comment
>                 case '/':       // Starting '//' comment
>                     Imode = C0;
>                     continue; } }
>         return C1; }
> }
>

Style comments:  the layout is horrible.  The name rdchr2() doesn't
tell me anything.  More hidden braces.  Combinations of break's and
continue's (break's go with the switch(), continue's go with the for()
and are there to pass over the return statement at the end), along
with the if()/switch() -- all of these make the code very hard to
decipher.

Re-writing to clean up the layout and better show the control flow:

    // Read one character, excluding content of comments
    unsigned char
    read_skipping_comment_characters( void )
    {
        for(;;) {
            if(rdchar1() & 0xFF00)
                return 0;
    
            switch(Imode) {
              case '*' :      // Inside '/*' comment
                if((C1 == '*') && (C0 == '/')) {    // Ending '/*'
                    Imode = Enl = 0;
                    C0=' ';    // Replace /*...*/ with single space
                }
                break;

              case '/' :      // Inside '//' comment
                if(C1 == '\n') {
                    Imode = Enl = 0;
                }
                break;

              case 0  :       // No special mode
                if(C1 == '/'  &&  (C0 == '*' || C0 == '/')  ){
                    Imode = C0;
                    break;
                 }
                 /* fallthrough */

               default:
                 return C1;
            }
        }
    }

Now at least we can see what's going on.  The code is kind of
strange though.  Broadly speaking there are two ways to approach
this problem.  One, with a state machine.  Two, using a lexing or
token scanning type algorithm.  The function here shows that both
are going on - the function as a whole is kind of lexing, but
internally it uses a state machine.  That's more complicated than it
needs to be.

> // Read characters, and hande '\\' escape sequences
> // Also enter quote input mode if quote character is seen
> void handle_esc(void)
> {
>     switch(rdchar2()) {
>     case '\\':  // Escape
>         putc(C1, stdout);
>         rdchar2();
>     case 0 :    // EOF
>         return;
>     case '"':
>     case'\'':
>         if(Imode == C1)
>             Imode = 0;
>         else if(!Imode) {
>             Imode = C1; } }
> }
>
> int main(void)
> {
>     rdchar2();  // First call to rdchar2 will return 0;
>     do {
>         handle_esc();
>         [... output <bs><nl> per Enl ...]
>         putc(C1, stdout);
>     } while(!(C0 & 0xFF00));    // Till EOF has been read
> }

At this point I am lost.  The code seems very confused.  Normally I
would go through and try to fix things up to the point where it will
work, but I kind of gave up here;  I thought would be easier to
write a whole new program from scratch than try to fix this one.

Note by the way that the LCs are output only here, but there is
more than one place that rdchar2() is called, which means some
LCs are potentially lost (ie, not transmitted to the output, or
not transmitted in the right places).

Having shown stuff done for the bottom of the newly written
program (with the character reading routine read_next()), let's
look at the top of the program, that is, mainly, main():

    #include <stdio.h>

    typedef enum {
	W_zero,     // ground state
	W_slant,    // a '/' was seen
	W_c90,      // in a C90-style comment
	W_c99,      // in a C99-style comment
	W_string,   // in a string or character constant
	W_done,     // premature ending - error detected
    } What;

    static  What    (*go[ W_done ])( int );
    static  int     read_next( void );
    static  void    out_c1( void );
    static  void    out_lcs( unsigned );

    static int      c0, c1, c2, s_type;
    static unsigned n_lcs;

    int
    main(){
	What w = W_zero;
	c2 = getchar();

	while(  w != W_done  &&  read_next() != EOF  ){
            w = go[w]( c1 );
        }

	return  0;
    }

The variable 'w' has the state of the FSM.  The loop calls one of
the "go" routines, depending on what state the top-level loop is
in.  Note that 'w' is local and set only here, and read_next() is
called in only one place, in the test of the while() loop.  The
state of the program is thus quite simple.  Because there is only
one state for both string literals and character constants, there
needs to be some global state for that, and that is the variable
's_type' (and some of the "go" routines needs to set or test
that).


Here are skeletons for the "go" routines:


    What
    go_zero( int c ){
	return  W_done;  // dummy to be replaced
    }

    What
    go_slant( int c ){
	return  W_done;  // dummy to be replaced
    }

    What
    go_c90( int c ){
	return  W_done;  // dummy to be replaced
    }

    What
    go_c99( int c ){
	return  W_done;  // dummy to be replaced
    }

    What
    go_string( int c ){
	return  W_done;  // dummy to be replaced
    }

    static  What (*go[ W_done ])( int ) = {
	go_zero,
	go_slant,
	go_c90,
	go_c99,
	go_string
    };

The "go" routines can be written in a small number of lines each
(no more than 40 or 50 lines altogether).


It's convenient for there to be two output routines, one for the
character in c1 together with any preceeding line continuation
sequence, and for for just an LC sequence without any character
attached.  

    void
    out_c1(){
        out_lcs( n_lcs ),  putchar( c1 );
    }

    void
    out_lcs( unsigned n ){
	while(  n-- > 0  )  fputs( "\\\n", stdout );
    }

I don't know how much sense all this makes to you.  I'm trying to
provide what constructive comments I can, but unfortunately am
hampered by finding the submitted program code confusing.  Maybe
this will give you a new idea for how to approach the problem, or
an idea for how to simplify your existing code.  If you're
feeling adventurous you could try filling in the "go" routines
and see how that goes (no pun intended).

Do you have questions, or any other reactions?