Path: csiph.com!news.mixmin.net!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail From: Tim Rentsch Newsgroups: comp.lang.c Subject: Re: Programming exercise/challenge Date: Wed, 27 Jan 2021 13:43:08 -0800 Organization: A noiseless patient Spider Lines: 357 Message-ID: <86tur2jbtf.fsf@linuxsc.com> References: <86wnxwkyol.fsf@linuxsc.com> <78fdc5df-a4e2-40f0-ae4e-48741d331faen@googlegroups.com> <2DaJH.72833$kZ1.12267@fx37.iad> <86d52273-f9eb-45a3-8a8d-5a106679d1c8n@googlegroups.com> <97ebd421-8f0f-4621-bdb9-96d655ba9d9an@googlegroups.com> <86czybpljj.fsf@linuxsc.com> <1ebc19c3-093d-4bc0-b680-aedbb9576eb5n@googlegroups.com> <868s8uozxo.fsf@linuxsc.com> <6a2e3784-1cb3-4d6a-99c2-6ac650dd6fben@googlegroups.com> <864kjbo2ql.fsf@linuxsc.com> <44c73f50-33af-43bb-afa2-a3b0244c7160n@googlegroups.com> <865z3nm8f9.fsf@linuxsc.com> <5049cc01-a549-4022-8f4c-deda87ab20c9n@googlegroups.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Injection-Info: reader02.eternal-september.org; posting-host="a3a7a3546220eb2a4a7cfe4d4683198c"; logging-data="21198"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX182L8m26lYYwZVcXbPl04P4hBNTBpb7Q5k=" User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux) Cancel-Lock: sha1:8v/xLEoTnmL7KGp154xhzU8f4CA= sha1:ldvwuClNfX/1uOORufVuvLD2tD4= Xref: csiph.com comp.lang.c:158668 Dave Dunfield writes: [..quoted text rearranged for convenience of responding..] > Please do let me know of any other/continued problems you find. > I'm kindof interested in making this work in the best possible > way. I have some comments below. Probably some of them will be about layout or other style issues, but most should be about program structure. As a warning, my comments might be rather harshly worded in some cases, but please don't take that personally. The comments are meant to be about the code, not about you. > If you want, I can probably easily modify it to output spaces > instead of characters in comment (and preserve all [\] > thereby making the output exactly the same size/shape as the > original. I know this was mentioned in recent messages, please let > me know if you would like me to look into this. For now I think just stick to simply removing // comments (but not removing the ending newlines for those), and replacing /*...*/ comments by a single space. > /* > * whole line comments could easily be absent and should not > * count against to 25 lime function body limit. > * I suggest putting any whole line comments either just before or just after the function they relate to. A line of code can be marked with a //[n] indicator if needed to identify where in the function body a particular whole line comments is meant to apply. > unsigned short // at least 16 bits to insure '& 0xFF00' works > Cp, // Pending input character > C0, // Last character read > Enl; // EscapeNewLine seen > unsigned char > C1, // Previous character read > Imode; // Input mode You have the germ of a good idea here (and carried forward in rdchar1(), discussed below), but the names chosen and the types used get in the way. There is no reason to use either unsigned short or unsigned char: characters should be of type 'int', the counter should be just 'unsigned', and Imode shouldn't be a character at all, but some enum type. The names (ignoring the style of initial capital letter) used for the characters go pretty strongly against the mental grain. I suggest this (I am using all lower case but please don't focus on that): int c0, c1, c2; // { last, current, next } character int s_type; // to remember '\'' or '"' for strings unsigned n_lcs; // number of line continuations // state variable to be declared later > // Line continuation sequences occurring within comments are > // removed. But what about comments starting with [/\*] > // [//] ? Presumably you meant [/\/] at the end there. > // I decided to remove them, That is the right thing to do > // but if you prefer those [\]s to > // remain, change all [/*1] in this file to [//1], [...] It would be nicer if this sort of conditional behavior were done using regular code and not using either comments or the C pre-processor. > // Read a single character, ignoring C sequence: \ > // Added to simulates line splicing performed by "official" CPP. > unsigned rdchar1(void) > { > C1 = C0; > for(;;) { > if(C0 = Cp) > Cp = 0; > else > C0 = getc(stdin); > if(C0 != '\\') // Not an escape > return C0; > if((C0 = getc(stdin)) != '\n') { // Not Escaped Newline > Cp = C0; > return C0 = '\\'; } > /*1*/ ++Enl; // So we can re-insert it in output > //2*/ if(!Imode) > //2*/ fputs("\\\n", stdout); > } > } First the code has a hidden brace (just before the /*1*/ line). Please don't do that. Looking at the semantics, I can't help the feeling that the code is both working too hard and is too hard to understand. Using the declarations I wrote above, it can be written more concisely thus (also I changed the name - 'rdchar1' is horrible): int read_next(){ c0 = c1, n_lcs = 0, c1 = c2, c2 = getchar(); while( c1 == '\\' && c2 == '\n' ){ n_lcs++, c1 = getchar(), c2 = getchar(); } return c1; } Notice the code is simpler because the invariant is easier to write. The values in c0, c1, and c2 are consistent at all times: last, current, and next. The number of LCs (in n_lcs) goes with the character in c1 so it is initialized to zero at the start of the routine. Do you see how this writing corresponds (even if not exactly) to your rdchar1() function? > // Read one character, excluding content of comments > unsigned char rdchar2(void) > { > for(;;) { > // Assumes 8-bit char set, not including 0 and EOF is >255 > if(rdchar1() & 0xFF00) > return 0; > // changing this to multiple 'if's could eliminate the 'switch' line > switch(Imode) { > case '*' : // Inside '/*' comment > if((C1 == '*') && (C0 == '/')) { // Ending '/*' > Imode = Enl = 0; > C0=' '; } // Replace /*...*/ with single space > continue; > case '/' : // Inside '//' comment > if(C1 == '\n') { > Imode = Enl = 0; > break; } > continue; > case 0 : // No special mode > // could be considered as two lines in one, although it's a construct I > // sometimes use if a single 'switch' is the only thing within an 'if' > if(C1 == '/') switch(C0) { > case '*': // Starting '/*' comment > case '/': // Starting '//' comment > Imode = C0; > continue; } } > return C1; } > } > Style comments: the layout is horrible. The name rdchr2() doesn't tell me anything. More hidden braces. Combinations of break's and continue's (break's go with the switch(), continue's go with the for() and are there to pass over the return statement at the end), along with the if()/switch() -- all of these make the code very hard to decipher. Re-writing to clean up the layout and better show the control flow: // Read one character, excluding content of comments unsigned char read_skipping_comment_characters( void ) { for(;;) { if(rdchar1() & 0xFF00) return 0; switch(Imode) { case '*' : // Inside '/*' comment if((C1 == '*') && (C0 == '/')) { // Ending '/*' Imode = Enl = 0; C0=' '; // Replace /*...*/ with single space } break; case '/' : // Inside '//' comment if(C1 == '\n') { Imode = Enl = 0; } break; case 0 : // No special mode if(C1 == '/' && (C0 == '*' || C0 == '/') ){ Imode = C0; break; } /* fallthrough */ default: return C1; } } } Now at least we can see what's going on. The code is kind of strange though. Broadly speaking there are two ways to approach this problem. One, with a state machine. Two, using a lexing or token scanning type algorithm. The function here shows that both are going on - the function as a whole is kind of lexing, but internally it uses a state machine. That's more complicated than it needs to be. > // Read characters, and hande '\\' escape sequences > // Also enter quote input mode if quote character is seen > void handle_esc(void) > { > switch(rdchar2()) { > case '\\': // Escape > putc(C1, stdout); > rdchar2(); > case 0 : // EOF > return; > case '"': > case'\'': > if(Imode == C1) > Imode = 0; > else if(!Imode) { > Imode = C1; } } > } > > int main(void) > { > rdchar2(); // First call to rdchar2 will return 0; > do { > handle_esc(); > [... output per Enl ...] > putc(C1, stdout); > } while(!(C0 & 0xFF00)); // Till EOF has been read > } At this point I am lost. The code seems very confused. Normally I would go through and try to fix things up to the point where it will work, but I kind of gave up here; I thought would be easier to write a whole new program from scratch than try to fix this one. Note by the way that the LCs are output only here, but there is more than one place that rdchar2() is called, which means some LCs are potentially lost (ie, not transmitted to the output, or not transmitted in the right places). Having shown stuff done for the bottom of the newly written program (with the character reading routine read_next()), let's look at the top of the program, that is, mainly, main(): #include typedef enum { W_zero, // ground state W_slant, // a '/' was seen W_c90, // in a C90-style comment W_c99, // in a C99-style comment W_string, // in a string or character constant W_done, // premature ending - error detected } What; static What (*go[ W_done ])( int ); static int read_next( void ); static void out_c1( void ); static void out_lcs( unsigned ); static int c0, c1, c2, s_type; static unsigned n_lcs; int main(){ What w = W_zero; c2 = getchar(); while( w != W_done && read_next() != EOF ){ w = go[w]( c1 ); } return 0; } The variable 'w' has the state of the FSM. The loop calls one of the "go" routines, depending on what state the top-level loop is in. Note that 'w' is local and set only here, and read_next() is called in only one place, in the test of the while() loop. The state of the program is thus quite simple. Because there is only one state for both string literals and character constants, there needs to be some global state for that, and that is the variable 's_type' (and some of the "go" routines needs to set or test that). Here are skeletons for the "go" routines: What go_zero( int c ){ return W_done; // dummy to be replaced } What go_slant( int c ){ return W_done; // dummy to be replaced } What go_c90( int c ){ return W_done; // dummy to be replaced } What go_c99( int c ){ return W_done; // dummy to be replaced } What go_string( int c ){ return W_done; // dummy to be replaced } static What (*go[ W_done ])( int ) = { go_zero, go_slant, go_c90, go_c99, go_string }; The "go" routines can be written in a small number of lines each (no more than 40 or 50 lines altogether). It's convenient for there to be two output routines, one for the character in c1 together with any preceeding line continuation sequence, and for for just an LC sequence without any character attached. void out_c1(){ out_lcs( n_lcs ), putchar( c1 ); } void out_lcs( unsigned n ){ while( n-- > 0 ) fputs( "\\\n", stdout ); } I don't know how much sense all this makes to you. I'm trying to provide what constructive comments I can, but unfortunately am hampered by finding the submitted program code confusing. Maybe this will give you a new idea for how to approach the problem, or an idea for how to simplify your existing code. If you're feeling adventurous you could try filling in the "go" routines and see how that goes (no pun intended). Do you have questions, or any other reactions?