


Re: Scanning

Newsgroups comp.programming
Date 2023-01-27 01:46 -0800
References <Scanning-20230119123241@ram.dialup.fu-berlin.de> <tqbdu1$1hm7a$1@dont-email.me>
Message-ID <5c08c3a3-514d-4c19-9d31-8ccd8f64b2bcn@googlegroups.com> (permalink)
Subject Re: Scanning
From V V V V V V V V V V V V V V V V V V <vvvvvvvvaaaaaaaaaaaaaaa@mail.ee>



You are a devil!




On Thursday, January 19, 2023 at 2:43:51 PM UTC+2, Richard Heathfield wrote:
> On 19/01/2023 12:10 pm, Stefan Ram wrote: 
> > Some idle thoughts about scanning (lexical analysis, or 
> > rather what comes before it) ... 
> > 
> > Let's take a very simple task: This scanner for text files 
> > has nothing more to do than to return every character, 
> > except to strip the spaces at the end of a line. 
> > 
> > It is a function "get_next_token" that on each call will 
> > return the next character from a file to its client (caller), 
> > except that spaces at the end of a line will be skipped.
> > 
> > So we read the line and strip the spaces. (One line in 
> > Python.) 
> > 
> > But how do I know in advance if the line will fit into 
> > memory? 
> > 
> > Perhaps because of such fears, traditional scanners¹ do not 
> > read lines or, Heaven forbid, files, but only characters! 
> > 
> > They do not use random access with respect to the text to be 
> > scanned, but sequential access, although things would be 
> > easier with random access. 
> > 
> > So how would you do it with this style of programming (never 
> > reading the whole line into memory)? 
> > 
> > "I read a character. If it's a space, I peek at the next 
> > character, if that's a space, I start adding spaces to my 
> > look-ahead buffer. If an EOL is encountered, the look-ahead 
> > buffer is discarded. Otherwise, I have to start feeding my 
> > client from the lookahead buffer until the lookahead buffer 
> > is empty." 
> > 
> > If I am concerned that a line will not fit in memory, how do 
> > I know that the sequence of spaces at the end of a line will 
> > fit in memory (the look-ahead buffer)? The look-ahead buffer 
> > could be replaced by a counter. If you are paranoid, you 
> > would use a 64-bit counter and check it for overflow! 
> > 
> > Is it worth the effort with a look-ahead buffer and 
> > sequential access? Should you just read a line, assuming 
> > that a line will always fit into memory, and strip the 
> > blanks the easy way, i.e., using random access? TIA for any 
> > comments! 
> > 
> > ¹
> > 
> > an example of a traditional scanner: 
> > 
> > It only ever calls "GetCh", never "GetLine". The code could 
> > be easier to write by reading a whole line and then just 
> > using functions that can look at that line using random 
> > access to get the next symbol (maybe using regular 
> > expressions). But a traditional scanner carefully only ever 
> > reads a single character and manages a state. 
> > 
> > PROCEDURE GetSym;
> >
> > VAR i : CARDINAL;
> >
> > BEGIN
> >   WHILE ch <= ' ' DO GetCh END;
> >   IF ch = '/' THEN
> >     SkipLine;
> >     WHILE ch <= ' ' DO GetCh END
> >   END;
> >   IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
> >     i := 0;
> >     sym := literal;
> >     REPEAT
> >       IF i < IdLength THEN
> >         id [i] := ch;
> >         INC (i)
> >       END;
> >       IF ch > 'Z' THEN sym := ident END;
> >       GetCh
> >   ...
> 
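
[Sketch of the counter idea from the quoted post above: instead of a
look-ahead buffer, count the run of spaces and replay it only if the run
turns out not to end the line. This is a minimal illustration, not code
from the thread. The Scanner struct, scanner_init, and the overflowed
flag are hypothetical names, and it assumes that spaces right before EOF
count as trailing spaces too.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical scanner state (names are illustrative only). */
typedef struct {
    FILE    *fp;
    uint64_t pending_spaces; /* counted spaces still owed to the caller    */
    int      held;           /* character that ended the last space run    */
    int      has_held;       /* nonzero while 'held' is waiting to be sent */
    int      overflowed;     /* set if a space run exceeded the counter    */
} Scanner;

void scanner_init(Scanner *s, FILE *fp)
{
    assert(s != NULL && fp != NULL);
    s->fp = fp;
    s->pending_spaces = 0;
    s->held = 0;
    s->has_held = 0;
    s->overflowed = 0;
}

/* Returns the next character, or EOF. Spaces directly before '\n'
   (or before EOF) are dropped; all other characters pass through. */
int get_next_token(Scanner *s)
{
    for (;;) {
        if (s->pending_spaces > 0) {     /* replay a counted run of spaces */
            s->pending_spaces--;
            return ' ';
        }
        if (s->has_held) {               /* then the char that ended it */
            s->has_held = 0;
            return s->held;
        }
        int c = getc(s->fp);
        if (c != ' ')
            return c;

        uint64_t n = 1;                  /* count the run, do not store it */
        while ((c = getc(s->fp)) == ' ') {
            if (n == UINT64_MAX) {       /* the paranoid overflow check */
                s->overflowed = 1;
                return EOF;
            }
            n++;
        }
        s->held = c;
        s->has_held = 1;
        if (c != '\n' && c != EOF)
            s->pending_spaces = n;       /* run was inside the line: keep it */
        /* otherwise the run was trailing and is simply forgotten */
    }
}
```

[Memory use is constant no matter how long the line or the run of spaces
is, which is the point of the sequential style. The cost is the extra
state that a read-the-line-and-strip version would not need.]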
> man 3 realloc 
> 
> This was a perennial comp.lang.c topic back in the day. 
> 
> My interface looked (and still looks) like this: 
> 
> #define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */ 
> #define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification") 
> #define FGDATA_REDUCE 1 
> 
> int fgetline(char **line, size_t *size, size_t maxrecsize,
>              FILE *fp, unsigned int flags, size_t *plen);
> 
> It's easier to use than it might look: 
> 
> char *data = NULL; /* where will the data go? NULL is fine */ 
> size_t size = 0; /* how much space do we have right now? */ 
> size_t len = 0; /* after call, holds line length */ 
> 
> while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0) 
> { 
> if(len > 0) 
> 
> If you want fgetline.c and don't have 20 years of clc archives, 
> just yell. 
> 
> -- 
> Richard Heathfield 
> Email: rjh at cpax dot org dot uk 
> "Usenet is a strange place" - dmr 29 July 1999 
> Sig line 4 vacant - apply within
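
[For readers without the clc archives: a rough sketch of the realloc
approach the quoted reply points at. This is NOT Richard Heathfield's
fgetline. read_line_grow, ensure_capacity, the return codes, and the
starting capacity of 64 are all assumptions made for illustration. It
grows the buffer by doubling and refuses lines longer than max_len.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Grow *line to hold at least 'need' bytes. Returns 0, or -1 on failure. */
static int ensure_capacity(char **line, size_t *cap, size_t need)
{
    if (need <= *cap)
        return 0;
    size_t new_cap = *cap ? *cap : 64;   /* arbitrary starting size */
    while (new_cap < need) {
        if (new_cap > SIZE_MAX / 2)
            return -1;                   /* doubling would overflow */
        new_cap *= 2;
    }
    char *p = realloc(*line, new_cap);   /* realloc(NULL, n) acts as malloc */
    if (p == NULL)
        return -1;                       /* old buffer is still valid */
    *line = p;
    *cap = new_cap;
    return 0;
}

/* Hypothetical line reader. Reads up to, and discards, the next '\n'.
   Returns  0 when a line was read (*len holds its length),
            1 at end of file with nothing read,
           -1 on allocation failure,
           -2 when the line is longer than max_len. */
int read_line_grow(char **line, size_t *cap, size_t max_len,
                   FILE *fp, size_t *len)
{
    assert(line != NULL && cap != NULL && fp != NULL && len != NULL);
    size_t n = 0;
    int c;

    if (ensure_capacity(line, cap, 1) != 0)
        return -1;
    while ((c = getc(fp)) != EOF && c != '\n') {
        if (n >= max_len) {
            (*line)[n] = '\0';
            *len = n;
            return -2;
        }
        if (ensure_capacity(line, cap, n + 2) != 0)  /* char plus '\0' */
            return -1;
        (*line)[n++] = (char)c;
    }
    (*line)[n] = '\0';
    *len = n;
    return (c == EOF && n == 0) ? 1 : 0;
}
```

[The caller starts with a NULL pointer and a size of 0, reuses the same
buffer for every line, and calls free() once at the end. Passing
(size_t)-1 as max_len means "no limit", as in the quoted usage.]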
