Re: Scanning

From Richard Heathfield <rjh@cpax.org.uk>
Newsgroups comp.programming
Subject Re: Scanning
Date 2023-01-19 12:43 +0000
Organization Fix this later
Message-ID <tqbdu1$1hm7a$1@dont-email.me>
References <Scanning-20230119123241@ram.dialup.fu-berlin.de>



On 19/01/2023 12:10 pm, Stefan Ram wrote:
>    Some idle thoughts about scanning (lexical analysis, or
>    rather what comes before it) ...
> 
>    Let's take a very simple task: This scanner for text files
>    has nothing more to do than to return every character,
>    except to strip the spaces at the end of a line.
> 
>    It is a function "get_next_token" that on each call will
>    return the next character from a file to its client (caller),
>    except that spaces at the end of a line will be skipped.
> 
>    So we read the line and strip the spaces. (One line in
>    Python.)
> 
>    But how do I know in advance if the line will fit into
>    memory?
> 
>    Perhaps because of such fears, traditional scanners¹ do not
>    read lines or, Heaven forbid, files, but only characters!
> 
>    They do not use random access with respect to the text to be
>    scanned, but sequential access, although things would be
>    easier with random access.
> 
>    So how would you do it with this style of programming (never
>    reading the whole line into memory)?
> 
>    "I read a character. If it's a space, I peek at the next
>    character, if that's a space, I start adding spaces to my
>    look-ahead buffer. If an EOL is encountered, the look-ahead
>    buffer is discarded. Otherwise, I have to start feeding my
>    client from the lookahead buffer until the lookahead buffer
>    is empty."
> 
>    If I am concerned that a line will not fit in memory, how do
>    I know that the sequence of spaces at the end of a line will
>    fit in memory (the look-ahead buffer)? The look-ahead buffer
>    could be replaced by a counter. If you are paranoid, you
>    would use a 64-bit counter and check it for overflow!
> 
>    Is it worth the effort with a look-ahead buffer and
>    sequential access? Should you just read a line, assuming
>    that a line will always fit into memory, and strip the
>    blanks the easy way, i.e., using random access? TIA for any
>    comments!
> 
>    ¹ an example of a traditional scanner:
> 
>    It only ever calls "GetCh", never "GetLine". The code could
>    be easier to write by reading a whole line and then just
>    using functions that can look at that line using random
>    access to get the next symbol (maybe using regular
>    expressions). But a traditional scanner carefully only ever
>    reads a single character and manages a state.
> 
> PROCEDURE GetSym;
> 
> VAR     i          : CARDINAL;
> 
> BEGIN
>    WHILE  ch <= ' '  DO  GetCh  END;
>    IF  ch = '/'  THEN
>      SkipLine;
>      WHILE  ch <= ' '  DO  GetCh  END
>    END;
>    IF  (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A')  THEN
>      i := 0;
>      sym := literal;
>      REPEAT
>        IF  i < IdLength  THEN
>          id [i] := ch;
>          INC (i)
>        END;
>        IF  ch > 'Z' THEN  sym := ident  END;
>        GetCh
>        ...

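
For reference, the counter idea in the quoted post (count the spaces instead of buffering them) might be sketched in C roughly as follows. This is only a sketch under assumptions of my own: struct scanner, scanner_init and the overflow flag are hypothetical names, get_next_token merely borrows its name from the post, and treating counter overflow as end of input is one arbitrary choice among several.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical scanner state: a count of pending spaces plus the one
   character that ended the run of spaces. No buffer is needed. */
struct scanner {
    FILE *fp;
    uint64_t pending;   /* spaces counted but not yet handed out */
    int held;           /* character that followed the spaces */
    int has_held;       /* nonzero while 'held' is waiting */
    int overflow;       /* set if the space counter would overflow */
};

static void scanner_init(struct scanner *s, FILE *fp)
{
    assert(s != NULL && fp != NULL);
    s->fp = fp;
    s->pending = 0;
    s->held = 0;
    s->has_held = 0;
    s->overflow = 0;
}

/* Returns the next character, or EOF at end of input.
   Spaces directly before '\n' or before end of file are dropped. */
static int get_next_token(struct scanner *s)
{
    uint64_t n = 0;
    int c;

    if (s->pending > 0) {           /* hand out counted spaces first */
        s->pending--;
        return ' ';
    }
    if (s->has_held) {              /* then the character after them */
        s->has_held = 0;
        return s->held;
    }

    c = getc(s->fp);
    if (c != ' ')
        return c;                   /* ordinary character, '\n' or EOF */

    while (c == ' ') {              /* count the whole run of spaces */
        if (n == UINT64_MAX) {      /* the paranoid overflow check */
            s->overflow = 1;
            return EOF;
        }
        n++;
        c = getc(s->fp);
    }
    if (c == '\n' || c == EOF)
        return c;                   /* trailing spaces: forget the count */

    s->pending = n - 1;             /* interior spaces: replay them */
    s->held = c;
    s->has_held = 1;
    return ' ';
}
```

Memory use stays constant however long the line or the run of spaces is, which is the point of the counter.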
man 3 realloc

This was a perennial comp.lang.c topic back in the day.

My interface looked (and still looks) like this:

#define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
#define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
#define FGDATA_REDUCE  1

int fgetline(char **line, size_t *size, size_t maxrecsize,
             FILE *fp, unsigned int flags, size_t *plen);

It's easier to use than it might look:

   char *data = NULL; /* where will the data go? NULL is fine */
   size_t size = 0;   /* how much space do we have right now? */
   size_t len = 0;    /* after call, holds line length */

   while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
   {
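     if(len > 0)
     {
       /* ... use the len characters now in data ... */
     }
   }
   free(data); /* assumption: the caller owns and frees the buffer */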

If you want fgetline.c and don't have 20 years of clc archives, 
just yell.
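
For anyone who only wants the gist of the realloc approach, here is a minimal sketch. It is not fgetline.c: read_line_grow is a hypothetical name with a deliberately simpler interface (no maxrecsize, no flags), and the starting size of 64 and the doubling policy are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, NOT the fgetline declared above.
   Reads one line of any length into *line, growing the buffer with
   realloc as needed. The newline is not stored.
   Returns 0 on success, 1 at end of file with nothing read,
   -1 if memory runs out (the old buffer is still valid then). */
int read_line_grow(char **line, size_t *size, FILE *fp, size_t *plen)
{
    size_t len = 0;
    int c;

    assert(line != NULL && size != NULL && fp != NULL && plen != NULL);

    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 1 >= *size) {     /* keep room for the terminator */
            size_t newsize = (*size == 0) ? 64 : *size * 2;
            char *tmp;

            if (newsize <= *size)
                return -1;          /* size_t would overflow */
            tmp = realloc(*line, newsize);
            if (tmp == NULL)
                return -1;
            *line = tmp;
            *size = newsize;
        }
        (*line)[len++] = (char)c;
    }

    if (c == EOF && len == 0)
        return 1;                   /* nothing left to read */

    if (*size == 0) {               /* empty first line: no buffer yet */
        char *tmp = realloc(*line, 1);

        if (tmp == NULL)
            return -1;
        *line = tmp;
        *size = 1;
    }
    (*line)[len] = '\0';
    *plen = len;
    return 0;
}
```

The calling pattern is the same as above: start with a NULL pointer and a size of 0, call it in a loop until it stops returning 0, and free the buffer once at the end.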

-- 
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within
