Re: Scanning

Subject	Re: Scanning
Newsgroups	comp.programming
References	<Scanning-20230119123241@ram.dialup.fu-berlin.de>
From	Richard Damon <Richard@Damon-Family.org>
Message-ID	<e_ayL.85067$SdR7.59556@fx04.iad> (permalink)
Organization	Forte - www.forteinc.com
Date	2023-01-19 07:48 -0500

Show all headers | View raw

On 1/19/23 7:10 AM, Stefan Ram wrote:
>    Some idle thoughts about scanning (lexical analysis, or
>    rather what comes before it) ...
> 
>    Let's take a very simple task: This scanner for text files
>    has nothing more to do than to return every character,
>    except to strip the spaces at the end of a line.
> 
>    It is a function "get_next_token" that on each call will
>    return the next character from a file to its client (caller),
>    except that spaces at the end of a line will skipped.
> 
>    So we read the line and strip the spaces. (One line in
>    Python.)
> 
>    But how do I know in advance if the line will fit into
>    memory?
> 
>    Perhaps because of such fears, traditional scanners¹ do not
>    read lines or, Heaven forbid, files, but only characters!
> 
>    They do not use random access with respect to the text to be
>    scanned, but sequential access, although things would be
>    easier with random access.
> 
>    So how would you do it with this style of programming (never
>    reading the whole line into memory)?
> 
>    "I read a character. If it's a space, I peek at the next
>    character, if that's a space, I start adding spaces to my
>    look-ahead buffer. If an EOL is encountered, the look-ahead
>    buffer is discarded. Otherwise, I have to start feeding my
>    client from the lookahead buffer until the lookahead buffer
>    is empty."
> 
>    If I am concerned that a line will not fit in memory, how do
>    I know that the sequence of spaces at the end of a line will
>    fit in memory (the look-ahead buffer)? The look-ahead buffer
>    could be replaced by a counter. If you are paranoid, you
>    would use a 64-bit counter and check it for overflow!
> 
>    Is it worth the effort with a look-ahead buffer and
>    sequential access? Should you just read a line, assuming
>    that a line will always fit into memory, and strip the
>    blanks the easy way, i.e., using random access? TIA for any
>    comments!
> 
>    1
> 
>    an example of a traditional scanner:
> 
>    It only ever calls "GetCh", never "GetLine". The code could
>    be easier to write by reading a whole line and then just
>    using functions that can look at that line using random
>    access to get the next symbol (maybe using regular
>    expressions). But a traditional scanner carefully only ever
>    reads a single character and manages a state.
> 
> PROCEDURE GetSym;
> 
> VAR     i          : CARDINAL;
> 
> BEGIN
>    WHILE  ch <= ' '  DO  GetCh  END;
>    IF  ch = '/'  THEN
>      SkipLine;
>      WHILE  ch <= ' '  DO  GetCh  END
>    END;
>    IF  (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A')  THEN
>      i := 0;
>      sym := literal;
>      REPEAT
>        IF  i < IdLength  THEN
>          id [i] := ch;
>          INC (i)
>        END;
>        IF  ch > 'Z' THEN  sym := ident  END;
>        GetCh
>        ...
> 
> 

Because of the particulars of this problem, you don't need a look-ahead 
buffer, just a count of spaces you have seen and what character is after 
the spaces.

If you had to handle multiple types of whitespace (arbitrary mix of tabs 
and spaces for example) then you would need a buffer, and you need to 
try and hand;e the case of that sequence being "too long".

In general, parsers need a way to report that the file it too 
"complicated" for it.

Even the simple counter version has a limit, when ever the type of the 
counter overflows

Back to comp.programming | Previous | Next | Find similar

Thread

Re: Scanning Richard Damon <Richard@Damon-Family.org> - 2023-01-19 07:48 -0500

csiph-web