| From | Richard Heathfield <rjh@cpax.org.uk> |
|---|---|
| Newsgroups | comp.programming |
| Subject | Re: Scanning |
| Date | 2023-01-19 12:43 +0000 |
| Organization | Fix this later |
| Message-ID | <tqbdu1$1hm7a$1@dont-email.me> |
| References | <Scanning-20230119123241@ram.dialup.fu-berlin.de> |
On 19/01/2023 12:10 pm, Stefan Ram wrote:
> Some idle thoughts about scanning (lexical analysis, or
> rather what comes before it) ...
>
> Let's take a very simple task: This scanner for text files
> has nothing more to do than to return every character,
> except to strip the spaces at the end of a line.
>
> It is a function "get_next_token" that on each call will
> return the next character from a file to its client (caller),
> except that spaces at the end of a line will be skipped.
>
> So we read the line and strip the spaces. (One line in
> Python.)
>
> But how do I know in advance if the line will fit into
> memory?
>
> Perhaps because of such fears, traditional scanners¹ do not
> read lines or, Heaven forbid, files, but only characters!
>
> They do not use random access with respect to the text to be
> scanned, but sequential access, although things would be
> easier with random access.
>
> So how would you do it with this style of programming (never
> reading the whole line into memory)?
>
> "I read a character. If it's a space, I peek at the next
> character, if that's a space, I start adding spaces to my
> look-ahead buffer. If an EOL is encountered, the look-ahead
> buffer is discarded. Otherwise, I have to start feeding my
> client from the lookahead buffer until the lookahead buffer
> is empty."
>
> If I am concerned that a line will not fit in memory, how do
> I know that the sequence of spaces at the end of a line will
> fit in memory (the look-ahead buffer)? The look-ahead buffer
> could be replaced by a counter. If you are paranoid, you
> would use a 64-bit counter and check it for overflow!
>
> Is it worth the effort with a look-ahead buffer and
> sequential access? Should you just read a line, assuming
> that a line will always fit into memory, and strip the
> blanks the easy way, i.e., using random access? TIA for any
> comments!
>
> ¹ An example of a traditional scanner:
>
> It only ever calls "GetCh", never "GetLine". The code could
> be easier to write by reading a whole line and then just
> using functions that can look at that line using random
> access to get the next symbol (maybe using regular
> expressions). But a traditional scanner carefully only ever
> reads a single character and manages a state.
>
> PROCEDURE GetSym;
>
> VAR i : CARDINAL;
>
> BEGIN
> WHILE ch <= ' ' DO GetCh END;
> IF ch = '/' THEN
> SkipLine;
> WHILE ch <= ' ' DO GetCh END
> END;
> IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
> i := 0;
> sym := literal;
> REPEAT
> IF i < IdLength THEN
> id [i] := ch;
> INC (i)
> END;
> IF ch > 'Z' THEN sym := ident END;
> GetCh
> ...
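
The counter variant described in the quoted text can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names `struct scanner`, `SCAN_OVERFLOW` and the exact shape of `get_next_token` are hypothetical, only the space character (not tabs) counts as strippable, and a run of spaces before end of file is treated like a run before a newline.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define SCAN_OVERFLOW (-2)  /* hypothetical error code, distinct from EOF */

/* Hypothetical scanner state: a count of spaces still owed to the
   caller, plus the one non-space character read behind them. */
struct scanner {
    FILE *fp;
    unsigned long long pending_spaces;
    int held;      /* character that ended the run of spaces */
    int has_held;  /* non-zero while 'held' is waiting to be returned */
};

/* Returns the next character, EOF at end of input, or SCAN_OVERFLOW.
   Spaces directly before '\n' or EOF are never returned. */
int get_next_token(struct scanner *s)
{
    int c;
    unsigned long long n;

    assert(s != NULL && s->fp != NULL);

    if (s->pending_spaces > 0) {      /* replay spaces counted earlier */
        s->pending_spaces--;
        return ' ';
    }
    if (s->has_held) {                /* then the character after them */
        s->has_held = 0;
        return s->held;
    }

    c = getc(s->fp);
    if (c != ' ')
        return c;

    n = 1;                            /* count the whole run of spaces */
    while ((c = getc(s->fp)) == ' ') {
        if (n == ULLONG_MAX)          /* the paranoid overflow check */
            return SCAN_OVERFLOW;
        n++;
    }
    if (c == '\n' || c == EOF)
        return c;                     /* trailing run: drop the spaces */

    s->pending_spaces = n - 1;        /* one space is returned right now */
    s->held = c;
    s->has_held = 1;
    return ' ';
}
```

A caller would start with `struct scanner s = { fp, 0, 0, 0 };` and call `get_next_token(&s)` until it returns `EOF`. The state is a fixed size no matter how long the run of spaces is, which is the point of using a counter instead of a look-ahead buffer.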
man 3 realloc
This was a perennial comp.lang.c topic back in the day.
My interface looked (and still looks) like this:
#define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
#define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
#define FGDATA_REDUCE 1
int fgetline(char **line, size_t *size, size_t maxrecsize,
             FILE *fp, unsigned int flags, size_t *plen);
It's easier to use than it might look:
char *data = NULL; /* where will the data go? NULL is fine */
size_t size = 0; /* how much space do we have right now? */
size_t len = 0; /* after call, holds line length */
while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
{
if(len > 0)
If you want fgetline.c and don't have 20 years of clc archives,
just yell.
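
For readers without the archives, here is a minimal sketch of the general realloc approach. It is not fgetline.c: the function names `read_line` and `strip_trailing_spaces`, the return codes, and the doubling strategy are all assumptions made for illustration, and there is no `flags` parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, NOT the real fgetline. Reads one line into a
   buffer that grows via realloc. Returns 0 on success, 1 at end of
   file with nothing read, -1 if realloc fails, -2 if the line would
   need more than maxsize bytes. The '\n' is not stored. */
int read_line(char **line, size_t *size, size_t maxsize,
              FILE *fp, size_t *plen)
{
    size_t len = 0;
    int c;

    assert(line != NULL && size != NULL && fp != NULL && plen != NULL);

    for (;;) {
        if (len + 1 >= *size) {       /* need room for a char plus '\0' */
            size_t newsize;
            char *tmp;

            if (*size == 0)
                newsize = 64;         /* arbitrary starting size */
            else if (*size > maxsize / 2)
                newsize = maxsize;    /* avoid overflow when doubling */
            else
                newsize = *size * 2;
            if (newsize > maxsize)
                newsize = maxsize;
            if (newsize <= len + 1)
                return -2;            /* cannot grow within the limit */

            tmp = realloc(*line, newsize);
            if (tmp == NULL)
                return -1;            /* old buffer is still valid */
            *line = tmp;
            *size = newsize;
        }
        c = getc(fp);
        if (c == EOF || c == '\n')
            break;
        (*line)[len++] = (char)c;
    }
    (*line)[len] = '\0';
    *plen = len;
    return (c == EOF && len == 0) ? 1 : 0;
}

/* With the whole line in memory, stripping is plain random access.
   Returns the new length. */
size_t strip_trailing_spaces(char *line, size_t len)
{
    while (len > 0 && line[len - 1] == ' ')
        len--;
    line[len] = '\0';
    return len;
}
```

The buffer is reused across calls, so the caller frees it once at the end. Passing a modest `maxsize` instead of `(size_t)-1` is how you stop a hostile or broken input from eating all your memory. Read errors are not distinguished from end of file here; check `ferror(fp)` if that matters.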
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within