| Newsgroups | comp.programming |
|---|---|
| Date | 2023-01-27 01:46 -0800 |
| References | <Scanning-20230119123241@ram.dialup.fu-berlin.de> <tqbdu1$1hm7a$1@dont-email.me> |
| Message-ID | <5c08c3a3-514d-4c19-9d31-8ccd8f64b2bcn@googlegroups.com> (permalink) |
| Subject | Re: Scanning |
| From | V V V V V V V V V V V V V V V V V V <vvvvvvvvaaaaaaaaaaaaaaa@mail.ee> |
You are a devil!
On Thursday, January 19, 2023 at 2:43:51 PM UTC+2, Richard Heathfield wrote:
> On 19/01/2023 12:10 pm, Stefan Ram wrote:
> > Some idle thoughts about scanning (lexical analysis, or
> > rather what comes before it) ...
> >
> > Let's take a very simple task: This scanner for text files
> > has nothing more to do than to return every character,
> > except to strip the spaces at the end of a line.
> >
> > It is a function "get_next_token" that on each call will
> > return the next character from a file to its client (caller),
> > except that spaces at the end of a line will be skipped.
> >
> > So we read the line and strip the spaces. (One line in
> > Python.)
> >
> > But how do I know in advance if the line will fit into
> > memory?
> >
> > Perhaps because of such fears, traditional scanners¹ do not
> > read lines or, Heaven forbid, files, but only characters!
> >
> > They do not use random access with respect to the text to be
> > scanned, but sequential access, although things would be
> > easier with random access.
> >
> > So how would you do it with this style of programming (never
> > reading the whole line into memory)?
> >
> > "I read a character. If it's a space, I peek at the next
> > character, if that's a space, I start adding spaces to my
> > look-ahead buffer. If an EOL is encountered, the look-ahead
> > buffer is discarded. Otherwise, I have to start feeding my
> > client from the lookahead buffer until the lookahead buffer
> > is empty."
> >
> > If I am concerned that a line will not fit in memory, how do
> > I know that the sequence of spaces at the end of a line will
> > fit in memory (the look-ahead buffer)? The look-ahead buffer
> > could be replaced by a counter. If you are paranoid, you
> > would use a 64-bit counter and check it for overflow!
> >
> > Is it worth the effort with a look-ahead buffer and
> > sequential access? Should you just read a line, assuming
> > that a line will always fit into memory, and strip the
> > blanks the easy way, i.e., using random access? TIA for any
> > comments!
> >
> > ¹ An example of a traditional scanner:
> >
> > It only ever calls "GetCh", never "GetLine". The code could
> > be easier to write by reading a whole line and then just
> > using functions that can look at that line using random
> > access to get the next symbol (maybe using regular
> > expressions). But a traditional scanner carefully only ever
> > reads a single character and manages a state.
> >
> > PROCEDURE GetSym;
> >
> >   VAR i : CARDINAL;
> >
> > BEGIN
> >   WHILE ch <= ' ' DO GetCh END;
> >   IF ch = '/' THEN
> >     SkipLine;
> >     WHILE ch <= ' ' DO GetCh END
> >   END;
> >   IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
> >     i := 0;
> >     sym := literal;
> >     REPEAT
> >       IF i < IdLength THEN
> >         id [i] := ch;
> >         INC (i)
> >       END;
> >       IF ch > 'Z' THEN sym := ident END;
> >       GetCh
> > ...
>
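The counter idea from the quoted post (count the pending spaces instead of buffering them) could be sketched in C roughly as follows. This is only an illustration under assumptions: the `Scanner` struct, `scanner_init`, and the `overflow` flag are hypothetical names that do not appear in the thread, spaces before end-of-file are treated like spaces before a newline, and only the space character (not tabs) is stripped.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical scanner state: a counter replaces the look-ahead buffer. */
typedef struct {
    FILE *fp;
    uint64_t pending;  /* spaces counted but not yet handed to the client */
    int held;          /* the non-space character that ended the run */
    int has_held;      /* nonzero while 'held' is waiting to be returned */
    int overflow;      /* set if a run of spaces exceeded the counter */
} Scanner;

void scanner_init(Scanner *s, FILE *fp)
{
    s->fp = fp;
    s->pending = 0;
    s->held = 0;
    s->has_held = 0;
    s->overflow = 0;
}

/* Returns the next character, skipping spaces at the end of a line.
   Returns EOF at end of input (or on counter overflow, see s->overflow). */
int get_next_token(Scanner *s)
{
    int c;
    uint64_t n;

    if (s->pending > 0) {          /* replay counted spaces first */
        s->pending--;
        return ' ';
    }
    if (s->has_held) {             /* then the character that followed them */
        s->has_held = 0;
        return s->held;
    }

    c = getc(s->fp);
    if (c != ' ')
        return c;                  /* ordinary character, newline, or EOF */

    n = 1;                         /* count the whole run of spaces */
    while ((c = getc(s->fp)) == ' ') {
        if (n == UINT64_MAX) {     /* the "paranoid" overflow check */
            s->overflow = 1;
            return EOF;
        }
        n++;
    }

    if (c == '\n' || c == EOF)
        return c;                  /* trailing spaces: drop the count */

    s->pending = n - 1;            /* one space is returned right now */
    s->held = c;
    s->has_held = 1;
    return ' ';
}
```

The state is constant-size no matter how long the line or the run of spaces is, which is the point of the sequential style; the cost is the extra bookkeeping compared with reading a line and trimming it.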
> man 3 realloc
>
> This was a perennial comp.lang.c topic back in the day.
>
> My interface looked (and still looks) like this:
>
> #define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
> #define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
> #define FGDATA_REDUCE 1
>
> int fgetline(char **line, size_t *size, size_t maxrecsize,
>              FILE *fp, unsigned int flags, size_t *plen);
>
> It's easier to use than it might look:
>
> char *data = NULL; /* where will the data go? NULL is fine */
> size_t size = 0; /* how much space do we have right now? */
> size_t len = 0; /* after call, holds line length */
>
> while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
> {
> if(len > 0)
>
> If you want fgetline.c and don't have 20 years of clc archives,
> just yell.
>
> --
> Richard Heathfield
> Email: rjh at cpax dot org dot uk
> "Usenet is a strange place" - dmr 29 July 1999
> Sig line 4 vacant - apply within
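
For readers without the clc archives, the realloc approach mentioned in the quoted reply might look something like the sketch below. It is not the fgetline from the post: `read_line_grow`, its return codes, the initial size of 64, and the doubling policy are all assumptions made for illustration, and there is no `flags` parameter.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, NOT the fgetline quoted above.
   Reads one line (without the '\n') into *line, growing it with realloc.
   Returns  0 on success,
            1 at end of file with nothing read,
           -1 if realloc failed,
           -2 if the line would need more than maxsize bytes
              (the rest of the line is left unread in the stream). */
int read_line_grow(char **line, size_t *size, size_t maxsize,
                   FILE *fp, size_t *plen)
{
    size_t len = 0;
    int c;

    if (*line == NULL || *size == 0) {      /* first call: get a buffer */
        size_t init = maxsize < 64 ? maxsize : 64;
        char *tmp;
        if (init == 0)
            return -2;
        tmp = realloc(*line, init);
        if (tmp == NULL)
            return -1;
        *line = tmp;
        *size = init;
    }

    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 1 >= *size) {             /* no room for char + '\0' */
            size_t newsize = (*size > maxsize / 2) ? maxsize : *size * 2;
            char *tmp;
            if (newsize <= *size) {         /* already at the cap */
                (*line)[len] = '\0';
                *plen = len;
                return -2;
            }
            tmp = realloc(*line, newsize);
            if (tmp == NULL)
                return -1;                  /* old buffer is still valid */
            *line = tmp;
            *size = newsize;
        }
        (*line)[len++] = (char)c;
    }

    if (c == EOF && len == 0)
        return 1;

    (*line)[len] = '\0';
    *plen = len;
    return 0;
}
```

The caller keeps passing the same `data` and `size` pair on every call, so the buffer is reused and only grows when a longer line turns up; a `maxsize` cap addresses the "will the line fit into memory?" worry by turning it into an explicit error.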