From: V V V V V V V V V V V V V V V V V V
Newsgroups: comp.programming
Subject: Re: Scanning
Date: Fri, 27 Jan 2023 01:46:00 -0800 (PST)
Message-ID: <5c08c3a3-514d-4c19-9d31-8ccd8f64b2bcn@googlegroups.com>

You are a devil !

On Thursday, January 19, 2023 at 2:43:51 PM UTC+2, Richard Heathfield wrote:
> On 19/01/2023 12:10 pm, Stefan Ram wrote:
> > Some idle thoughts about scanning (lexical analysis, or
> > rather what comes before it) ...
> >
> > Let's take a very simple task: This scanner for text files
> > has nothing more to do than to return every character,
> > except to strip the spaces at the end of a line.
> >
> > It is a function "get_next_token" that on each call will
> > return the next character from a file to its client (caller),
> > except that spaces at the end of a line will be skipped.
> >
> > So we read the line and strip the spaces. (One line in
> > Python.)
> >
> > But how do I know in advance if the line will fit into
> > memory?
> >
> > Perhaps because of such fears, traditional scanners¹ do not
> > read lines or, Heaven forbid, files, but only characters!
> >
> > They do not use random access with respect to the text to be
> > scanned, but sequential access, although things would be
> > easier with random access.
> >
> > So how would you do it with this style of programming (never
> > reading the whole line into memory)?
> >
> > "I read a character. If it's a space, I peek at the next
> > character, if that's a space, I start adding spaces to my
> > look-ahead buffer. If an EOL is encountered, the look-ahead
> > buffer is discarded. Otherwise, I have to start feeding my
> > client from the lookahead buffer until the lookahead buffer
> > is empty."
> >
> > If I am concerned that a line will not fit in memory, how do
> > I know that the sequence of spaces at the end of a line will
> > fit in memory (the look-ahead buffer)? The look-ahead buffer
> > could be replaced by a counter. If you are paranoid, you
> > would use a 64-bit counter and check it for overflow!
> >
> > Is it worth the effort with a look-ahead buffer and
> > sequential access? Should you just read a line, assuming
> > that a line will always fit into memory, and strip the
> > blanks the easy way, i.e., using random access? TIA for any
> > comments!
> >
> > 1
> >
> > an example of a traditional scanner:
> >
> > It only ever calls "GetCh", never "GetLine". The code could
> > be easier to write by reading a whole line and then just
> > using functions that can look at that line using random
> > access to get the next symbol (maybe using regular
> > expressions). But a traditional scanner carefully only ever
> > reads a single character and manages a state.
> >
> > PROCEDURE GetSym;
> >
> > VAR i : CARDINAL;
> >
> > BEGIN
> >   WHILE ch <= ' ' DO GetCh END;
> >   IF ch = '/' THEN
> >     SkipLine;
> >     WHILE ch <= ' ' DO GetCh END
> >   END;
> >   IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
> >     i := 0;
> >     sym := literal;
> >     REPEAT
> >       IF i < IdLength THEN
> >         id [i] := ch;
> >         INC (i)
> >       END;
> >       IF ch > 'Z' THEN sym := ident END;
> >       GetCh
> > ...
>
> man 3 realloc
>
> This was a perennial comp.lang.c topic back in the day.
>
> My interface looked (and still looks) like this:
>
> #define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
> #define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
> #define FGDATA_REDUCE 1
>
> int fgetline(char **line, size_t *size, size_t maxrecsize,
>              FILE *fp, unsigned int flags, size_t *plen);
>
> It's easier to use than it might look:
>
> char *data = NULL; /* where will the data go? NULL is fine */
> size_t size = 0;   /* how much space do we have right now? */
> size_t len = 0;    /* after call, holds line length */
>
> while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
> {
>   if(len > 0)
>
> If you want fgetline.c and don't have 20 years of clc archives,
> just yell.
>
> --
> Richard Heathfield
> Email: rjh at cpax dot org dot uk
> "Usenet is a strange place" - dmr 29 July 1999
> Sig line 4 vacant - apply within
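
For reference, the counter idea from Stefan Ram's post (replace the look-ahead buffer with a count of pending spaces) might look roughly like this in C. This is only a sketch under assumptions: the names Scanner, scanner_init, copy_stripped and SCAN_OVERFLOW are made up for illustration and come from nobody's actual code, only ' ' counts as a space, and spaces before end-of-file are treated like spaces before an end-of-line.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical error code, distinct from EOF and from any character value. */
#define SCAN_OVERFLOW (EOF - 1)

/* Hypothetical scanner state (names are illustrative only). */
typedef struct {
    FILE *fp;
    uint64_t pending; /* spaces counted but not yet handed to the client */
    int held;         /* character that ended the run of spaces, EOF if none */
} Scanner;

void scanner_init(Scanner *s, FILE *fp)
{
    s->fp = fp;
    s->pending = 0;
    s->held = EOF;
}

/* Returns the next character, skipping spaces at the end of a line.
   Returns EOF at end of input and SCAN_OVERFLOW if the counter would wrap. */
int get_next_token(Scanner *s)
{
    for (;;) {
        if (s->pending > 0) {      /* replay counted spaces first */
            s->pending--;
            return ' ';
        }
        if (s->held != EOF) {      /* then the character that followed them */
            int h = s->held;
            s->held = EOF;
            return h;
        }
        int c = getc(s->fp);
        if (c != ' ')
            return c;              /* ordinary character, newline, or EOF */

        uint64_t n = 1;            /* first space of a run */
        while ((c = getc(s->fp)) == ' ') {
            if (n == UINT64_MAX)
                return SCAN_OVERFLOW;
            n++;
        }
        if (c == '\n' || c == EOF)
            return c;              /* trailing spaces: discard the count */
        s->pending = n;            /* interior spaces: replay them later */
        s->held = c;
    }
}

/* Copies in to out with trailing spaces removed; returns 0 on success. */
int copy_stripped(FILE *in, FILE *out)
{
    Scanner s;
    int c;
    scanner_init(&s, in);
    while ((c = get_next_token(&s)) != EOF) {
        if (c == SCAN_OVERFLOW)
            return -1;
        putc(c, out);
    }
    return 0;
}
```

Memory use stays constant however long the line or the run of spaces is, which is the point of the counter. A filter would just call copy_stripped(stdin, stdout) from main.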