X-Received: by 2002:a05:620a:134e:b0:706:49fb:8049 with SMTP id c14-20020a05620a134e00b0070649fb8049mr933587qkl.36.1674812760909; Fri, 27 Jan 2023 01:46:00 -0800 (PST)
X-Received: by 2002:a05:6808:7db:b0:367:163e:a5e with SMTP id f27-20020a05680807db00b00367163e0a5emr1694860oij.162.1674812760648; Fri, 27 Jan 2023 01:46:00 -0800 (PST)
Path: csiph.com!1.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.programming
Date: Fri, 27 Jan 2023 01:46:00 -0800 (PST)
In-Reply-To: <tqbdu1$1hm7a$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=82.131.36.26; posting-account=ogslnwoAAACd9vU9PADzlWBA81fSuNpL
NNTP-Posting-Host: 82.131.36.26
References: <Scanning-20230119123241@ram.dialup.fu-berlin.de> <tqbdu1$1hm7a$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5c08c3a3-514d-4c19-9d31-8ccd8f64b2bcn@googlegroups.com>
Subject: Re: Scanning
From: V V V V V V V V V V V V V V V V V V <vvvvvvvvaaaaaaaaaaaaaaa@mail.ee>
Injection-Date: Fri, 27 Jan 2023 09:46:00 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5464
Xref: csiph.com comp.programming:16352

You are a devil!




On Thursday, January 19, 2023 at 2:43:51 PM UTC+2, Richard Heathfield wrote:
> On 19/01/2023 12:10 pm, Stefan Ram wrote:
> > Some idle thoughts about scanning (lexical analysis, or
> > rather what comes before it) ...
> >
> > Let's take a very simple task: This scanner for text files
> > has nothing more to do than to return every character,
> > except to strip the spaces at the end of a line.
> >
> > It is a function "get_next_token" that on each call will
> > return the next character from a file to its client (caller),
> > except that spaces at the end of a line will be skipped.
> >
> > So we read the line and strip the spaces. (One line in
> > Python.)
> >
> > But how do I know in advance if the line will fit into
> > memory?
> >
> > Perhaps because of such fears, traditional scanners¹ do not
> > read lines or, Heaven forbid, files, but only characters!
> >
> > They do not use random access with respect to the text to be
> > scanned, but sequential access, although things would be
> > easier with random access.
> >
> > So how would you do it with this style of programming (never
> > reading the whole line into memory)?
> >
> > "I read a character. If it's a space, I peek at the next
> > character, if that's a space, I start adding spaces to my
> > look-ahead buffer. If an EOL is encountered, the look-ahead
> > buffer is discarded. Otherwise, I have to start feeding my
> > client from the lookahead buffer until the lookahead buffer
> > is empty."
> >
> > If I am concerned that a line will not fit in memory, how do
> > I know that the sequence of spaces at the end of a line will
> > fit in memory (the look-ahead buffer)? The look-ahead buffer
> > could be replaced by a counter. If you are paranoid, you
> > would use a 64-bit counter and check it for overflow!
> >
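
A minimal sketch of the counter variant described above, in C and
under my own assumptions: the Scanner struct, scanner_init and the
has_held/overflow fields are hypothetical names, not from either
post, and only ' ' counts as a space. It keeps a 64-bit count of
pending spaces instead of a look-ahead buffer, and flags the
(practically unreachable) overflow instead of wrapping.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical scanner state; names are illustrative only. */
typedef struct {
    FILE *fp;
    uint64_t pending;  /* spaces counted but not yet handed to the client */
    int held;          /* the non-space character that ended the run */
    int has_held;      /* nonzero while 'held' is waiting to be returned */
    int overflow;      /* set if a run of spaces exceeded the counter */
} Scanner;

void scanner_init(Scanner *s, FILE *fp)
{
    s->fp = fp;
    s->pending = 0;
    s->held = 0;
    s->has_held = 0;
    s->overflow = 0;
}

/* Returns the next character, or EOF. Spaces directly before a
   newline (or before end of file) are never returned. */
int get_next_token(Scanner *s)
{
    uint64_t n = 0;
    int c;

    if (s->pending > 0) {          /* replay counted spaces first */
        s->pending--;
        return ' ';
    }
    if (s->has_held) {             /* then the character after them */
        s->has_held = 0;
        return s->held;
    }

    c = getc(s->fp);
    if (c != ' ')
        return c;

    while (c == ' ') {             /* count the run, store nothing */
        if (n == UINT64_MAX) {     /* the paranoid overflow check */
            s->overflow = 1;
            return EOF;
        }
        n++;
        c = getc(s->fp);
    }

    if (c == '\n' || c == EOF)     /* trailing spaces: drop the count */
        return c;

    s->pending = n - 1;            /* one space is returned right now */
    s->held = c;
    s->has_held = 1;
    return ' ';
}
```

Memory use stays constant however long the line or the run of
spaces is, which is the point of preferring the counter to a buffer.
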
> > Is it worth the effort with a look-ahead buffer and
> > sequential access? Should you just read a line, assuming
> > that a line will always fit into memory, and strip the
> > blanks the easy way, i.e., using random access? TIA for any
> > comments!
> >
> > 1
> >
> > an example of a traditional scanner:
> >
> > It only ever calls "GetCh", never "GetLine". The code could
> > be easier to write by reading a whole line and then just
> > using functions that can look at that line using random
> > access to get the next symbol (maybe using regular
> > expressions). But a traditional scanner carefully only ever
> > reads a single character and manages a state.
> >
> > PROCEDURE GetSym;
> >
> > VAR i : CARDINAL;
> >
> > BEGIN
> >   WHILE ch <= ' ' DO GetCh END;
> >   IF ch = '/' THEN
> >     SkipLine;
> >     WHILE ch <= ' ' DO GetCh END
> >   END;
> >   IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
> >     i := 0;
> >     sym := literal;
> >     REPEAT
> >       IF i < IdLength THEN
> >         id [i] := ch;
> >         INC (i)
> >       END;
> >       IF ch > 'Z' THEN sym := ident END;
> >       GetCh
> > ...
>
> man 3 realloc
>
> This was a perennial comp.lang.c topic back in the day.
>
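
For comparison, a rough sketch of the realloc approach. This is not
the fgetline from the quoted post, whose source is not shown here;
read_line_grow and ensure_cap are hypothetical names, and the
doubling policy and the initial size of 64 are arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Grow *buf so it can hold at least 'need' bytes. Returns 0 on
   success, -1 if the size would overflow or realloc fails. */
static int ensure_cap(char **buf, size_t *cap, size_t need)
{
    size_t newcap;
    char *p;

    if (need <= *cap)
        return 0;
    newcap = *cap ? *cap : 64;
    while (newcap < need) {
        if (newcap > SIZE_MAX / 2)
            return -1;
        newcap *= 2;
    }
    p = realloc(*buf, newcap);
    if (p == NULL)
        return -1;              /* *buf is still valid and unchanged */
    *buf = p;
    *cap = newcap;
    return 0;
}

/* Read one line (without its '\n') into *buf, growing it as needed.
   Returns 0 on success, 1 at end of file with nothing read,
   -1 on allocation failure. The caller frees *buf. */
int read_line_grow(char **buf, size_t *cap, size_t *len, FILE *fp)
{
    size_t n = 0;
    int c;

    while ((c = getc(fp)) != EOF && c != '\n') {
        /* room for this character plus the terminating '\0' */
        if (ensure_cap(buf, cap, n + 2) != 0)
            return -1;
        (*buf)[n++] = (char)c;
    }
    if (c == EOF && n == 0)
        return 1;
    if (ensure_cap(buf, cap, n + 1) != 0)   /* covers an empty line */
        return -1;
    (*buf)[n] = '\0';
    *len = n;
    return 0;
}
```

The buffer is reused across calls, so it only grows when a line is
longer than any seen before.
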
> My interface looked (and still looks) like this:
>
> #define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
> #define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
> #define FGDATA_REDUCE 1
>
> int fgetline(char **line, size_t *size, size_t maxrecsize,
>              FILE *fp, unsigned int flags, size_t *plen);
>
> It's easier to use than it might look:
>
> char *data = NULL; /* where will the data go? NULL is fine */
> size_t size = 0; /* how much space do we have right now? */
> size_t len = 0; /* after call, holds line length */
>
> while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
> {
>   if(len > 0)
>
> If you want fgetline.c and don't have 20 years of clc archives,
> just yell.
>
> --
> Richard Heathfield
> Email: rjh at cpax dot org dot uk
> "Usenet is a strange place" - dmr 29 July 1999
> Sig line 4 vacant - apply within