Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah4 <gah4@u.washington.edu>
Newsgroups: comp.compilers
Subject: Re: How can the speed of a scanner be independent of the number of rules?
Date: Wed, 23 Mar 2022 19:57:45 -0700 (PDT)
Organization: Compilers Central
Lines: 32
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-050@comp.compilers>
References: <22-03-047@comp.compilers> <22-03-048@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="64200"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 24 Mar 2022 13:02:35 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-03-048@comp.compilers>
Xref: csiph.com comp.compilers:2951

On Wednesday, March 23, 2022 at 2:49:30 PM UTC-7, Kaz Kylheku wrote:
> On 2022-03-23, Roger L Costello <cost...@mitre.org> wrote:

(snip)
> > Note that adding rules does not slow down the scanner! The speed of the
> > scanner is independent of the number of rules or (modulo the considerations
> > given at the beginning of this section) how complicated the rules are with
> > regard to operators such as '*' and '|'.

(snip)

> In terms of theoretical computer science, it cannot be true that there
> is no slowdown regardless of the number of rules added. This is because
> the rules are compiled into tables, and tables are indexed by integers.
> Integers have to get wider (more bits) with increasing table size.

Yes, but 32 bit integers will get a huge number of states.

(snip)

> If the lexer is in a test case that does nothing but discard tokens,
> it may be that even though the 1000 rule lexer has a bigger cache
> footprint, it doesn't matter.

Yes, cache is the complication of just about any speed comparison.
And in this case, it depends not only on the scanner, but could be
sensitive to the actual input data.

It would seem that you could do the comparison based on the same
scanner, but using either UTF-8 or UTF-16 coded characters.

That would be closer to an apple vs. apple comparison.