Re: How can the speed of a scanner be independent of the number of rules?

Path	csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From	Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups	comp.compilers
Subject	Re: How can the speed of a scanner be independent of the number of rules?
Date	Wed, 23 Mar 2022 21:27:29 -0000 (UTC)
Organization	A noiseless patient Spider
Lines	108
Sender	news@iecc.com
Approved	comp.compilers@iecc.com
Message-ID	<22-03-048@comp.compilers> (permalink)
References	<22-03-047@comp.compilers>
Injection-Info	gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="64186"; mail-complaints-to="abuse@iecc.com"
Keywords	lex, performance, comment
Posted-Date	23 Mar 2022 17:49:27 EDT
X-submission-address	compilers@iecc.com
X-moderator-address	compilers-request@iecc.com
X-FAQ-and-archives	http://compilers.iecc.com
Xref	csiph.com comp.compilers:2950

Show key headers only | View raw

On 2022-03-23, Roger L Costello <costello@mitre.org> wrote:
> Hi Folks,
>
> On page 48 of the Flex manual [1] it says this amazing thing:
>
> Note that adding rules does not slow down the scanner! The speed of the
> scanner is independent of the number of rules or (modulo the considerations
> given at the beginning of this section) how complicated the rules are with
> regard to operators such as '*' and '|'.
>
> That is amazing! And counterintuitive. How can it possibly be that a scanner
> containing 1000 rules can operate as fast as a scanner containing 10 rules?
> Would you give some intuition to help me understand this, please?

To understand this, we must firstly understand the basic assumption of
the comparison: that the input test case presented to both versions of
the scanner is exactly the same. Of course, otherwise it's apples and
oranges, right?

So then, suppose we have some existing scanner with 10 rules, which
correctly tokenizes an input, Then suppose we add 990 rules to it. None
of these rules take precedence over the 10 rules, and so the the input
is handled by the same rules.

The result is that although the new scanner has a larger automaton with
more states, the state space being traversed in response to the original
input test case is still the same: it's traversing the same state
transitions coming from the ten rules.

Of course, a language that requires 1000 tokenizing rules will be slower
to handle if the average input test cases actually exercise most of the
rules: i.e. the instances of the language that actually occur do trigger
hundreds of different rules.

In terms of theoretical computer science, it cannot be true that there
is no slowdown regardless of the number of rules added. This is because
the rules are compiled into tables, and tables are indexed by integers.
Integers have to get wider (more bits) with increasing table size.

In performance testing on real macines, we are shielded from this effect
in situations in which we have nothing but 32 or 64 bit integers and
pointers regardless of the test size (which never exceeds the capacity
of the equipment).

There is another thing to consider which is that the claimed property
is true because of the compilatio technology used for handling the Lex
patterns: they are compiled to a DFA.

Lex more or less combines all of the pattern rules into a single giant
regex, as if by a kind of disjunction operator: R1|R2|R3|...

If NFA simulation were being used to run the giant regex, then it would
indeed get slower due to more rules. The reason is that the states of
a NFA simulator consist of *sets* of states of the NFA graph.
The more NFA states that are activated by a given
input character, the the more NFA states the simulator has to stuff into
the set object that represents a simulation state.

For instance suppose three of your 10 rules match a token starting
with a.  When the first input symbol of the input is a, then the
NFA simulator has be in three NFA states at the same time, corresponding
to the NFA graph's branches into the three sub-graphs for the tree
rules. The simulator adds those three states to a set which represents
its state. If you add a hundred rules that can match starting with a,
then now you have 103 things that have to go into that set.

But lex does this all this set arithmetic at compile time by converting
NFA to DFA.   The "subset construction" algorithm identifies sets of NFA
states that are activated by input symbols, and turns them into simple
states (that can be given integer indices or whatever). The identity of
a DFA state that represents an NFA being in 500 states simultaneously is
no larger than that of a DFA state representing the NFA being in 3 states.

But, here is where should be a little suspicious about the claim.
The DFA states from a Lex job that has more rules may have more
transitions out of them. There could be some caching effect there
that robs a little performance. The original 10 rules are not
necessarily going to be co-located in the a compact area of memory which
caches as well as it did before.

Integration of the lexer into the larger program can make a difference.

If the lexer is in a test case that does nothing but discard tokens,
it may be that even though the 1000 rule lexer has a bigger cache
footprint, it doesn't matter.

But now add a parser in there with its own state spaces to traverse, and
the combined cache footprint now goes over some "knee".

Basically, this is something that really needs to be tested, and
preferrably by a researcher who knows how to contrive the rules both
to tickle the worst cases out of Lex, as well as cases that are
representative of "real world" use, and report on both of these.

The claim that Lex doesn't get slower with more rules could well succumb
to some pathological cases and thus shown not to be literally true.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[You are right that if it's dividing input into more tokens, it'll
be slower.  But imagine I'm writing a COBOL lexer, and the 990 new
rules are all literal keywords, so it's the same number of tokens,
just moving the keyword recognition into the DFA instead of recognizing
an identifier-or-keyword string and checking the keywords
outside the lexer.  I'd think the lexer speed would be the same, maybe
even a little faster depending on how fast the other keyword checker
was. -John]

Thread

How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-23 18:49 +0000
  Re: How can the speed of a scanner be independent of the number of rules? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-03-23 21:27 +0000
    Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-23 19:57 -0700
    Re: How can the speed of a scanner be independent of the number of rules? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 13:12 +0200
    Re: How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-24 11:53 +0000
      Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:56 -0700
        Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-24 13:08 -0700
  Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 02:01 -0700
    Re: Parallel scanning, was How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:18 -0700
      Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 22:32 +0200
        Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 08:04 +0000
          Re: Parallel lexers by chunking the input gah4 <gah4@u.washington.edu> - 2022-03-25 07:58 -0700
          Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 17:47 +0000
            Parallel lexers by chunking the input. Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 00:36 +0200
            Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 01:00 +0200
            Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-26 09:27 +0000

csiph-web