Path: csiph.com!1.us.feeder.erje.net!3.us.feeder.erje.net!feeder.erje.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Re: How can the speed of a scanner be independent of the number of rules?
Date: Thu, 24 Mar 2022 13:12:20 +0200
Organization: Compilers Central
Lines: 135
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-054@comp.compilers>
References: <22-03-047@comp.compilers> <22-03-048@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="69235"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 24 Mar 2022 13:29:43 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2953

I see that Kaz Kylheku has written a thought out and well-reasoned
response.  However, it isn't based upon actual analysis of data and
uses a strawman that I reject having done (while at Intel) such an
analysis while designing a register expression acceleration chip.

So, let me first deal with the strawman.

> None of these rules take precedence over the 10 rules, and so the input is handled by the same rules.

That assumption is not necessary.  Even if the input is distributed
over all 1000 rules, the performance will still be substantially the
same.  This is due to the use of a DFA.  In a DFA, the speed is
constant because the same code handles each state, roughly something
like this.

s := initial state;
forever() do {
   c := new character from input;
   s := new state[c];
   if (s == end of token) return token;
   token :+= c;
  }

On most modern computers, the cost is dominated by the cost to fetch
(via an indirect/indexed access) the next state.  Now, cache effects
can impact this, but most modern caches are large enough that this is
a negligible effect.  Moreover, it is possible to consider the case
where all cache probes are misses and it goes to memory.  Although,
that is significantly slower, it is still constant, and more
importantly the ratio of cache hits to misses is essentially also a
constant ratio.  Only if you are increasing the size of the DFA to the
point where it starts approaching (or exceeding) the cache size does
this become an issue.  That would be the "knee" Kaz mentioned. 1000
tokens is usually not sufficient to do that.

As I mentioned we measured this at Intel, not only for programming
lexers but also for Snort, where the number of regular expressions is
much larger.  We even looked at ClamAV where we considered making our
regular expression engine into a tool to calculate the "signatures"
(essentially performing a cryptographic hash).  Moreover, this goes to
the argument that hashing algorithms are essentially O(1).

Now, theory and practice don't always precisely line up, so in a real
application, one would benchmark to verify this, but over a large
range of regular expression sets, I have found it to hold, having done
that benchmarking.

This leads to another experiment we did at Intel as part of the
Hyperscan software effort.  Hyperscan is generally a Glushkov inspired
NFA engine, but when possible (i.e. the DFAs are small enough) it does
subset construction to turn them into DFAs.  However, when the DFAs
are too large, the construction fails and it falls back to the NFA
engine.  Because DFAs can be significantly faster than NFAs, we wanted
an algorithm that allowed us to split an NFA into a small set  of
parallel DFAs (8 or fewer).  To be practical though, we did not want
the overhead of threads running on multiple cores accessing the same
input data.  That would in many cases have completely eroded the DFA
performance boost.  So, I was tasked with finding a way of running 8
threads in one thread, by interleaving the code of them.   From the
above code and analysis, we know the issue is that next state fetch,
and dependencies of it on the statements before and after it.  Even
with a cache hit, this is longer than the instruction pipeline.  So,
the answer was to move the dependencies away from each other allowing
the fetch to occur and doing other work in the pipeline in the
meantime.  With the help of VTune this was achieved.  We got our 8
parallel DFAs to operate in the same time, it took to execute a single
DFA.  Note that this was achieved without the use of any assembly
language coding.  Our code was still portable C code with no compiler
specific features.

The last point I want to make in this regard is that in most cases,
regular expressions, tend to have a "bushy" part where the code fans
out to a variety of states, but much larger "skinny" parts, where
generally only a few (usually only 1, following a variation on Zipf's
distribution law) possible out going transitions.  This is
particularly easy to see if you are doing something like the
Aho-Corasick algorithm, where the fanout is usually concentrated in
the first few (often 2 or 3) states at the head of the algorithm, the
initial state and its successors.  The regular expression hardware we
built was actually optimized for this property, with a fast dispatch
for the head states and then threads for the tail states.  (If you can
read patent-ese language, you can tease out the details. I found that
challenging despite being the original author of the idea.)

--------

Now, having hopefully covered most of the reasons why with a DFA, the
worry is mostly unfounded.  I want to go over the other case.
Particularly as it applies to mistakes one can make.  This most often
occurs in hand-written parsers and or lexers.

First, I want to note that the LR(0) machine construction algorithm is
essentially a variation on the subset construction algorithm.  That
means that your LR machine essentially has a DFA inside it.  This is
scary to many people as they aren't used to reading state machines,
this is compounded when the tables are "compressed".  However, despite
that the same effect holds.  The running time of an LR parser is still
generally linear in the input length and each state transition is
O(1).

Aside: with GLR parsers and parse forests, this guarantee is somewhat
eroded, but it is usually still acceptable as it is often only eroded
over short ambiguous stretches.  And here is where my understanding of
the theory reaches its boundary.  I think in the worst case, it still
has O(n**3) complexity based upon a German theory paper I attempted
unsuccessfully to read completely.  Naively, it seems that the bound
should be n**2, but I cannot do the calculations to prove that and the
paper seemed to have good evidence for its bound, i.e. that the
forests could grow at n**2 at each new token, rather than just n.

So, how does a naive implementation break this?  Many recursive
descent parsers don't use a computed goto (case statement with an
integer selector) to do the dispatch.  Instead, they use a series of
if statements.  The result is that you have now linearized the lookup
and you need an average of n/2 comparisons to find the next state.
That means the next state lookup is no longer O(1), but O(n).

You sometimes also see this in keyword lookup.  They do so by
comparing the text to a sequence of strings to determine whether the
token is a keyword or not.  Again, you have introduced an O(n) factor
by serializing the comparison.  Now, if the list is not long, this
factor isn't sufficient to impact the total performance, but it is
performance left on-the-table.  The sequence of ifs may be easier to
read than a state machine, but that is their only advantage.  Thus, if
you want to speed your keyword lookup up, use a simple hashing
algorithm so that the hash cost is low, but then you have only a
limited number of keywords (in each bucket) to compare to.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------