Groups > comp.compilers > #2949 > unrolled thread

How can the speed of a scanner be independent of the number of rules?

Started by	Roger L Costello <costello@mitre.org>
First post	2022-03-23 18:49 +0000
Last post	2022-03-26 09:27 +0000
Articles	16 — 6 participants

Back to article view | Back to comp.compilers

  How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-23 18:49 +0000
    Re: How can the speed of a scanner be independent of the number of rules? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-03-23 21:27 +0000
      Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-23 19:57 -0700
      Re: How can the speed of a scanner be independent of the number of rules? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 13:12 +0200
      Re: How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-24 11:53 +0000
        Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:56 -0700
          Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-24 13:08 -0700
    Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 02:01 -0700
      Re: Parallel scanning, was How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:18 -0700
        Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 22:32 +0200
          Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 08:04 +0000
            Re: Parallel lexers by chunking the input gah4 <gah4@u.washington.edu> - 2022-03-25 07:58 -0700
            Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 17:47 +0000
              Parallel lexers by chunking the input. Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 00:36 +0200
              Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 01:00 +0200
              Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-26 09:27 +0000

#2949 — How can the speed of a scanner be independent of the number of rules?

From	Roger L Costello <costello@mitre.org>
Date	2022-03-23 18:49 +0000
Subject	How can the speed of a scanner be independent of the number of rules?
Message-ID	<22-03-047@comp.compilers>

Hi Folks,

On page 48 of the Flex manual [1] it says this amazing thing:

Note that adding rules does not slow down the scanner! The speed of the
scanner is independent of the number of rules or (modulo the considerations
given at the beginning of this section) how complicated the rules are with
regard to operators such as '*' and '|'.

That is amazing! And counterintuitive. How can it possibly be that a scanner
containing 1000 rules can operate as fast as a scanner containing 10 rules?
Would you give some intuition to help me understand this, please?

/Roger

[1] https://epaperpress.com/lexandyacc/download/flex.pdf

[Flex compiles the rules into a finite state machine.  When the scanner
runs, it just looks up each character it reads in the table for the current
state to decide what to do.  Creating the state tables for 1000 rules takes
a lot longer than creating the tables for 10 rules, but that just happens
once when you build the scanner, not when it's running.
For more details on regular expressions and state machines, see any compiler
textbook.  It's one of the standrd topics. -John]

[toc] | [next] | [standalone]

#2950

From	Kaz Kylheku <480-992-1380@kylheku.com>
Date	2022-03-23 21:27 +0000
Message-ID	<22-03-048@comp.compilers>
In reply to	#2949

On 2022-03-23, Roger L Costello <costello@mitre.org> wrote:
> Hi Folks,
>
> On page 48 of the Flex manual [1] it says this amazing thing:
>
> Note that adding rules does not slow down the scanner! The speed of the
> scanner is independent of the number of rules or (modulo the considerations
> given at the beginning of this section) how complicated the rules are with
> regard to operators such as '*' and '|'.
>
> That is amazing! And counterintuitive. How can it possibly be that a scanner
> containing 1000 rules can operate as fast as a scanner containing 10 rules?
> Would you give some intuition to help me understand this, please?

To understand this, we must firstly understand the basic assumption of
the comparison: that the input test case presented to both versions of
the scanner is exactly the same. Of course, otherwise it's apples and
oranges, right?

So then, suppose we have some existing scanner with 10 rules, which
correctly tokenizes an input, Then suppose we add 990 rules to it. None
of these rules take precedence over the 10 rules, and so the the input
is handled by the same rules.

The result is that although the new scanner has a larger automaton with
more states, the state space being traversed in response to the original
input test case is still the same: it's traversing the same state
transitions coming from the ten rules.

Of course, a language that requires 1000 tokenizing rules will be slower
to handle if the average input test cases actually exercise most of the
rules: i.e. the instances of the language that actually occur do trigger
hundreds of different rules.

In terms of theoretical computer science, it cannot be true that there
is no slowdown regardless of the number of rules added. This is because
the rules are compiled into tables, and tables are indexed by integers.
Integers have to get wider (more bits) with increasing table size.

In performance testing on real macines, we are shielded from this effect
in situations in which we have nothing but 32 or 64 bit integers and
pointers regardless of the test size (which never exceeds the capacity
of the equipment).

There is another thing to consider which is that the claimed property
is true because of the compilatio technology used for handling the Lex
patterns: they are compiled to a DFA.

Lex more or less combines all of the pattern rules into a single giant
regex, as if by a kind of disjunction operator: R1|R2|R3|...

If NFA simulation were being used to run the giant regex, then it would
indeed get slower due to more rules. The reason is that the states of
a NFA simulator consist of *sets* of states of the NFA graph.
The more NFA states that are activated by a given
input character, the the more NFA states the simulator has to stuff into
the set object that represents a simulation state.

For instance suppose three of your 10 rules match a token starting
with a.  When the first input symbol of the input is a, then the
NFA simulator has be in three NFA states at the same time, corresponding
to the NFA graph's branches into the three sub-graphs for the tree
rules. The simulator adds those three states to a set which represents
its state. If you add a hundred rules that can match starting with a,
then now you have 103 things that have to go into that set.

But lex does this all this set arithmetic at compile time by converting
NFA to DFA.   The "subset construction" algorithm identifies sets of NFA
states that are activated by input symbols, and turns them into simple
states (that can be given integer indices or whatever). The identity of
a DFA state that represents an NFA being in 500 states simultaneously is
no larger than that of a DFA state representing the NFA being in 3 states.

But, here is where should be a little suspicious about the claim.
The DFA states from a Lex job that has more rules may have more
transitions out of them. There could be some caching effect there
that robs a little performance. The original 10 rules are not
necessarily going to be co-located in the a compact area of memory which
caches as well as it did before.

Integration of the lexer into the larger program can make a difference.

If the lexer is in a test case that does nothing but discard tokens,
it may be that even though the 1000 rule lexer has a bigger cache
footprint, it doesn't matter.

But now add a parser in there with its own state spaces to traverse, and
the combined cache footprint now goes over some "knee".

Basically, this is something that really needs to be tested, and
preferrably by a researcher who knows how to contrive the rules both
to tickle the worst cases out of Lex, as well as cases that are
representative of "real world" use, and report on both of these.

The claim that Lex doesn't get slower with more rules could well succumb
to some pathological cases and thus shown not to be literally true.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[You are right that if it's dividing input into more tokens, it'll
be slower.  But imagine I'm writing a COBOL lexer, and the 990 new
rules are all literal keywords, so it's the same number of tokens,
just moving the keyword recognition into the DFA instead of recognizing
an identifier-or-keyword string and checking the keywords
outside the lexer.  I'd think the lexer speed would be the same, maybe
even a little faster depending on how fast the other keyword checker
was. -John]

[toc] | [prev] | [next] | [standalone]

#2951

From	gah4 <gah4@u.washington.edu>
Date	2022-03-23 19:57 -0700
Message-ID	<22-03-050@comp.compilers>
In reply to	#2950

On Wednesday, March 23, 2022 at 2:49:30 PM UTC-7, Kaz Kylheku wrote:
> On 2022-03-23, Roger L Costello <cost...@mitre.org> wrote:

(snip)
> > Note that adding rules does not slow down the scanner! The speed of the
> > scanner is independent of the number of rules or (modulo the considerations
> > given at the beginning of this section) how complicated the rules are with
> > regard to operators such as '*' and '|'.

(snip)

> In terms of theoretical computer science, it cannot be true that there
> is no slowdown regardless of the number of rules added. This is because
> the rules are compiled into tables, and tables are indexed by integers.
> Integers have to get wider (more bits) with increasing table size.

Yes, but 32 bit integers will get a huge number of states.

(snip)

> If the lexer is in a test case that does nothing but discard tokens,
> it may be that even though the 1000 rule lexer has a bigger cache
> footprint, it doesn't matter.

Yes, cache is the complication of just about any speed comparison.
And in this case, it depends not only on the scanner, but could be
sensitive to the actual input data.

It would seem that you could do the comparison based on the same
scanner, but using either UTF-8 or UTF-16 coded characters.

That would be closer to an apple vs. apple comparison.

[toc] | [prev] | [next] | [standalone]

#2953

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2022-03-24 13:12 +0200
Message-ID	<22-03-054@comp.compilers>
In reply to	#2950

I see that Kaz Kylheku has written a thought out and well-reasoned
response. However, it isn't based upon actual analysis of data and
uses a strawman that I reject having done (while at Intel) such an
analysis while designing a register expression acceleration chip.

So, let me first deal with the strawman.

> None of these rules take precedence over the 10 rules, and so the input is handled by the same rules.

That assumption is not necessary. Even if the input is distributed
over all 1000 rules, the performance will still be substantially the
same. This is due to the use of a DFA. In a DFA, the speed is
constant because the same code handles each state, roughly something
like this.

s := initial state;
forever() do {
c := new character from input;
s := new state[c];
if (s == end of token) return token;
token :+= c;
}

On most modern computers, the cost is dominated by the cost to fetch
(via an indirect/indexed access) the next state. Now, cache effects
can impact this, but most modern caches are large enough that this is
a negligible effect. Moreover, it is possible to consider the case
where all cache probes are misses and it goes to memory. Although,
that is significantly slower, it is still constant, and more
importantly the ratio of cache hits to misses is essentially also a
constant ratio. Only if you are increasing the size of the DFA to the
point where it starts approaching (or exceeding) the cache size does
this become an issue. That would be the "knee" Kaz mentioned. 1000
tokens is usually not sufficient to do that.

As I mentioned we measured this at Intel, not only for programming
lexers but also for Snort, where the number of regular expressions is
much larger. We even looked at ClamAV where we considered making our
regular expression engine into a tool to calculate the "signatures"
(essentially performing a cryptographic hash). Moreover, this goes to
the argument that hashing algorithms are essentially O(1).

Now, theory and practice don't always precisely line up, so in a real
application, one would benchmark to verify this, but over a large
range of regular expression sets, I have found it to hold, having done
that benchmarking.

This leads to another experiment we did at Intel as part of the
Hyperscan software effort. Hyperscan is generally a Glushkov inspired
NFA engine, but when possible (i.e. the DFAs are small enough) it does
subset construction to turn them into DFAs. However, when the DFAs
are too large, the construction fails and it falls back to the NFA
engine. Because DFAs can be significantly faster than NFAs, we wanted
an algorithm that allowed us to split an NFA into a small set of
parallel DFAs (8 or fewer). To be practical though, we did not want
the overhead of threads running on multiple cores accessing the same
input data. That would in many cases have completely eroded the DFA
performance boost. So, I was tasked with finding a way of running 8
threads in one thread, by interleaving the code of them. From the
above code and analysis, we know the issue is that next state fetch,
and dependencies of it on the statements before and after it. Even
with a cache hit, this is longer than the instruction pipeline. So,
the answer was to move the dependencies away from each other allowing
the fetch to occur and doing other work in the pipeline in the
meantime. With the help of VTune this was achieved. We got our 8
parallel DFAs to operate in the same time, it took to execute a single
DFA. Note that this was achieved without the use of any assembly
language coding. Our code was still portable C code with no compiler
specific features.

The last point I want to make in this regard is that in most cases,
regular expressions, tend to have a "bushy" part where the code fans
out to a variety of states, but much larger "skinny" parts, where
generally only a few (usually only 1, following a variation on Zipf's
distribution law) possible out going transitions. This is
particularly easy to see if you are doing something like the
Aho-Corasick algorithm, where the fanout is usually concentrated in
the first few (often 2 or 3) states at the head of the algorithm, the
initial state and its successors. The regular expression hardware we
built was actually optimized for this property, with a fast dispatch
for the head states and then threads for the tail states. (If you can
read patent-ese language, you can tease out the details. I found that
challenging despite being the original author of the idea.)

--------

Now, having hopefully covered most of the reasons why with a DFA, the
worry is mostly unfounded. I want to go over the other case.
Particularly as it applies to mistakes one can make. This most often
occurs in hand-written parsers and or lexers.

First, I want to note that the LR(0) machine construction algorithm is
essentially a variation on the subset construction algorithm. That
means that your LR machine essentially has a DFA inside it. This is
scary to many people as they aren't used to reading state machines,
this is compounded when the tables are "compressed". However, despite
that the same effect holds. The running time of an LR parser is still
generally linear in the input length and each state transition is
O(1).

Aside: with GLR parsers and parse forests, this guarantee is somewhat
eroded, but it is usually still acceptable as it is often only eroded
over short ambiguous stretches. And here is where my understanding of
the theory reaches its boundary. I think in the worst case, it still
has O(n**3) complexity based upon a German theory paper I attempted
unsuccessfully to read completely. Naively, it seems that the bound
should be n**2, but I cannot do the calculations to prove that and the
paper seemed to have good evidence for its bound, i.e. that the
forests could grow at n**2 at each new token, rather than just n.

So, how does a naive implementation break this? Many recursive
descent parsers don't use a computed goto (case statement with an
integer selector) to do the dispatch. Instead, they use a series of
if statements. The result is that you have now linearized the lookup
and you need an average of n/2 comparisons to find the next state.
That means the next state lookup is no longer O(1), but O(n).

You sometimes also see this in keyword lookup. They do so by
comparing the text to a sequence of strings to determine whether the
token is a keyword or not. Again, you have introduced an O(n) factor
by serializing the comparison. Now, if the list is not long, this
factor isn't sufficient to impact the total performance, but it is
performance left on-the-table. The sequence of ifs may be easier to
read than a state machine, but that is their only advantage. Thus, if
you want to speed your keyword lookup up, use a simple hashing
algorithm so that the hash cost is low, but then you have only a
limited number of keywords (in each bucket) to compare to.

--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#2954

From	Roger L Costello <costello@mitre.org>
Date	2022-03-24 11:53 +0000
Message-ID	<22-03-056@comp.compilers>
In reply to	#2950

Kaz Kylheku wrote:

> suppose we have some existing scanner with 10 rules,
> which correctly tokenizes an input. Then suppose we
> add 990 rules to it. None of these rules take precedence
> over the 10 rules, and so the the input is handled by the
> same rules.

Ouch!!!

Such a letdown. So the statement "adding rules does not slow down the scanner"
really isn't remarkable or awesome. Add 990 more irrelevant rules, and the
scanner operates just as fast. Big deal.

Thanks Kaz.

/Roger
[But see other messages -- adding 990 more relevant rules doesn't slow it down either. -John]

[toc] | [prev] | [next] | [standalone]

#2956

From	Jan Ziak <0xe2.0x9a.0x9b@gmail.com>
Date	2022-03-24 11:56 -0700
Message-ID	<22-03-059@comp.compilers>
In reply to	#2954

On Thursday, March 24, 2022 at 6:39:31 PM UTC+1, Roger L Costello wrote:
> Kaz Kylheku wrote:
>
> > suppose we have some existing scanner with 10 rules,
> > which correctly tokenizes an input. Then suppose we
> > add 990 rules to it. None of these rules take precedence
> > over the 10 rules, and so the the input is handled by the
> > same rules.
>
> Ouch!!!
>
> Such a letdown. So the statement "adding rules does not slow down the scanner"
> really isn't remarkable or awesome. Add 990 more irrelevant rules, and the
> scanner operates just as fast. Big deal.

That is, in my opinion, an invalid way of looking at scanning, because it
isn't computer science. A better way (I am not saying "the only right way") is
for example to compare the size of integers, the size of a memory address,
that a typical year-2022 CPU can process/access/read/write/lookup using in a
single CPU instruction. The fact is that the size of those integers and
addresses (such as: 32 bits, 64 bits) is extremely large compared to the
number of rules and complexity of any grammar that a human is able to directly
(by hand, by means of keyboard/mouse/touchscreen) enter into a computer. And,
creating a program just to generate or download an extremely large complex
grammar, containing quantities exceeding the capabilities of a single CPU
instruction and exceeding the size of memory installed in a typical year-2022
computer, is pointless.

-atom

[toc] | [prev] | [next] | [standalone]

#2957

From	gah4 <gah4@u.washington.edu>
Date	2022-03-24 13:08 -0700
Message-ID	<22-03-062@comp.compilers>
In reply to	#2956

On Thursday, March 24, 2022 at 12:19:52 PM UTC-7, Jan Ziak wrote:

(snip)

> That is, in my opinion, an invalid way of looking at scanning, because it
> isn't computer science. A better way (I am not saying "the only right way") is
> for example to compare the size of integers, the size of a memory address,
> that a typical year-2022 CPU can process/access/read/write/lookup using in a
> single CPU instruction.

This distinction comes up more often for sorting algorithms,
but I suppose it applies here, too.

If one wants to sort a gigabyte of bytes, that is elements of length
one (8 bit) byte, it is faster to count them, and then generate an output
file with the appropriate number of each byte.

Depending on how you do the measurement, that violates some rule
on sorting algorithms.

> The fact is that the size of those integers and
> addresses (such as: 32 bits, 64 bits) is extremely large compared to the
> number of rules and complexity of any grammar that a human is able to directly
> (by hand, by means of keyboard/mouse/touchscreen) enter into a computer.

Some years ago, I was working on using DFA for DNA sequence comparison.

I did, at least, do some tests with millions of states, maybe tens of
millions.  (The alphabet size is four, so you can fit a lot of states in memory.)

But yes, I didn't try to enter the states using a keyboard, mouse,
or other human interface device.

Reminds me, though, of some years ago, close to the time I was working
on that one, it was said that much of the DNA sequences in GenBank were
obtained by OCR scanning the pages of journal articles.  The fraction that
were obtained that way kept decreasing, but the number was increasing.
(Like Moore's law, the size of the GenBank increases exponentially,
last I knew at 1%/week.)

[toc] | [prev] | [next] | [standalone]

#2952

From	Jan Ziak <0xe2.0x9a.0x9b@gmail.com>
Date	2022-03-24 02:01 -0700
Message-ID	<22-03-052@comp.compilers>
In reply to	#2949

On Wednesday, March 23, 2022 at 8:24:44 PM UTC+1, Roger L Costello wrote:
> Hi Folks,
>
> On page 48 of the Flex manual [1] it says this amazing thing:
>
> Note that adding rules does not slow down the scanner! The speed of the
> scanner is independent of the number of rules or (modulo the considerations
> given at the beginning of this section) how complicated the rules are with
> regard to operators such as '*' and '|'.
>
> That is amazing! And counterintuitive. How can it possibly be that a scanner
> containing 1000 rules can operate as fast as a scanner containing 10 rules?
> Would you give some intuition to help me understand this, please?
>
> /Roger
>
> [1] https://epaperpress.com/lexandyacc/download/flex.pdf
>
> [Flex compiles the rules into a finite state machine. When the scanner
> runs, it just looks up each character it reads in the table for the current
> state to decide what to do. Creating the state tables for 1000 rules takes
> a lot longer than creating the tables for 10 rules, but that just happens
> once when you build the scanner, not when it's running.
> For more details on regular expressions and state machines, see any compiler
> textbook. It's one of the standard topics. -John]

I am not sure what answer you are expecting in respect to the complexity of
the answer, or with which viewpoint you are the most comfortable with. There
are many (more than just the 2 answers mentioned below) answers to the posed
question.

(1) Flex is performing an optimization that is similar to common subexpression
elimination (CSE). For example, conceptually, two scanner rules written using
BASIC-like notation:

10: IF in[p]='a' AND in[p+1]='b' THEN { p+=2; GOTO 20 }
10: IF in[p]='a' AND in[p+1]='c' THEN { p+=2; GOTO 30 }
....

are rewritten/converted by Flex into:

10: IF in[p]='a' THEN { p+=1; GOTO 11 }
11: IF in[p]='b' THEN { p+=1; GOTO 20 }
11: IF in[p]='c' THEN { p+=1; GOTO 30 }
....

And, as can be easily seen from the above example, the originally
non-deterministic line "10" has been converted into a single (deterministic)
line of code, which is a basic principle of how Flex is operating when
generating a scanner.

(2) Flex is old and was designed for single-threaded CPUs. It is unable to
generate a multi-threaded scanner. It is for example obvious that the language
(ab)* benefits from multi-core optimizations in case of sufficiently long
inputs. Note that this viewpoint is in direct contradiction to the
"old-school" viewpoint presented by John (in the square brackets in the text
cited above).

-atom
[Is there anything published about parallel scanning?  I'd think it'd be
inherently sequential since you don't know the state for a character
until you've processed all the previous characters. -John]

[toc] | [prev] | [next] | [standalone]

#2955 — Re: Parallel scanning, was How can the speed of a scanner be independent of the number of rules?

From	Jan Ziak <0xe2.0x9a.0x9b@gmail.com>
Date	2022-03-24 11:18 -0700
Subject	Re: Parallel scanning, was How can the speed of a scanner be independent of the number of rules?
Message-ID	<22-03-058@comp.compilers>
In reply to	#2952

On Thursday, March 24, 2022 at 6:26:21 PM UTC+1, Jan Ziak wrote:
> [Is there anything published about parallel scanning? I'd think it'd be
> inherently sequential since you don't know the state for a character
> until you've processed all the previous characters. -John]

The belief that "scanning is inherently sequential since you don't
know the state for a character until you've processed all the previous
characters" is false for most programming languages (BASIC, Go, XML,
etc) for most real-world source codes.

-atom
[Well, OK, if I'm scanning XML and break it up into chunks to scan in parallel, how
do I know whether I'm inside a CDATA block?  Or are you just saying you can do
clumps of a few characters at a time?  Again, some sort of reference rather than
just claiming it's possible would be a help. -John]

[toc] | [prev] | [next] | [standalone]

#2958 — Parallel lexers by chunking the input

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2022-03-24 22:32 +0200
Subject	Parallel lexers by chunking the input
Message-ID	<22-03-064@comp.compilers>
In reply to	#2955

Under certain very limited conditions you can parallelize your lexer
by chunking the inputs, but the number of cases for which that breaks
make it a rather useless endeavor.  At Intel we had Charlie Lasswell
and Michela Becchi look into the problem from two different angles.
Michela has some excellent papers about dealing with ".*" and related
constructs, especially overlapping ".{m,n}" constructs.

But, the real issue is strings, comments, and markdown like features.
They all suffer from the same issue, which is that they are generally
unbounded strings that start and stop with very short sequences.  And,
by picking some arbitrary point in the future to start scanning, you
cannot be sure whether you are inside or outside one of those
features.  This, of course, also makes for very problematic error
messages.  Leave out a single character (e.g. an opening or closing
quote) and text can be entirely misinterpreted.

One common attempt at mitigating that problem is limiting such
sequences to a single line.  However, the result of that is the need
for "raw" strings where you can
have long strings that span multiple lines.

Like

```
code goes in here
and can include `, ', and " marks where appropriate
and it doesn't end until one sees
```

of maybe one prefers

#" to delimit a raw string
   allowing embedded " and ' marks
   and maybe the processor throws away indentation within it
   and trailing whitespace at the end of a line
   but we still are looking for the terminating
   and ``` here isn't special either
"#

And, before you think that's a ridiculous example, let me tell you a
very real use of it.  I'm building a compiler and in particular unit
tests for it.  So, I need to have strings that represent significant
fragments of real programs and other strings, the "golden" results,
that capture json output of the internal structure which I cut and
paste from dumps of it as a string.  Both of these fragments
(especially the latter) can be multiple lines long and can look like
code (the first actually are code) but which I don't want parsed as
such.   A parallelized lexer could easily attempt to tokenize those
fragments if it picked a bad starting point.  That speculative lexing
would actually make the process slower.

Preston Briggs was right when he talked about regular expressions
being an "embarrassingly serial application".

In fact, I have numerous times thought about applying Boyer-Moore like
algorithms to lexing and parsing, and then remembering these cases put
it down as technically feasible but practically useless. About the
only way to use parallelism to speed up lexing is if one has separate
files being processed and rules which keep tokens from changing state
over a file boundary.  Now, that last requirement isn't as hard as it
seems, as usually "include" files can only get recognized at
token boundaries.  Still the speedup therefrom is not that useful.
There are other better solutions to that problem.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#2959 — Re: Parallel lexers by chunking the input

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2022-03-25 08:04 +0000
Subject	Re: Parallel lexers by chunking the input
Message-ID	<22-03-065@comp.compilers>
In reply to	#2958

Christopher F Clark <christopher.f.clark@compiler-resources.com> writes:
>But, the real issue is strings, comments, and markdown like features.
>They all suffer from the same issue, which is that they are generally
>unbounded strings that start and stop with very short sequences.  And,
>by picking some arbitrary point in the future to start scanning, you
>cannot be sure whether you are inside or outside one of those
>features.

Yes.  But you can represent this uncertainty in the state of the DFA.
Essentially you can take the original DFA, and make all states of that
(nondeterministic) start states of our continuation DFA.  Process the
pieces of the input in parallel with the continuation automaton.

Now you know the end state of the original automaton the
first piece, and the end state of the continuation automaton of the
second piece; you can look up the end state of the original automaton
for the second piece from these informations.  And so on for the
further pieces.  This is a sequential component, but it is fast.

Now you can reprocess the second-to-last pieces in parallel using the
original automaton and performing the actions (e.g., converting
numbers into binary representation).

Not sure how to interface that to the parser, but if we have a similar
idea for parallelizing the parser, even the lex/flex interface might
be ok.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[Experiments would be useful here.  Yes, you can do this, but do you
end up throwing away so much work that it's not worth it? -John]

[toc] | [prev] | [next] | [standalone]

#2960 — Re: Parallel lexers by chunking the input

From	gah4 <gah4@u.washington.edu>
Date	2022-03-25 07:58 -0700
Subject	Re: Parallel lexers by chunking the input
Message-ID	<22-03-066@comp.compilers>
In reply to	#2959

On Friday, March 25, 2022 at 4:45:41 AM UTC-7, Anton Ertl wrote:

(snip)

> Yes. But you can represent this uncertainty in the state of the DFA.
> Essentially you can take the original DFA, and make all states of that
> (nondeterministic) start states of our continuation DFA. Process the
> pieces of the input in parallel with the continuation automaton.

At some point, it is Aho-Corasick algorithm.

That works well if you are looking for a large number of things,
and you don't know where they start.  The DFA needs one bit,
in addition to the next state indicator, to indicate that you
found something.  A favorite implementation is the low bit
of a pointer on a byte addressed system, which will normally
be zero.

I suspect this might be used for malware search programs,
which look for any of a large list of matching strings though a
large file.  Not quite as useful for programming language
scanners, though.

[toc] | [prev] | [next] | [standalone]

#2961 — Re: Parallel lexers by chunking the input

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2022-03-25 17:47 +0000
Subject	Re: Parallel lexers by chunking the input
Message-ID	<22-03-067@comp.compilers>
In reply to	#2959

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
...
>[Experiments would be useful here.  Yes, you can do this, but do you
>end up throwing away so much work that it's not worth it? -John]

There is no throwing away, but there are two parallelized passes
through the input (compared to one for the sequential version), plus
the sequential component (should be comparatively small if the chunks
are large), plus maybe some additional overhead for the interface to
the parser.  Still, if you run the scanner on an 8-core CPU, I would
expect a nice real-time speedup (but CPU-time slowdown) for scanning a single input.

However, for typical C/C++ compilation jobs, there often is enough
parallelism from "make -j", so you don't want a scanner that takes
more CPU time.

For load-and-go systems, this kind of parallelization may be more
appropriate, but often there is some interleaving with semantic stuff
that may prevent having the complete scanner run before the rest, and
thus prevent parallelizing the scanner.

@gah4: I don't think that the Aho-Corasick algorithm is particularly
relevant.  It seems to be a less general variant of what lex/flex
does, and is completely sequential in any case.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[I don't understand where you get only two passes.  If you divide the input
into arbitrary sized chunks, you could potentially start at any state in the
scanner and while most of those states would quickly fail, that's a lot of
startup overhead for each block. -John]

[toc] | [prev] | [next] | [standalone]

#2962 — Parallel lexers by chunking the input.

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2022-03-26 00:36 +0200
Subject	Parallel lexers by chunking the input.
Message-ID	<22-03-068@comp.compilers>
In reply to	#2961

For the record, Hyperscan does do that, but I think only in NFA mode.
(That wasn't my part of the code, so I don't know the details.)
However, it does figure out states where you can scan far ahead and
then patch up.  As gah4 mentions, this works relatively well in
applications like Snort, where you are looking for regular expressions
that might start anywhere, but you mostly don't expect to
match--that's what Hyperscan is optimized for.  And, Michela Becchi's
papers show ways of doing that reliably with DFAs also.

And, yes, Anton, you can express that uncertainty, but there are
multiple factors involved.

An important one is the number of unbounded token types you have.
Those often add together combinatorially.  So, if you have two
different forms of comments, plus a set of preprocessor directives,
plus strings, you might have 64 different states you need to consider.

However, in the case that interests me (currently) which is compiler
unit tests, I have a different factor.  The amount of code compared to
the amount of unbounded tokens is small.  Each test case has about
10-20 lines of mostly boilerplate code, but within that are two
fragments of unbounded tokens which look like code but aren't, they
are the input and expected output.  The input is often 30-50 lines
(e.g. one of the TPC benchmarks) and the expected output, the json
representation of the IR, is often 100-200 lines and each of those has
an average of 5 sequences of characters that would parse as a token,
but isn't one.  So, any arbitrary starting point, is most likely in an
unbounded token but in text that looks like tokens and could easily
queue up 100s of tokens that aren't tokens.  The better solution is to
put one test per file and not try splitting the file into smaller
chunks.  That has a different set of issues, but it is better from the
lexing perspective.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#2963 — Parallel lexers by chunking the input

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2022-03-26 01:00 +0200
Subject	Parallel lexers by chunking the input
Message-ID	<22-03-069@comp.compilers>
In reply to	#2961

The relevance of the AC algorithm is it roughly tells you how to
resync when you have found a suffix of pattern.  If I remember
correctly, Commentz-Walter is even closer to what you want.  I feel
like someone has already extended this to work with regular
expressions.  If not, it isn't that hard to work out.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#2964 — Re: Parallel lexers by chunking the input

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2022-03-26 09:27 +0000
Subject	Re: Parallel lexers by chunking the input
Message-ID	<22-03-070@comp.compilers>
In reply to	#2961

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>[I don't understand where you get only two passes.  If you divide the input
>into arbitrary sized chunks, you could potentially start at any state in the
>scanner and while most of those states would quickly fail, that's a lot of
>startup overhead for each block. -John]

The first pass processes a DFA.  No startup overhead.

Here's a simple example.  For simplicity, consider an input alphabet
consisting only of [a1"]; consider the following lex specification:

"[a1]*"    return T_STRING;
a+         return T_ID;
1+         return T_NUM;

This results in the following transition table (the original state
machine); we start in state B:

current  input char
state      "  a  1
B          C  D  E
C          B  C  C
D          C  D  E
E          C  D  E

On some of these transitions, you get the scanner actions, but that's
not important for the next step:

We now want to produce a continuation automaton that can start in any
of these states and track what the resulting state of the original
automaton would be if we started with a specific one of the original
states; i.e., a state EBCD means that in the original domain we would
map B (first state) to E, C to B, D to C, and E to D; the start state
of this automaton is BCDE (i.e., the identity mapping):

current  input char
state     "    a    1
BCDE     CBCC DCDD ECEE
CBCC     BCBB CDCC CECC
DCDD     CBCC DCDD ECEE
ECEE     CBCC DCDD ECEE
BCBB     CBCC DCDD ECEE
CDCC     BCBB CDCC CECC
CECC     BCBB CDCC CECC

And we have the following mapping from start-original x
end-continuation -> end-original states:

     B C D E
BCDE B C D E
CBCC C B C C
DCDD D C D D
ECEE E C E E
BCBB B C B B
CDCC C D C C
CECC C E C C

In the first pass we use the continuation automaton.  For simplicity,
let's just work with three chunks.  For the first chunk, we know the
original start state, so we actually don't need to use the
continuation automaton, but we need it for the other chunks.  For the
last chunk, we actually don't need the end state of the continuation
automaton (because it's only needed for determining the start
original-state of the next chunk), so we only need to process the
continuation automaton for chunks 2...n-1.  We start these chunks with
the start state BCDE.  So we process the first chunk with the original
automaton (with scanner actions) on core 1 while we process the second
chunk with the continuation automaton on core 2.  In the end we get,
say, the end state C for chunk 1, and the end state CECC for chunk 2.

We can now look up the original end state of chunk 2 from these two
end states: CxCECC->E.  In general, we can continue this up to chunk
n-1.  This is the sequential step.

Now we can process (in parallel) chunk 2 and chunk 3 with the original
automaton (including scanner actions): chunk 2 starts in state C,
while chunk 3 starts in state E.

If you look at the continuation automaton, there is no startup
overhead, no NFA overhead, no need to resync.  You do have the CPU
overhead of running the continuation automaton and the sequential
component in addition to the original automaton; that's the price you
pay for parallelism.

I would be surprised if nobody has come up with this before.  But if
nobody has, maybe I should write up a paper about it.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

[toc] | [prev] | [standalone]

csiph-web

How can the speed of a scanner be independent of the number of rules?

Contents

#2949 — How can the speed of a scanner be independent of the number of rules?

#2950

#2951

#2953

#2954

#2956

#2957

#2952

#2955 — Re: Parallel scanning, was How can the speed of a scanner be independent of the number of rules?

#2958 — Parallel lexers by chunking the input

#2959 — Re: Parallel lexers by chunking the input

#2960 — Re: Parallel lexers by chunking the input

#2961 — Re: Parallel lexers by chunking the input

#2962 — Parallel lexers by chunking the input.

#2963 — Parallel lexers by chunking the input

#2964 — Re: Parallel lexers by chunking the input