Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Subject: Re: Parallel lexers by chunking the input
Date: Sat, 26 Mar 2022 09:27:05 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 92
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-070@comp.compilers>
References: <22-03-058@comp.compilers> <22-03-064@comp.compilers> <22-03-065@comp.compilers> <22-03-067@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="54981"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, parallel
Posted-Date: 26 Mar 2022 10:16:26 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2964

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>[I don't understand where you get only two passes.  If you divide the input
>into arbitrary sized chunks, you could potentially start at any state in the
>scanner and while most of those states would quickly fail, that's a lot of
>startup overhead for each block. -John]

The first pass processes a DFA.  No startup overhead.

Here's a simple example.  For simplicity, consider an input alphabet
consisting only of [a1"]; consider the following lex specification:

"[a1]*"    return T_STRING;
a+         return T_ID;
1+         return T_NUM;

This results in the following transition table (the original state
machine); we start in state B:

current  input char
state      "  a  1
B          C  D  E
C          B  C  C
D          C  D  E
E          C  D  E

On some of these transitions, you get the scanner actions, but that's
not important for the next step:

We now want to produce a continuation automaton that can start in any
of these states and track what the resulting state of the original
automaton would be if we started with a specific one of the original
states; i.e., a state EBCD means that in the original domain we would
map B (first state) to E, C to B, D to C, and E to D; the start state
of this automaton is BCDE (i.e., the identity mapping):

current  input char
state     "    a    1
BCDE     CBCC DCDD ECEE
CBCC     BCBB CDCC CECC
DCDD     CBCC DCDD ECEE
ECEE     CBCC DCDD ECEE
BCBB     CBCC DCDD ECEE
CDCC     BCBB CDCC CECC
CECC     BCBB CDCC CECC

And we have the following mapping from start-original x
end-continuation -> end-original states:

     B C D E
BCDE B C D E
CBCC C B C C
DCDD D C D D
ECEE E C E E
BCBB B C B B
CDCC C D C C
CECC C E C C

In the first pass we use the continuation automaton.  For simplicity,
let's just work with three chunks.  For the first chunk, we know the
original start state, so we actually don't need to use the
continuation automaton, but we need it for the other chunks.  For the
last chunk, we actually don't need the end state of the continuation
automaton (because it's only needed for determining the start
original-state of the next chunk), so we only need to process the
continuation automaton for chunks 2...n-1.  We start these chunks with
the start state BCDE.  So we process the first chunk with the original
automaton (with scanner actions) on core 1 while we process the second
chunk with the continuation automaton on core 2.  In the end we get,
say, the end state C for chunk 1, and the end state CECC for chunk 2.

We can now look up the original end state of chunk 2 from these two
end states: CxCECC->E.  In general, we can continue this up to chunk
n-1.  This is the sequential step.

Now we can process (in parallel) chunk 2 and chunk 3 with the original
automaton (including scanner actions): chunk 2 starts in state C,
while chunk 3 starts in state E.

If you look at the continuation automaton, there is no startup
overhead, no NFA overhead, no need to resync.  You do have the CPU
overhead of running the continuation automaton and the sequential
component in addition to the original automaton; that's the price you
pay for parallelism.

I would be surprised if nobody has come up with this before.  But if
nobody has, maybe I should write up a paper about it.

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/