Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: anton@mips.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.compilers Subject: Re: Parallel lexers by chunking the input Date: Sat, 26 Mar 2022 09:27:05 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Lines: 92 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-03-070@comp.compilers> References: <22-03-058@comp.compilers> <22-03-064@comp.compilers> <22-03-065@comp.compilers> <22-03-067@comp.compilers> Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="54981"; mail-complaints-to="abuse@iecc.com" Keywords: lex, parallel Posted-Date: 26 Mar 2022 10:16:26 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2964 anton@mips.complang.tuwien.ac.at (Anton Ertl) writes: >[I don't understand where you get only two passes. If you divide the input >into arbitrary sized chunks, you could potentially start at any state in the >scanner and while most of those states would quickly fail, that's a lot of >startup overhead for each block. -John] The first pass processes a DFA. No startup overhead. Here's a simple example. For simplicity, consider an input alphabet consisting only of [a1"]; consider the following lex specification: "[a1]*" return T_STRING; a+ return T_ID; 1+ return T_NUM; This results in the following transition table (the original state machine); we start in state B: current input char state " a 1 B C D E C B C C D C D E E C D E On some of these transitions, you get the scanner actions, but that's not important for the next step: We now want to produce a continuation automaton that can start in any of these states and track what the resulting state of the original automaton would be if we started with a specific one of the original states; i.e., a state EBCD means that in the original domain we would map B (first state) to E, C to B, D to C, and E to D; the start state of this automaton is BCDE (i.e., the identity mapping): current input char state " a 1 BCDE CBCC DCDD ECEE CBCC BCBB CDCC CECC DCDD CBCC DCDD ECEE ECEE CBCC DCDD ECEE BCBB CBCC DCDD ECEE CDCC BCBB CDCC CECC CECC BCBB CDCC CECC And we have the following mapping from start-original x end-continuation -> end-original states: B C D E BCDE B C D E CBCC C B C C DCDD D C D D ECEE E C E E BCBB B C B B CDCC C D C C CECC C E C C In the first pass we use the continuation automaton. For simplicity, let's just work with three chunks. For the first chunk, we know the original start state, so we actually don't need to use the continuation automaton, but we need it for the other chunks. For the last chunk, we actually don't need the end state of the continuation automaton (because it's only needed for determining the start original-state of the next chunk), so we only need to process the continuation automaton for chunks 2...n-1. We start these chunks with the start state BCDE. So we process the first chunk with the original automaton (with scanner actions) on core 1 while we process the second chunk with the continuation automaton on core 2. In the end we get, say, the end state C for chunk 1, and the end state CECC for chunk 2. We can now look up the original end state of chunk 2 from these two end states: CxCECC->E. In general, we can continue this up to chunk n-1. This is the sequential step. Now we can process (in parallel) chunk 2 and chunk 3 with the original automaton (including scanner actions): chunk 2 starts in state C, while chunk 3 starts in state E. If you look at the continuation automaton, there is no startup overhead, no NFA overhead, no need to resync. You do have the CPU overhead of running the continuation automaton and the sequential component in addition to the original automaton; that's the price you pay for parallelism. I would be surprised if nobody has come up with this before. But if nobody has, maybe I should write up a paper about it. - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/