Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Parallel lexers by chunking the input.
Date: Sat, 26 Mar 2022 00:36:40 +0200
Organization: Compilers Central
Lines: 40
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-068@comp.compilers>
References: <22-03-058@comp.compilers> <22-03-064@comp.compilers> <22-03-065@comp.compilers> <22-03-067@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="55327"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance, parallel
Posted-Date: 25 Mar 2022 18:51:59 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2962

For the record, Hyperscan does do that, but I think only in NFA mode.
(That wasn't my part of the code, so I don't know the details.)
However, it does figure out states where you can scan far ahead and
then patch up.  As gah4 mentions, this works relatively well in
applications like Snort, where you are looking for regular expressions
that might start anywhere, but you mostly don't expect to
match--that's what Hyperscan is optimized for.  And, Michela Becchi's
papers show ways of doing that reliably with DFAs also.

And, yes, Anton, you can express that uncertainty, but there are
multiple factors involved.

An important one is the number of unbounded token types you have.
Those often add together combinatorially.  So, if you have two
different forms of comments, plus a set of preprocessor directives,
plus strings, you might have 64 different states you need to consider.

However, in the case that interests me (currently) which is compiler
unit tests, I have a different factor.  The amount of code compared to
the amount of unbounded tokens is small.  Each test case has about
10-20 lines of mostly boilerplate code, but within that are two
fragments of unbounded tokens which look like code but aren't, they
are the input and expected output.  The input is often 30-50 lines
(e.g. one of the TPC benchmarks) and the expected output, the json
representation of the IR, is often 100-200 lines and each of those has
an average of 5 sequences of characters that would parse as a token,
but isn't one.  So, any arbitrary starting point, is most likely in an
unbounded token but in text that looks like tokens and could easily
queue up 100s of tokens that aren't tokens.  The better solution is to
put one test per file and not try splitting the file into smaller
chunks.  That has a different set of issues, but it is better from the
lexing perspective.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------