Parallel lexers by chunking the input.

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups	comp.compilers
Subject	Parallel lexers by chunking the input.
Date	2022-03-26 00:36 +0200
Organization	Compilers Central
Message-ID	<22-03-068@comp.compilers> (permalink)
References	<22-03-058@comp.compilers> <22-03-064@comp.compilers> <22-03-065@comp.compilers> <22-03-067@comp.compilers>

Show all headers | View raw

For the record, Hyperscan does do that, but I think only in NFA mode.
(That wasn't my part of the code, so I don't know the details.)
However, it does figure out states where you can scan far ahead and
then patch up.  As gah4 mentions, this works relatively well in
applications like Snort, where you are looking for regular expressions
that might start anywhere, but you mostly don't expect to
match--that's what Hyperscan is optimized for.  And, Michela Becchi's
papers show ways of doing that reliably with DFAs also.

And, yes, Anton, you can express that uncertainty, but there are
multiple factors involved.

An important one is the number of unbounded token types you have.
Those often add together combinatorially.  So, if you have two
different forms of comments, plus a set of preprocessor directives,
plus strings, you might have 64 different states you need to consider.

However, in the case that interests me (currently) which is compiler
unit tests, I have a different factor.  The amount of code compared to
the amount of unbounded tokens is small.  Each test case has about
10-20 lines of mostly boilerplate code, but within that are two
fragments of unbounded tokens which look like code but aren't, they
are the input and expected output.  The input is often 30-50 lines
(e.g. one of the TPC benchmarks) and the expected output, the json
representation of the IR, is often 100-200 lines and each of those has
an average of 5 sequences of characters that would parse as a token,
but isn't one.  So, any arbitrary starting point, is most likely in an
unbounded token but in text that looks like tokens and could easily
queue up 100s of tokens that aren't tokens.  The better solution is to
put one test per file and not try splitting the file into smaller
chunks.  That has a different set of issues, but it is better from the
lexing perspective.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-23 18:49 +0000
  Re: How can the speed of a scanner be independent of the number of rules? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-03-23 21:27 +0000
    Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-23 19:57 -0700
    Re: How can the speed of a scanner be independent of the number of rules? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 13:12 +0200
    Re: How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-24 11:53 +0000
      Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:56 -0700
        Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-24 13:08 -0700
  Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 02:01 -0700
    Re: Parallel scanning, was How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:18 -0700
      Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 22:32 +0200
        Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 08:04 +0000
          Re: Parallel lexers by chunking the input gah4 <gah4@u.washington.edu> - 2022-03-25 07:58 -0700
          Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 17:47 +0000
            Parallel lexers by chunking the input. Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 00:36 +0200
            Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 01:00 +0200
            Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-26 09:27 +0000

csiph-web