Parallel lexers by chunking the input

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups	comp.compilers
Subject	Parallel lexers by chunking the input
Date	2022-03-24 22:32 +0200
Organization	Compilers Central
Message-ID	<22-03-064@comp.compilers> (permalink)
References	<22-03-058@comp.compilers>

Show all headers | View raw

Under certain very limited conditions you can parallelize your lexer
by chunking the inputs, but the number of cases for which that breaks
make it a rather useless endeavor.  At Intel we had Charlie Lasswell
and Michela Becchi look into the problem from two different angles.
Michela has some excellent papers about dealing with ".*" and related
constructs, especially overlapping ".{m,n}" constructs.

But, the real issue is strings, comments, and markdown like features.
They all suffer from the same issue, which is that they are generally
unbounded strings that start and stop with very short sequences.  And,
by picking some arbitrary point in the future to start scanning, you
cannot be sure whether you are inside or outside one of those
features.  This, of course, also makes for very problematic error
messages.  Leave out a single character (e.g. an opening or closing
quote) and text can be entirely misinterpreted.

One common attempt at mitigating that problem is limiting such
sequences to a single line.  However, the result of that is the need
for "raw" strings where you can
have long strings that span multiple lines.

Like

```
code goes in here
and can include `, ', and " marks where appropriate
and it doesn't end until one sees
```

of maybe one prefers

#" to delimit a raw string
   allowing embedded " and ' marks
   and maybe the processor throws away indentation within it
   and trailing whitespace at the end of a line
   but we still are looking for the terminating
   and ``` here isn't special either
"#

And, before you think that's a ridiculous example, let me tell you a
very real use of it.  I'm building a compiler and in particular unit
tests for it.  So, I need to have strings that represent significant
fragments of real programs and other strings, the "golden" results,
that capture json output of the internal structure which I cut and
paste from dumps of it as a string.  Both of these fragments
(especially the latter) can be multiple lines long and can look like
code (the first actually are code) but which I don't want parsed as
such.   A parallelized lexer could easily attempt to tokenize those
fragments if it picked a bad starting point.  That speculative lexing
would actually make the process slower.

Preston Briggs was right when he talked about regular expressions
being an "embarrassingly serial application".

In fact, I have numerous times thought about applying Boyer-Moore like
algorithms to lexing and parsing, and then remembering these cases put
it down as technically feasible but practically useless. About the
only way to use parallelism to speed up lexing is if one has separate
files being processed and rules which keep tokens from changing state
over a file boundary.  Now, that last requirement isn't as hard as it
seems, as usually "include" files can only get recognized at
token boundaries.  Still the speedup therefrom is not that useful.
There are other better solutions to that problem.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-23 18:49 +0000
  Re: How can the speed of a scanner be independent of the number of rules? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-03-23 21:27 +0000
    Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-23 19:57 -0700
    Re: How can the speed of a scanner be independent of the number of rules? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 13:12 +0200
    Re: How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-24 11:53 +0000
      Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:56 -0700
        Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-24 13:08 -0700
  Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 02:01 -0700
    Re: Parallel scanning, was How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:18 -0700
      Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 22:32 +0200
        Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 08:04 +0000
          Re: Parallel lexers by chunking the input gah4 <gah4@u.washington.edu> - 2022-03-25 07:58 -0700
          Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 17:47 +0000
            Parallel lexers by chunking the input. Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 00:36 +0200
            Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 01:00 +0200
            Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-26 09:27 +0000

csiph-web