Path: csiph.com!xmission!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Re: Parsing using a Graphics Processing Unit (GPU)?
Date: Thu, 3 Sep 2020 17:52:22 +0300
Organization: Compilers Central
Lines: 65
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <20-09-013@comp.compilers>
References: <CAJj6eyXKHQx2V9H6pU8rNU_RPgGfFZS3qN1ikvBJ6oo9ewJvLQ@mail.gmail.com> <CAJj6eyW4TLsCdq34B1T38YkOkr-8fb7DQw2M-T7PXRdNNQrDPA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="70898"; mail-complaints-to="abuse@iecc.com"
Keywords: parallel, lex
Posted-Date: 04 Sep 2020 21:43:31 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <CAJj6eyW4TLsCdq34B1T38YkOkr-8fb7DQw2M-T7PXRdNNQrDPA@mail.gmail.com>
Xref: csiph.com comp.compilers:2584

I hate getting overly involved in this, but not only is GPU lexing
possible, it probably isn't that complicated.

To make this practical, let's assume we are lexing a language like C
and we have split our file into arbitrary chunks (say 1000 bytes) and
are trying to process them in parallel.

First, note that while the preprocessor can do some pretty complicated
things to the output token stream, it doesn't have that impact at the
lexical level, the level of turning things into "raw" tokens.  You
cannot write a macro that introduces a quote character or a comment
delimiter. And those are your main issues in terms of lexing.  Yes,
you have to do symbol table lookup (and macro expansion) as a
post-step, but you aren't doing that on a character-by-character
basis.

In fact, you basically have only a few different "states" to worry about.

1) you are lexing normally, not in one of the special cases that follows.
2) you are in a string, i.e. you have seen a quote character and until
you see the matching one, you are not tokenizing at all.  There are
two quote characters so call them states 2a and 2b.
3) you have seen a C comment start /* and are looking for */.  Again,
you are not tokenizing.
4) you have seen a C++ comment start // and are looking for a newline.
Not tokenizing.
5) you have seen a # (preceded only with whitespace) and are looking
for a newline.  You are tokenizing, but the line is a preprocessor
directive.
6) you are in code that is in a #if and will be skipped for that reason.
7) you are at the start of your buffer and the previous character might be a \

So, given those states, assume that your chunk starts in state 1 and
lex like that, but saving the contents as a string in case of state 2.
On a quote, keep going but reverse the text contexts.  Similar idea
for comment marks and newlines and hash characters.  You only have to
deal with state 7 for the first character and if you consider the
first character part of the previous chunk, just that you get to
inspect, you can probably finesse that issue.

For each chunk, you have roughly two interesting sequences of possible
tokens, whether you were in a string or not.  And you can easily patch
those together because your left context always tells you which state
you enter a chunk in.

Now, while I did this hand-waving analysis for C.  Most languages are
not significantly more complex at the lexical level.  Basically, you
are tokenizing or you are collecting a string or the text will be
entirely thrown away, and you mostly care about the first two cases
(tokenizing or building a string and even building a string is easy
(although can use memory) so you really only care about the tokenizing
case.

So build a SIMD lexer using techniques based upon Wu Manber (and
Glushkov NFAs) and break your text into chunks and let the GPU have a
field day.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------