Path: csiph.com!xmission!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Re: Parsing using a Graphics Processing Unit (GPU)? Date: Thu, 3 Sep 2020 17:52:22 +0300 Organization: Compilers Central Lines: 65 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <20-09-013@comp.compilers> References: Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="70898"; mail-complaints-to="abuse@iecc.com" Keywords: parallel, lex Posted-Date: 04 Sep 2020 21:43:31 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com In-Reply-To: Xref: csiph.com comp.compilers:2584 I hate getting overly involved in this, but not only is GPU lexing possible, it probably isn't that complicated. To make this practical, let's assume we are lexing a language like C and we have split our file into arbitrary chunks (say 1000 bytes) and are trying to process them in parallel. First, note that while the preprocessor can do some pretty complicated things to the output token stream, it doesn't have that impact at the lexical level, the level of turning things into "raw" tokens. You cannot write a macro that introduces a quote character or a comment delimiter. And those are your main issues in terms of lexing. Yes, you have to do symbol table lookup (and macro expansion) as a post-step, but you aren't doing that on a character-by-character basis. In fact, you basically have only a few different "states" to worry about. 1) you are lexing normally, not in one of the special cases that follows. 2) you are in a string, i.e. you have seen a quote character and until you see the matching one, you are not tokenizing at all. There are two quote characters so call them states 2a and 2b. 3) you have seen a C comment start /* and are looking for */. Again, you are not tokenizing. 4) you have seen a C++ comment start // and are looking for a newline. Not tokenizing. 5) you have seen a # (preceded only with whitespace) and are looking for a newline. You are tokenizing, but the line is a preprocessor directive. 6) you are in code that is in a #if and will be skipped for that reason. 7) you are at the start of your buffer and the previous character might be a \ So, given those states, assume that your chunk starts in state 1 and lex like that, but saving the contents as a string in case of state 2. On a quote, keep going but reverse the text contexts. Similar idea for comment marks and newlines and hash characters. You only have to deal with state 7 for the first character and if you consider the first character part of the previous chunk, just that you get to inspect, you can probably finesse that issue. For each chunk, you have roughly two interesting sequences of possible tokens, whether you were in a string or not. And you can easily patch those together because your left context always tells you which state you enter a chunk in. Now, while I did this hand-waving analysis for C. Most languages are not significantly more complex at the lexical level. Basically, you are tokenizing or you are collecting a string or the text will be entirely thrown away, and you mostly care about the first two cases (tokenizing or building a string and even building a string is easy (although can use memory) so you really only care about the tokenizing case. So build a SIMD lexer using techniques based upon Wu Manber (and Glushkov NFAs) and break your text into chunks and let the GPU have a field day. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------