Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Parallel lexers by chunking the input. Date: Sat, 26 Mar 2022 00:36:40 +0200 Organization: Compilers Central Lines: 40 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-03-068@comp.compilers> References: <22-03-058@comp.compilers> <22-03-064@comp.compilers> <22-03-065@comp.compilers> <22-03-067@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="55327"; mail-complaints-to="abuse@iecc.com" Keywords: lex, performance, parallel Posted-Date: 25 Mar 2022 18:51:59 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2962 For the record, Hyperscan does do that, but I think only in NFA mode. (That wasn't my part of the code, so I don't know the details.) However, it does figure out states where you can scan far ahead and then patch up. As gah4 mentions, this works relatively well in applications like Snort, where you are looking for regular expressions that might start anywhere, but you mostly don't expect to match--that's what Hyperscan is optimized for. And, Michela Becchi's papers show ways of doing that reliably with DFAs also. And, yes, Anton, you can express that uncertainty, but there are multiple factors involved. An important one is the number of unbounded token types you have. Those often add together combinatorially. So, if you have two different forms of comments, plus a set of preprocessor directives, plus strings, you might have 64 different states you need to consider. However, in the case that interests me (currently) which is compiler unit tests, I have a different factor. The amount of code compared to the amount of unbounded tokens is small. Each test case has about 10-20 lines of mostly boilerplate code, but within that are two fragments of unbounded tokens which look like code but aren't, they are the input and expected output. The input is often 30-50 lines (e.g. one of the TPC benchmarks) and the expected output, the json representation of the IR, is often 100-200 lines and each of those has an average of 5 sequences of characters that would parse as a token, but isn't one. So, any arbitrary starting point, is most likely in an unbounded token but in text that looks like tokens and could easily queue up 100s of tokens that aren't tokens. The better solution is to put one test per file and not try splitting the file into smaller chunks. That has a different set of issues, but it is better from the lexing perspective. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------