Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Parallel lexers by chunking the input Date: Thu, 24 Mar 2022 22:32:12 +0200 Organization: Compilers Central Lines: 71 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-03-064@comp.compilers> References: <22-03-058@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="48055"; mail-complaints-to="abuse@iecc.com" Keywords: lex, parallel Posted-Date: 24 Mar 2022 19:20:45 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2958 Under certain very limited conditions you can parallelize your lexer by chunking the inputs, but the number of cases for which that breaks make it a rather useless endeavor. At Intel we had Charlie Lasswell and Michela Becchi look into the problem from two different angles. Michela has some excellent papers about dealing with ".*" and related constructs, especially overlapping ".{m,n}" constructs. But, the real issue is strings, comments, and markdown like features. They all suffer from the same issue, which is that they are generally unbounded strings that start and stop with very short sequences. And, by picking some arbitrary point in the future to start scanning, you cannot be sure whether you are inside or outside one of those features. This, of course, also makes for very problematic error messages. Leave out a single character (e.g. an opening or closing quote) and text can be entirely misinterpreted. One common attempt at mitigating that problem is limiting such sequences to a single line. However, the result of that is the need for "raw" strings where you can have long strings that span multiple lines. Like ``` code goes in here and can include `, ', and " marks where appropriate and it doesn't end until one sees ``` of maybe one prefers #" to delimit a raw string allowing embedded " and ' marks and maybe the processor throws away indentation within it and trailing whitespace at the end of a line but we still are looking for the terminating and ``` here isn't special either "# And, before you think that's a ridiculous example, let me tell you a very real use of it. I'm building a compiler and in particular unit tests for it. So, I need to have strings that represent significant fragments of real programs and other strings, the "golden" results, that capture json output of the internal structure which I cut and paste from dumps of it as a string. Both of these fragments (especially the latter) can be multiple lines long and can look like code (the first actually are code) but which I don't want parsed as such. A parallelized lexer could easily attempt to tokenize those fragments if it picked a bad starting point. That speculative lexing would actually make the process slower. Preston Briggs was right when he talked about regular expressions being an "embarrassingly serial application". In fact, I have numerous times thought about applying Boyer-Moore like algorithms to lexing and parsing, and then remembering these cases put it down as technically feasible but practically useless. About the only way to use parallelism to speed up lexing is if one has separate files being processed and rules which keep tokens from changing state over a file boundary. Now, that last requirement isn't as hard as it seems, as usually "include" files can only get recognized at token boundaries. Still the speedup therefrom is not that useful. There are other better solutions to that problem. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------