Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2962
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Parallel lexers by chunking the input. |
| Date | 2022-03-26 00:36 +0200 |
| Organization | Compilers Central |
| Message-ID | <22-03-068@comp.compilers> (permalink) |
| References | <22-03-058@comp.compilers> <22-03-064@comp.compilers> <22-03-065@comp.compilers> <22-03-067@comp.compilers> |
For the record, Hyperscan does do that, but I think only in NFA mode. (That wasn't my part of the code, so I don't know the details.) However, it does figure out states where you can scan far ahead and then patch up. As gah4 mentions, this works relatively well in applications like Snort, where you are looking for regular expressions that might start anywhere, but you mostly don't expect to match--that's what Hyperscan is optimized for. And, Michela Becchi's papers show ways of doing that reliably with DFAs also. And, yes, Anton, you can express that uncertainty, but there are multiple factors involved. An important one is the number of unbounded token types you have. Those often add together combinatorially. So, if you have two different forms of comments, plus a set of preprocessor directives, plus strings, you might have 64 different states you need to consider. However, in the case that interests me (currently) which is compiler unit tests, I have a different factor. The amount of code compared to the amount of unbounded tokens is small. Each test case has about 10-20 lines of mostly boilerplate code, but within that are two fragments of unbounded tokens which look like code but aren't, they are the input and expected output. The input is often 30-50 lines (e.g. one of the TPC benchmarks) and the expected output, the json representation of the IR, is often 100-200 lines and each of those has an average of 5 sequences of characters that would parse as a token, but isn't one. So, any arbitrary starting point, is most likely in an unbounded token but in text that looks like tokens and could easily queue up 100s of tokens that aren't tokens. The better solution is to put one test per file and not try splitting the file into smaller chunks. That has a different set of issues, but it is better from the lexing perspective. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------
Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar
How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-23 18:49 +0000
Re: How can the speed of a scanner be independent of the number of rules? Kaz Kylheku <480-992-1380@kylheku.com> - 2022-03-23 21:27 +0000
Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-23 19:57 -0700
Re: How can the speed of a scanner be independent of the number of rules? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 13:12 +0200
Re: How can the speed of a scanner be independent of the number of rules? Roger L Costello <costello@mitre.org> - 2022-03-24 11:53 +0000
Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:56 -0700
Re: How can the speed of a scanner be independent of the number of rules? gah4 <gah4@u.washington.edu> - 2022-03-24 13:08 -0700
Re: How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 02:01 -0700
Re: Parallel scanning, was How can the speed of a scanner be independent of the number of rules? Jan Ziak <0xe2.0x9a.0x9b@gmail.com> - 2022-03-24 11:18 -0700
Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-24 22:32 +0200
Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 08:04 +0000
Re: Parallel lexers by chunking the input gah4 <gah4@u.washington.edu> - 2022-03-25 07:58 -0700
Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-25 17:47 +0000
Parallel lexers by chunking the input. Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 00:36 +0200
Parallel lexers by chunking the input Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 01:00 +0200
Re: Parallel lexers by chunking the input anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2022-03-26 09:27 +0000
csiph-web