Path: csiph.com!1.us.feeder.erje.net!3.us.feeder.erje.net!feeder.erje.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Re: How can the speed of a scanner be independent of the number of rules? Date: Thu, 24 Mar 2022 13:12:20 +0200 Organization: Compilers Central Lines: 135 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-03-054@comp.compilers> References: <22-03-047@comp.compilers> <22-03-048@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="69235"; mail-complaints-to="abuse@iecc.com" Keywords: lex, performance Posted-Date: 24 Mar 2022 13:29:43 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2953 I see that Kaz Kylheku has written a thought out and well-reasoned response. However, it isn't based upon actual analysis of data and uses a strawman that I reject having done (while at Intel) such an analysis while designing a register expression acceleration chip. So, let me first deal with the strawman. > None of these rules take precedence over the 10 rules, and so the input is handled by the same rules. That assumption is not necessary. Even if the input is distributed over all 1000 rules, the performance will still be substantially the same. This is due to the use of a DFA. In a DFA, the speed is constant because the same code handles each state, roughly something like this. s := initial state; forever() do { c := new character from input; s := new state[c]; if (s == end of token) return token; token :+= c; } On most modern computers, the cost is dominated by the cost to fetch (via an indirect/indexed access) the next state. Now, cache effects can impact this, but most modern caches are large enough that this is a negligible effect. Moreover, it is possible to consider the case where all cache probes are misses and it goes to memory. Although, that is significantly slower, it is still constant, and more importantly the ratio of cache hits to misses is essentially also a constant ratio. Only if you are increasing the size of the DFA to the point where it starts approaching (or exceeding) the cache size does this become an issue. That would be the "knee" Kaz mentioned. 1000 tokens is usually not sufficient to do that. As I mentioned we measured this at Intel, not only for programming lexers but also for Snort, where the number of regular expressions is much larger. We even looked at ClamAV where we considered making our regular expression engine into a tool to calculate the "signatures" (essentially performing a cryptographic hash). Moreover, this goes to the argument that hashing algorithms are essentially O(1). Now, theory and practice don't always precisely line up, so in a real application, one would benchmark to verify this, but over a large range of regular expression sets, I have found it to hold, having done that benchmarking. This leads to another experiment we did at Intel as part of the Hyperscan software effort. Hyperscan is generally a Glushkov inspired NFA engine, but when possible (i.e. the DFAs are small enough) it does subset construction to turn them into DFAs. However, when the DFAs are too large, the construction fails and it falls back to the NFA engine. Because DFAs can be significantly faster than NFAs, we wanted an algorithm that allowed us to split an NFA into a small set of parallel DFAs (8 or fewer). To be practical though, we did not want the overhead of threads running on multiple cores accessing the same input data. That would in many cases have completely eroded the DFA performance boost. So, I was tasked with finding a way of running 8 threads in one thread, by interleaving the code of them. From the above code and analysis, we know the issue is that next state fetch, and dependencies of it on the statements before and after it. Even with a cache hit, this is longer than the instruction pipeline. So, the answer was to move the dependencies away from each other allowing the fetch to occur and doing other work in the pipeline in the meantime. With the help of VTune this was achieved. We got our 8 parallel DFAs to operate in the same time, it took to execute a single DFA. Note that this was achieved without the use of any assembly language coding. Our code was still portable C code with no compiler specific features. The last point I want to make in this regard is that in most cases, regular expressions, tend to have a "bushy" part where the code fans out to a variety of states, but much larger "skinny" parts, where generally only a few (usually only 1, following a variation on Zipf's distribution law) possible out going transitions. This is particularly easy to see if you are doing something like the Aho-Corasick algorithm, where the fanout is usually concentrated in the first few (often 2 or 3) states at the head of the algorithm, the initial state and its successors. The regular expression hardware we built was actually optimized for this property, with a fast dispatch for the head states and then threads for the tail states. (If you can read patent-ese language, you can tease out the details. I found that challenging despite being the original author of the idea.) -------- Now, having hopefully covered most of the reasons why with a DFA, the worry is mostly unfounded. I want to go over the other case. Particularly as it applies to mistakes one can make. This most often occurs in hand-written parsers and or lexers. First, I want to note that the LR(0) machine construction algorithm is essentially a variation on the subset construction algorithm. That means that your LR machine essentially has a DFA inside it. This is scary to many people as they aren't used to reading state machines, this is compounded when the tables are "compressed". However, despite that the same effect holds. The running time of an LR parser is still generally linear in the input length and each state transition is O(1). Aside: with GLR parsers and parse forests, this guarantee is somewhat eroded, but it is usually still acceptable as it is often only eroded over short ambiguous stretches. And here is where my understanding of the theory reaches its boundary. I think in the worst case, it still has O(n**3) complexity based upon a German theory paper I attempted unsuccessfully to read completely. Naively, it seems that the bound should be n**2, but I cannot do the calculations to prove that and the paper seemed to have good evidence for its bound, i.e. that the forests could grow at n**2 at each new token, rather than just n. So, how does a naive implementation break this? Many recursive descent parsers don't use a computed goto (case statement with an integer selector) to do the dispatch. Instead, they use a series of if statements. The result is that you have now linearized the lookup and you need an average of n/2 comparisons to find the next state. That means the next state lookup is no longer O(1), but O(n). You sometimes also see this in keyword lookup. They do so by comparing the text to a sequence of strings to determine whether the token is a keyword or not. Again, you have introduced an O(n) factor by serializing the comparison. Now, if the list is not long, this factor isn't sufficient to impact the total performance, but it is performance left on-the-table. The sequence of ifs may be easier to read than a state machine, but that is their only advantage. Thus, if you want to speed your keyword lookup up, use a simple hashing algorithm so that the hash cost is low, but then you have only a limited number of keywords (in each bucket) to compare to. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------