Path: csiph.com!1.us.feeder.erje.net!3.us.feeder.erje.net!feeder.erje.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: gah4 Newsgroups: comp.compilers Subject: Re: How can the speed of a scanner be independent of the number of rules? Date: Thu, 24 Mar 2022 13:08:20 -0700 (PDT) Organization: Compilers Central Lines: 39 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-03-062@comp.compilers> References: <22-03-047@comp.compilers> <22-03-048@comp.compilers> <22-03-056@comp.compilers> <22-03-059@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="47455"; mail-complaints-to="abuse@iecc.com" Keywords: lex, performance Posted-Date: 24 Mar 2022 19:18:16 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com In-Reply-To: <22-03-059@comp.compilers> Xref: csiph.com comp.compilers:2957 On Thursday, March 24, 2022 at 12:19:52 PM UTC-7, Jan Ziak wrote: (snip) > That is, in my opinion, an invalid way of looking at scanning, because it > isn't computer science. A better way (I am not saying "the only right way") is > for example to compare the size of integers, the size of a memory address, > that a typical year-2022 CPU can process/access/read/write/lookup using in a > single CPU instruction. This distinction comes up more often for sorting algorithms, but I suppose it applies here, too. If one wants to sort a gigabyte of bytes, that is elements of length one (8 bit) byte, it is faster to count them, and then generate an output file with the appropriate number of each byte. Depending on how you do the measurement, that violates some rule on sorting algorithms. > The fact is that the size of those integers and > addresses (such as: 32 bits, 64 bits) is extremely large compared to the > number of rules and complexity of any grammar that a human is able to directly > (by hand, by means of keyboard/mouse/touchscreen) enter into a computer. Some years ago, I was working on using DFA for DNA sequence comparison. I did, at least, do some tests with millions of states, maybe tens of millions. (The alphabet size is four, so you can fit a lot of states in memory.) But yes, I didn't try to enter the states using a keyboard, mouse, or other human interface device. Reminds me, though, of some years ago, close to the time I was working on that one, it was said that much of the DNA sequences in GenBank were obtained by OCR scanning the pages of journal articles. The fraction that were obtained that way kept decreasing, but the number was increasing. (Like Moore's law, the size of the GenBank increases exponentially, last I knew at 1%/week.)