Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Kaz Kylheku <480-992-1380@kylheku.com> Newsgroups: comp.compilers Subject: Re: State-of-the-art algorithms for lexical analysis? Date: Mon, 6 Jun 2022 16:00:37 -0000 (UTC) Organization: A noiseless patient Spider Lines: 139 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <22-06-010@comp.compilers> References: <22-06-006@comp.compilers> <22-06-007@comp.compilers> Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="76736"; mail-complaints-to="abuse@iecc.com" Keywords: lex, comment Posted-Date: 06 Jun 2022 15:55:55 EDT X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:3048 On 2022-06-05, gah4 wrote: > On Sunday, June 5, 2022 at 2:08:12 PM UTC-7, Roger L Costello wrote: > > (snip) > >> Are regular expressions still the best way to specify tokens? > > Some years ago, I used to work with a company that sold hardware > search processors to a certain three letter agency that we are not > supposed to mention, but everyone knows. > > It has a completely different PSL, Pattern Specification Language, > much more powerful than the usual regular expression. > > Both the standard and extended regular expression are nice, in that we > get used to using them, especially with grep, and without thinking too > much about them. > > I suspect, though, that if they hadn't previously been defined, we > might come up with something different today. Whether or not regexes are defined: - we would still have have the concept of a machine with a finite number of states. - the result would hold that a machine with a finite number of states can only recognize certain sets of strings (what we call "regular languages"), and that those sets can be infinite. - the observation would still be made that those sets of strings have certain features, like expressing certain kinds of repetitions, but not other repetitive patterns such as: - an arbitrary number of N parentheses followed by a N closed ones, for any N. - obvious compressed notations would suggest themselves for expressing the features of those sets. - someone would dedicate him or herself toward finding the minimal set of useful operations in the notation which can capture all such sets (e.g. the same process by which we know that ? and + are not necessary if we have the Kleene * and branching because A+ is just AA*, and A? is (A|). The Kleene star and branching would surely be rediscovered. We would end up with regex under a different name, using different notations: maybe some other symbol instead of star, perhaps in a different position, like prefix instead of suffix, or whatever. > Among others, PSL has the ability to define approximate matches, > such as a word with one or more misspellings, that is insertions, > deletions, or substitutions. Usual RE don't have that ability. This may be great for some explorative programming, but doesn't do much when you're writing a compiler for a very specific, defined language. Programmers misspell not only the fixed tokens of a language, but also program-defined identifiers like function names, variables, and types. Today, when a C compiler says "undeclared identifier `pintf`, did you mean `printf`?", this is not based on some misspelling support in the lexical analyzer, and could not reasonably be. First the error is identified in the ordinary way, and then some algorithm that is entirely external to parsing is applied to the symbol tables to find identifiers similar to the undeclared one. > There are also PSL expressions for ranges of numbers. > You can often do that with very complicated RE, considering > all of the possibilities. PSL automatically processes those > possibilities. (Some can expand to complicated code.) But ranges of numbers are regular sets. You can have a macro operator embedded in a regex language whuch generates that same code. For instance for matching the decimal strings 27 to 993, there is a regex, and a way of calculating that regex. We know thre is a regex because the strings set{ "27", "28", ..., "993" } is a regular set, being finite. We can form a regex just by combining those elements with a | branch operator. We can do something which condenses some of the redundancy like: 9(9(|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7 |6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8 |7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9 |8|7|6|5|4|3|2|1|0))|8(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0) |7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1| 0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2| 1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|7(9(|9|8|7|6|5|4 |3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5 |4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6 |5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7 |6|5|4|3|2|1|0))|6(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(| 9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4 (|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0) |1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|5(9(|9|8|7|6|5|4|3|2 |1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3 |2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4 |3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5 |4|3|2|1|0))|4(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8| 7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9| 8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(| 9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|3(9(|9|8|7|6|5|4|3|2|1|0 )|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1 |0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2 |1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3 |2|1|0))|2(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6| 5|4|3|2|1|0)|6(9|8|7|6|5|4|3|2|1|0)|5(9|8|7|6|5|4|3|2|1|0)|4(9|8|7|6|5 |4|3|2|1|0)|3(9|8|7|6|5|4|3|2|1|0)|2(9|8|7|6|5|4|3|2|1|0)|1(9|8|7|6|5| 4|3|2|1|0)|0(9|8|7|6|5|4|3|2|1|0))|1(9(9|8|7|6|5|4|3|2|1|0)|8(9|8|7|6| 5|4|3|2|1|0)|7(9|8|7|6|5|4|3|2|1|0)|6(9|8|7|6|5|4|3|2|1|0)|5(9|8|7|6|5 |4|3|2|1|0)|4(9|8|7|6|5|4|3|2|1|0)|3(9|8|7|6|5|4|3|2|1|0)|2(9|8|7|6|5| 4|3|2|1|0)|1(9|8|7|6|5|4|3|2|1|0)|0(9|8|7|6|5|4|3|2|1|0)) where we can better notate sequences like (9|8|7|6|5|4|3|2|1|0) as [0-9]. What I did there was turn these things into a trie, and then just transliterated that trie into regex syntax. (The digits appear in reverse because the trie implementation I'm using relies on hash tables, and hash tables don't have a specified order; the actual order observed as an artifact of the hashing function. In modern systems that function can be perturbed by a randomly generated key for thwarting hash table attacks.) Anyway, that sort of thing being what it is, the mechanism for generating it thing can be readily embedded as a syntactic sugar into a regex language, without making it non-regular in any way. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal [To put it another way, the set of strings you can recoginize with a NFA or DFA is the same as the set of strings you can describe with a regex. A DFA is such an obvious thing that we would have reverse engineered regexes from them if Ken Thompson hadn't done it the other way. -John]