Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: State-of-the-art algorithms for lexical analysis?
Date: Mon, 6 Jun 2022 16:00:37 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 139
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-06-010@comp.compilers>
References: <22-06-006@comp.compilers> <22-06-007@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="76736"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, comment
Posted-Date: 06 Jun 2022 15:55:55 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:3048

On 2022-06-05, gah4 <gah4@u.washington.edu> wrote:
> On Sunday, June 5, 2022 at 2:08:12 PM UTC-7, Roger L Costello wrote:
>
> (snip)
>
>> Are regular expressions still the best way to specify tokens?
>
> Some years ago, I used to work with a company that sold hardware
> search processors to a certain three letter agency that we are not
> supposed to mention, but everyone knows.
>
> It has a completely different PSL, Pattern Specification Language,
> much more powerful than the usual regular expression.
>
> Both the standard and extended regular expression are nice, in that we
> get used to using them, especially with grep, and without thinking too
> much about them.
>
> I suspect, though, that if they hadn't previously been defined, we
> might come up with something different today.

Whether or not regexes are defined:

- we would still have have the concept of a machine with a finite number
  of states.

- the result would hold that a machine with a finite number of states
  can only recognize certain sets of strings (what we call "regular
  languages"), and that those sets can be infinite.

- the observation would still be made that those sets of strings have
  certain features, like expressing certain kinds of repetitions,
  but not other repetitive patterns such as:
  - an arbitrary number of N parentheses followed by a N closed
    ones, for any N.

- obvious compressed notations would suggest themselves for expressing
  the features of those sets.

- someone would dedicate him or herself toward finding the minimal set
  of useful operations in the notation which can capture all such
  sets (e.g. the same process by which we know that ? and + are not
  necessary if we have the Kleene * and branching because
  A+ is just AA*, and A? is (A|).  The Kleene star and branching would
  surely be rediscovered.

We would end up with regex under a different name, using different
notations: maybe some other symbol instead of star, perhaps in
a different position, like prefix instead of suffix, or whatever.

> Among others, PSL has the ability to define approximate matches,
> such as a word with one or more misspellings, that is insertions,
> deletions, or substitutions. Usual RE don't have that ability.

This may be great for some explorative programming, but doesn't do much
when you're writing a compiler for a very specific, defined language.

Programmers misspell not only the fixed tokens of a language, but also
program-defined identifiers like function names, variables, and types.

Today, when a C compiler says "undeclared identifier `pintf`, did you
mean `printf`?", this is not based on some misspelling support in the
lexical analyzer, and could not reasonably be. First the error is
identified in the ordinary way, and then some algorithm that is entirely
external to parsing is applied to the symbol tables to find identifiers
similar to the undeclared one.

> There are also PSL expressions for ranges of numbers.
> You can often do that with very complicated RE, considering
> all of the possibilities.  PSL automatically processes those
> possibilities.  (Some can expand to complicated code.)

But ranges of numbers are regular sets. You can have a macro operator
embedded in a regex language whuch generates that same code.

For instance for matching the decimal strings 27 to 993, there is a
regex, and a way of calculating that regex.

We know thre is a regex because the strings set{ "27", "28", ..., "993" }
is a regular set, being finite. We can form a regex just by combining
those elements with a | branch operator.

We can do something which condenses some of the redundancy like:

  9(9(|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7
  |6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8
  |7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9
  |8|7|6|5|4|3|2|1|0))|8(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)
  |7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|
  0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|
  1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|7(9(|9|8|7|6|5|4
  |3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5
  |4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6
  |5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7
  |6|5|4|3|2|1|0))|6(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|
  9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4
  (|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)
  |1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|5(9(|9|8|7|6|5|4|3|2
  |1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3
  |2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4
  |3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5
  |4|3|2|1|0))|4(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|
  7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|
  8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|
  9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|3(9(|9|8|7|6|5|4|3|2|1|0
  )|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1
  |0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2
  |1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3
  |2|1|0))|2(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|
  5|4|3|2|1|0)|6(9|8|7|6|5|4|3|2|1|0)|5(9|8|7|6|5|4|3|2|1|0)|4(9|8|7|6|5
  |4|3|2|1|0)|3(9|8|7|6|5|4|3|2|1|0)|2(9|8|7|6|5|4|3|2|1|0)|1(9|8|7|6|5|
  4|3|2|1|0)|0(9|8|7|6|5|4|3|2|1|0))|1(9(9|8|7|6|5|4|3|2|1|0)|8(9|8|7|6|
  5|4|3|2|1|0)|7(9|8|7|6|5|4|3|2|1|0)|6(9|8|7|6|5|4|3|2|1|0)|5(9|8|7|6|5
  |4|3|2|1|0)|4(9|8|7|6|5|4|3|2|1|0)|3(9|8|7|6|5|4|3|2|1|0)|2(9|8|7|6|5|
  4|3|2|1|0)|1(9|8|7|6|5|4|3|2|1|0)|0(9|8|7|6|5|4|3|2|1|0))

where we can better notate sequences like (9|8|7|6|5|4|3|2|1|0) as
[0-9].

What I did there was turn these things into a trie, and then just
transliterated that trie into regex syntax.

(The digits appear in reverse because the trie implementation I'm using
relies on hash tables, and hash tables don't have a specified order; the
actual order observed as an artifact of the hashing function. In modern
systems that function can be perturbed by a randomly generated key for
thwarting hash table attacks.)

Anyway, that sort of thing being what it is, the mechanism for
generating it thing can be readily embedded as a syntactic sugar into a
regex language, without making it non-regular in any way.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[To put it another way, the set of strings you can recoginize with a
NFA or DFA is the same as the set of strings you can describe with a regex.
A DFA is such an obvious thing that we would have reverse engineered
regexes from them if Ken Thompson hadn't done it the other way. -John]