Groups > comp.compilers > #3047 > unrolled thread

Re: State-of-the-art algorithms for lexical analysis?

Started by	Roger L Costello <costello@mitre.org>
First post	2022-06-06 10:48 +0000
Last post	2022-06-06 12:25 -0700
Articles	3 — 2 participants

Back to article view | Back to comp.compilers

  Re: State-of-the-art algorithms for lexical analysis? Roger L Costello <costello@mitre.org> - 2022-06-06 10:48 +0000
    Re: State-of-the-art algorithms for lexical analysis? gah4 <gah4@u.washington.edu> - 2022-06-06 10:03 -0700
    Re: State-of-the-art algorithms for lexical analysis? gah4 <gah4@u.washington.edu> - 2022-06-06 12:25 -0700

#3047 — Re: State-of-the-art algorithms for lexical analysis?

From	Roger L Costello <costello@mitre.org>
Date	2022-06-06 10:48 +0000
Subject	Re: State-of-the-art algorithms for lexical analysis?
Message-ID	<22-06-009@comp.compilers>

gah4 wrote:

> Pattern Specification Language (PSL) is
> much more powerful than the usual
> regular expression.

Neat!

> I suspect that if regexes hadn't previously
> been defined, we might come up with
> something different today.

Wow! That is a remarkable statement.

I will look into PSL. There are algorithms for converting regexes to DFA and then using the DFA to tokenize the input. Are there algorithms for converting PSL to (what?) and then using the (what?) to tokenize the input?

/Roger

[toc] | [next] | [standalone]

#3049

From	gah4 <gah4@u.washington.edu>
Date	2022-06-06 10:03 -0700
Message-ID	<22-06-011@comp.compilers>
In reply to	#3047

On Monday, June 6, 2022 at 8:06:28 AM UTC-7, Roger L Costello wrote:

(snip)

> I will look into PSL. There are algorithms for converting regexes to DFA
> and then using the DFA to tokenize the input. Are there algorithms for
> converting PSL to (what?) and then using the (what?) to tokenize the input?

The approximate searches are done using dynamic programming.
The penalty is 1 for insertion, deletion, or substitution and the score
is in 3 bits, so up to six spelling errors.

The whole query is then compiled into code for a systolic array,
which then runs as fast as the data comes off disk.

FDF2 is a 9U VME board that runs in a VME based Sun system.

FDF3 connects directly to a SCSI disk, and also to a Sun workstation.
In searching, it transfers directly from the disk.  To load data into
the disk, the disk is accessed indirectly through the FDF3.
It is a desktop box, about the size of a large external SCSI disk.

Some of it is described here:

https://aclanthology.org/X93-1011.pdf

along with its use for searching Japanese text, and:

https://trec.nist.gov/pubs/trec3/papers/paper.ps.gz

[toc] | [prev] | [next] | [standalone]

#3052

From	gah4 <gah4@u.washington.edu>
Date	2022-06-06 12:25 -0700
Message-ID	<22-06-014@comp.compilers>
In reply to	#3047

On Monday, June 6, 2022 at 8:06:28 AM UTC-7, Roger L Costello wrote:

(snip, I wrote)
> > I suspect that if regexes hadn't previously
> > been defined, we might come up with
> > something different today.

> Wow! That is a remarkable statement.

Well, mostly, regex were defined based on what was reasonable to do on
computers at the time. It seems reasonable, then, with the more powerful
computers of today, to expect that more features would have been added.

Some of that was done in the later ERE, Extended Regular Expression.

But there is a strong tendency not to break backward compatibility,
and so not add new features later.
[See my note about DFAs a few messages back. EREs are just syntactic
sugar on regular REs so sure. PCREs are swell but they are a lot
slower since backreferences mean you need to be able to back up.
-John]

[toc] | [prev] | [standalone]

csiph-web

Re: State-of-the-art algorithms for lexical analysis?

Contents

#3047 — Re: State-of-the-art algorithms for lexical analysis?

#3049

#3052