Groups > comp.compilers > #2991 > unrolled thread

Flex is the most powerful lexical analysis language in the world. True or False?

Started by	Roger L Costello <costello@mitre.org>
First post	2022-05-04 11:22 +0000
Last post	2022-05-07 13:10 -0700
Articles	9 — 5 participants

Back to article view | Back to comp.compilers

  Flex is the most powerful lexical analysis language in the world. True or False? Roger L Costello <costello@mitre.org> - 2022-05-04 11:22 +0000
    Re: Flex is the most powerful lexical analysis language in the world. True or False? Tom Shields <thomas.evans.shields@gmail.com> - 2022-05-04 14:14 -0500
    Flex is the most powerful lexical analysis language in the world. True or False? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-05-05 15:20 +0300
      RE: Flex is the most powerful lexical analysis language in the world. True or False? Roger L Costello <costello@mitre.org> - 2022-05-06 11:16 +0000
        RE: Flex is the most powerful lexical analysis language in the world. True or False? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-05-07 13:15 +0300
          Simple Lexer and Simple Parser [ was RE: Flex is the most powerful lexical analysis language in the world. True or False? ] Roger L Costello <costello@mitre.org> - 2022-05-08 13:34 +0000
    Re: Flex is the most powerful lexical analysis language in the world. True or False? George Neuner <gneuner2@comcast.net> - 2022-05-06 11:00 -0400
      Re: Flex is the most powerful lexical analysis language in the world. True or False? gah4 <gah4@u.washington.edu> - 2022-05-06 14:30 -0700
        Re: fun with Postscript, was Flex is the most powerful lexical analysis language in the world. True or False? gah4 <gah4@u.washington.edu> - 2022-05-07 13:10 -0700

#2991 — Flex is the most powerful lexical analysis language in the world. True or False?

From	Roger L Costello <costello@mitre.org>
Date	2022-05-04 11:22 +0000
Subject	Flex is the most powerful lexical analysis language in the world. True or False?
Message-ID	<22-05-003@comp.compilers>

Hi Folks,

1. A lexical analysis language that exclusively provides regular expressions
for scanning input can only process regular languages.

(a) True
(b) False

2. Flex provides, in addition to regular expressions, states and a pushdown
stack. This greatly expands the set of languages that can be processed.

(a) True
(b) False

3. Because Flex provides states and a pushdown stack, Flex lexers can process
context-free languages.

(a) True
(b) False

4. No other lexical analysis language provides states and a pushdown stack.

(a) True
(b) False

5. Flex is the most powerful lexical analysis language in the world.

(a) True
(b) False

/Roger
[I think that you could easily graft a state stack into any lexer that has start states.
Also, tools like Antlr combine the lexer and parser generators, so they're at least as
powerful as flex. -John]

[toc] | [next] | [standalone]

#2992

From	Tom Shields <thomas.evans.shields@gmail.com>
Date	2022-05-04 14:14 -0500
Message-ID	<22-05-005@comp.compilers>
In reply to	#2991

FYI: RE/flex [https://github.com/Genivia/RE-flex] provides a superset
of flex functionality, generates a better implementation of a scanner
in C++ than does flex, and, most importantly, is actively maintained.

Tom Shields

[toc] | [prev] | [next] | [standalone]

#2993

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2022-05-05 15:20 +0300
Message-ID	<22-05-007@comp.compilers>
In reply to	#2991

First, of all, I think Flex is a relatively fine lexer generator with
some good properties and reasons why it is popular, but claiming it is
"the most powerful lexical analysis language in the world" is way too
far overstating it.

It is not even as powerful as some other lexical analysis languages
and even exploiting its power often requires things that cannot be
expressed in Flex alone nor can they be done in ways that are simple
and easy to reason about. So, that part of the statement is patently
false. I will answer the question, but go beyond simple, yes and no
responses to show the actual issues and not some over-simplification
of the problem.

> 1. A lexical analysis language that exclusively provides regular expressions
> for scanning input can only process regular languages.

True if you mean automata theoretic regular expressions, i.e. not
including PCRE extensions, like back references and captures and
zero-width assertions. However, if you include such extensions (or
other ones I will mention below), you can escape those limits and get
to larger language classes, including ones that are Turing
Complete--not that Turing Machine power cannot be abused and result in
a lexer who you cannot figure out what the tokens are without solving
the Halting Problem. So, what one really wants is a nicely constrained
but larger lexing language that is still easy to reason about. There
are tools that approximate that. None are perfect yet (if perfection
is even achievable and not merely a limit to be approached but never
reached).

> 2. Flex provides, in addition to regular expressions, states and a pushdown
> stack. This greatly expands the set of languages that can be processed.
> 3.  Because Flex provides states and a pushdown stack, Flex lexers can process
> context-free languages.

True and true.  States by themselves don't add anything.  The
intersection or union of two regular expression languages is still
regular.  However, a pushdown automata can allow you to process a
context free language.  Now, not having used that feature myself, I
don't know whether one has to "program" the automata by hand or one
can express it "naturally" as context free rules.  I suspect the
latter, but don't know for sure.

In comparison, in 1986, when we built the lexer for our version of
Yacc++, we integrated regular expressions and LR rules, so that one
can simply write yacc-like rules with recursion to express
[deterministic, i.e. ELR] grammars (with regular expressions on the
right hand side).  To my mind, the notation being unified and used
both in the lexer and the parser is easier to comprehend and within
reason is much easier to reason about.    It makes things like nested
comments into an "obvious" recursive lexical rule.

We also have lexer classes which allow one to do lexer states (for
context sensitivity) with inheritance between them and you can change
their states (which class is being lexed) via rules in the lexer or
parser, doing the things that Flex does.

Terence Parr liked our idea and eventually incorporated it into ANTLR
(originally doing an LL version of it), the current version in ANTLR4
is more like a PEG (parsing expression grammar) with extensions to
handle simple (aka direct) left-recursion and precedence.

Moreover, he has some simple lexer extensions that look quite FLEX
like (rules for changing "state" in a controlled fashion including
pushing/popping from a stack).  Being more constrained, it is actually
easier to reason about.  You aren't trying to analyze arbitrary C
(Java) code.

> 4.  No other lexical analysis language provides states and a pushdown stack.
> 5. Flex is the most powerful lexical analysis language in the world.

False and false.  See the preceding paragraphs.  Not only do they do
so.  They do so in a manner which is more easy to reason about.

Moreover, syntactic (and semantic) predicates (an innovation Terence
introduced and we borrowed later) allow one to express languages that
are larger than the context free family.  In fact, there are proofs
that one can use them to express any Turing Complete language.  They
are typically implemented using backtracking.

However, Bryan Forward took the concept of predicates and developed
packrat parsing as a way of implemented PEGs efficiently, effectively
linear.  And, the notation is only minorly different than context free
grammars, with the main difference being that the or (alternative) is
treated as ordered, if the first rule matches apply it and don't even
try the following rules.  As I noted above, this nicely solves certain
precedence problems.  (And, as I said, ANTLR4 incorporates this idea
to positive effect.)

This work has further been extended by Alexander Okhotin into a
variety of "Conjuctive" and "Boolean" grammars.  These again have
Turing Machine power, but are usually simpler to understand and reason
about than most Turing Machines.

In a different direction researchers like Quinn-Tyler Jackson have
developed Adaptive Grammars (see the relevant Wikipeida article).
This is a different solution to the problem of making a constrained
version of Turing Machine power.  Notably they are well designed to
handle "type checking" with the equivalent of the C typedef hack, but
without it being a hack but a formalism that can give clear semantics.
Note that this is traditionally very hard to do in context free
grammars--and as far as I can tell, most people writing lexers and
parsers flounder trying to incorporate even simple type checking into
their grammar, by using the wrong tool (context free grammars) to do
so.

There are other directions that have been pursued also.  Scanner-less
parser generators, which merge the lexing and parsing into one phase.
GLR parsers (and presumably lexers) that deal with ambiguity by making
forests rather than just trees/DAGs.  Parsing combinators can be used
for lexing also.  There are probably even concepts that I have never
read about or have read about and forgotten.

Many of these advances have been incorporated into lexer generators.
They are [almost] all more powerful lexical analysis languages than
what Flex provides.   Most of them are more clear and easier to reason
about also.

So, pardon my knee-jerk reaction to your question.  It is simply that
most people aren't aware of all the different technologies that are
available.  Flex is fine, but it is certainly not a be-all-end-all
solution.  It wouldn't even be the solution I would reach for first.

Kind regards,
Chris

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#2994

From	Roger L Costello <costello@mitre.org>
Date	2022-05-06 11:16 +0000
Message-ID	<22-05-009@comp.compilers>
In reply to	#2993

Thank you Chris for that superb explanation!

> Flex is fine, but it is certainly not a be-all-end-all
> solution.  It wouldn't even be the solution I would
> reach for first.

May I ask, what would be the solution that you would reach for first?

/Roger
[I'm not Chris but the answer is the one that does the job.  You don't need
a flame thrower if you only need to light the BBQ. -John]

[toc] | [prev] | [next] | [standalone]

#2999

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2022-05-07 13:15 +0300
Message-ID	<22-05-018@comp.compilers>
In reply to	#2994

Roger, since you asked, I will answer what solution I would reach for
(and give advice on what I think others should reach for).

First, John's advice is straight on.  You don't need a flamethrower in
most cases, and in fact having one invites one to abuse it.  I think
Purdue has a series of videos on the fastest way to light a barbeque
which end with completely melting the bbq in a matter of seconds using
liquid oxygen.  Fast, but probably the wrong solution for grilling
burgers and hot dogs.

Next, if you already have a lexer, I would NOT change the technology
using it, at least not by much, unless one had a specific reason to do
otherwise.  I might switch from original LEX to Flex (and original
yacc to Bison), but that would be one of the few exceptions to rule.
Now, if I had issues I would switch, but I would first attempt to see
if there are workarounds.

in my last three projects, there was already a lexer-parser
combinartion in use.  So, in my last three projects (and four
implementations) we used: Bison + Flex, ANTLR, Parser-RS, and JAVACC.
In the last two, I don't even know what lexer was used as I never
touched it other than to add keywords.

And that goes to an important point.  Your lexer *should be* almost
trivially simple (i.e. regular expressions only and not complicated
ones).  You rarely want to solve problems at the lexical level.  You
are much less likely to get good error reporting if you do.  In most
cases, your parser should be simple also.  You might want LR parsing
for expressions, but otherwise you want your grammar to be LL(1) (with
the if-then-else hack).

And, all else being equal, if you don't have a lexer-parser
combination you can reuse.  I would pick somewhat based upon
programming language, since most tools are relatively tied to one
language even when they support more than one.

I haven't decided on my favorite for Rust yet, parser-RS isn't bad,
Nom is also popular.

For Java, I would go with ANTLR4.  And, overall, I would say that is
my current favorite despite a few nits.

For C++ or C#, I would use the Yacc++ we wrote even though it needs
some tweaking to catch up to ANTLR at this point.  I prefer our
solution to keywords to what ANTLR has and indirect left recursion
(i.e. parsing expressions with lists of expressions as a first
argument) doesn't work right in ANTLR.

If someone else was paying for it, I would investigate the DMS Toolkit
for Semantic Designs, because they have done most of the work to make
GLR practical.

If I really wanted to solve types in my grammar, I would look into
"meta-ess" by Jackson.  I don't know how available that is.

And, if I were using Scheme or Lisp, I would look into Racket.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [prev] | [next] | [standalone]

#3001 — Simple Lexer and Simple Parser [ was RE: Flex is the most powerful lexical analysis language in the world. True or False? ]

From	Roger L Costello <costello@mitre.org>
Date	2022-05-08 13:34 +0000
Subject	Simple Lexer and Simple Parser [ was RE: Flex is the most powerful lexical analysis language in the world. True or False? ]
Message-ID	<22-05-022@comp.compilers>
In reply to	#2999

Thank you again Chris. Terrific information.

Another question if I may. You wrote:

> And that goes to an important point.  Your lexer *should be* almost
> trivially simple (i.e. regular expressions only and not complicated
> ones).  You rarely want to solve problems at the lexical level.  You
> are much less likely to get good error reporting if you do.  In most
> cases, your parser should be simple also.

For a while now I have been (for fun) working on building a parser for
parsing XML documents. I have experimented with making the lexer
simple and with making the parser simple. If I make the lexer simple,
then the parser is complex. If I make the lexer complex (using lots of
states and making heavy use of Flex's pushdown stack) then the parser
is simple. It doesn't seem possible to make both the lexer and parser
simple.

There are lots of "conditional rules" in XML. For example, in XML the
&amp; is called an "XML entity." Since the & is a reserved symbol, XML
documents need to use &amp; instead of &. An XML parser is to convert
&amp; to &. However, if the &amp; is in certain contexts -- within a
comment or within a CDATA section -- then the &amp; is not converted.
Thus, there is conditional processing:

IF (&amp; is in a comment or in a CDATA section) THEN
    OUTPUT(&amp;)
ELSE
   OUTPUT(&)

Flex's states/stack mechanism is ideally suited for conditional
processing like this. From the section on Start Conditions in the Flex
manual: "flex provides a mechanism for conditionally activating
rules."

So while it would be great to have a simple lexer, I am leaning
towards dealing with the conditional rules in XML using the Flex
states/stack mechanism rather than dealing with the conditional rules
in Bison. In other words, I am leaning towards a complex lexer.

I am interested in hearing your thoughts on this.

> You don't need a flamethrower

My apologies. It wasn't my intent to throw a flame. But in hindsight I
can see that I should have worded things much better. I will do better
in the future.

/Roger

[toc] | [prev] | [next] | [standalone]

#2996

From	George Neuner <gneuner2@comcast.net>
Date	2022-05-06 11:00 -0400
Message-ID	<22-05-011@comp.compilers>
In reply to	#2991

On Wed, 4 May 2022 11:22:34 +0000, Roger L Costello
<costello@mitre.org> wrote:

>Hi Folks,
>1. A lexical analysis language that exclusively provides regular expressions
>for scanning input can only process regular languages.
>2. Flex provides, in addition to regular expressions, states and a pushdown
>stack. This greatly expands the set of languages that can be processed.
>3. Because Flex provides states and a pushdown stack, Flex lexers can process
>context-free languages.
>4. No other lexical analysis language provides states and a pushdown stack.
>5. Flex is the most powerful lexical analysis language in the world.

>[I think that you could easily graft a state stack into any lexer that has start states.
>Also, tools like Antlr combine the lexer and parser generators, so they're at least as
>powerful as flex. -John]

+1 John.

Roger, if you hadn't already asked some more interesting questions, I
would suspect this 'test' was homework.

Flex is powerful, but it certainly is not alone.  As John's response
hinted, there are (plenty of other) tools that more or less are
equivalent.  And not all of them are based on LR.
https://en.wikipedia.org/wiki/Comparison_of_parser_generators

Not to mention that programming languages which tend to actually be
used also tend to be [relatively] easily parsed using LL(k).

LR is useful AS AN IMPLEMENTATION TECHNIQUE, but in general if your
language is complex enough to really /require/ (G)LR or PEG parsing,
it probably is too complicated to be used by average programmers.

YMMV,
George

[toc] | [prev] | [next] | [standalone]

#2997

From	gah4 <gah4@u.washington.edu>
Date	2022-05-06 14:30 -0700
Message-ID	<22-05-015@comp.compilers>
In reply to	#2996

On Friday, May 6, 2022 at 9:14:54 AM UTC-7, George Neuner wrote:

(snip)

> Not to mention that programming languages which tend to actually be
> used also tend to be [relatively] easily parsed using LL(k).

An important part of a programming language is that people can understand it.

I suspect it isn't hard to design a language that computers can easily
parse, but people can't.  Your lexer only needs to be good enough for
actual programming languages.

As with BBQs, that doesn't stop people from trying.
[Take a look at Postscript, which is trivial to tokenize and parse since
it's a stream of tokens in RPN order, but making sense of it
by humans is a challenge.  Or, of course, m4.  -John]

[toc] | [prev] | [next] | [standalone]

#3000 — Re: fun with Postscript, was Flex is the most powerful lexical analysis language in the world. True or False?

From	gah4 <gah4@u.washington.edu>
Date	2022-05-07 13:10 -0700
Subject	Re: fun with Postscript, was Flex is the most powerful lexical analysis language in the world. True or False?
Message-ID	<22-05-020@comp.compilers>
In reply to	#2997

(snip, I wrote)

> An important part of a programming language is that people can understand it.

(snip)

> [Take a look at Postscript, which is trivial to tokenize and parse since
> it's a stream of tokens in RPN order, but making sense of it
> by humans is a challenge. Or, of course, m4. -John]

Reminds me of the origin of RISC, when (I believe) IBM noticed that
compilers were using a small subset of the available instructions, and that
much less programming was being done in assembly.

But okay, was Postscript supposed to be written by people,
or programs?

As with code generated by compilers, it has to be understood
by people at least once, but even then, only step by step, and not
(usually) the whole program at once.

But yes you can write unreadable Postscript.  Once, some years ago,
we (me and some others) needed to redefine def.

[Actually, RISC was at Berkeley, and IBM's project was the 801. But
yes, they noticed compilers used only a fraction of the S/360
instruction set so they made a minimal design that supported only what
their state-of-the art compiler used. RISC was sort of the same but
they used the mediocre PCC compiler which is why they had register
windows to compensate for PCC's weak register allocation. -John]

[toc] | [prev] | [standalone]

csiph-web

Flex is the most powerful lexical analysis language in the world. True or False?

Contents

#2991 — Flex is the most powerful lexical analysis language in the world. True or False?

#2992

#2993

#2994

#2999

#3001 — Simple Lexer and Simple Parser [ was RE: Flex is the most powerful lexical analysis language in the world. True or False? ]

#2996

#2997

#3000 — Re: fun with Postscript, was Flex is the most powerful lexical analysis language in the world. True or False?