Groups > comp.compilers > #3385 > unrolled thread

syntax complexity

Started by	gah4 <gah4@u.washington.edu>
First post	2023-02-15 15:08 -0800
Last post	2023-02-21 20:54 +0200
Articles	14 — 9 participants

Back to article view | Back to comp.compilers

  syntax complexity gah4 <gah4@u.washington.edu> - 2023-02-15 15:08 -0800
    Re: syntax complexity Thomas Koenig <tkoenig@netcologne.de> - 2023-02-16 06:32 +0000
    Re: syntax complexity Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2023-02-16 12:03 +0100
      Re: syntax complexity gah4 <gah4@u.washington.edu> - 2023-02-16 11:33 -0800
        Re: syntax complexity gah4 <gah4@u.washington.edu> - 2023-02-16 16:08 -0800
        Re: syntax complexity Roger L Costello <costello@mitre.org> - 2023-02-20 15:09 +0000
          Re: syntax complexity gah4 <gah4@u.washington.edu> - 2023-02-20 09:57 -0800
            Re: syntax complexity anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2023-02-21 08:14 +0000
              Re: syntax complexity anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2023-02-21 18:39 +0000
                Re: ireegular expressions, syntax complexity anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2023-02-22 10:55 +0000
              Re: irregular expressions, syntax complexity arnold@freefriends.org (Aharon Robbins) - 2023-02-22 08:53 +0000
              Re: irregular expressions, syntax complexity Kaz Kylheku <864-117-4973@kylheku.com> - 2023-02-23 00:34 +0000
          Re: syntax complexity George Neuner <gneuner2@comcast.net> - 2023-02-20 13:49 -0500
        syntax complexity Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2023-02-21 20:54 +0200

#3385 — syntax complexity

From	gah4 <gah4@u.washington.edu>
Date	2023-02-15 15:08 -0800
Subject	syntax complexity
Message-ID	<23-02-045@comp.compilers>

I started this in another thread, but I think it deserves its own.

The question is, how does one measure syntax complexity, with the
specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
for now.)

Even back to the 1970's, I always found that Fortran had strange and
(seemingly) arbitrary rules on its syntax.

Some of them made sense on the small, early, computers, but didn't go
away later. Even worse, as new features were added, and some of the
rules were relaxed, new ones appeared.

On the other hand, PL/I from the beginning had more general rules.
And, it seems to me, more general rules lead to simpler syntax. That
is especially true for expressions, where PL/I allows pretty much the
same expression syntax everywhere it allows expressions.

PL/I had internal procedures from the beginning, though only added to
Fortran much later. But when they were, Fortran didn't allow them to
be nested. Only one level deep! That might have changed more recently,
or maybe now two levels.

One complication I see, is that syntax complexity as seen by people,
might be different from it as seen by programs.

[toc] | [next] | [standalone]

#3386

From	Thomas Koenig <tkoenig@netcologne.de>
Date	2023-02-16 06:32 +0000
Message-ID	<23-02-046@comp.compilers>
In reply to	#3385

gah4 <gah4@u.washington.edu> schrieb:
> I started this in another thread, but I think it deserves its own.
>
> The question is, how does one measure syntax complexity, with the
> specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
> for now.)

A reasonable zeroth-order approximation is the length of the language
standard (which would also include semantics).

Fortran 2018: 599 pages, without the index.

C++ 2020: 1663 pages, without cross-references and indices.

C (2020 draft): 393 pages without annexes, 572 pages including
the non-normative annexes.

It is probably not fair to compare document sizes of early standards
like the very first programming language standard, Fortran 66.
That was very short at 36 pages, but people were still learning
how to write language standards at the time.

Another very rough measure would be the size of a front end in the
same compiler.  Using the even rougher estimate of text lines
for gcc, I get

$ cat fortran/*.cc fortran/*.h | wc -l
226682
$ cat c/*.cc c/*.h | wc -l
60548
$ cat d/*.cc | wc -l
24850
$ cat ada/*.adb ada/*.ads ada/*.c ada/*.h  | wc -l
731635

(Right now, I'm not sure how to identify all the files in the C++
front end. Ada may be misrepresented because the compiler is mostly
written in Ada instead of C++, and it is a more verbose language than
C++. The Fortran front end is nominally C++, but due to its history
mostly sticks to the common subset of C and C++. This is text lines,
not statements. I'm not sure if I included the Ada runtime library in
that estimate or not. Like I wrote above, a rough estimate).

[toc] | [prev] | [next] | [standalone]

#3387

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2023-02-16 12:03 +0100
Message-ID	<23-02-047@comp.compilers>
In reply to	#3385

On 2/16/23 12:08 AM, gah4 wrote:
> I started this in another thread, but I think it deserves its own.
>
> The question is, how does one measure syntax complexity,

In the original thread I was missing more specific distinctions:
- tools: flex/bison, ANTLR, CoCo/R...
- parser types: LL, LR, LF, LALR...
-   or(?): recursive-descent, shift-reduce...
- grammar syntax: BNF, EBNF...
- grammar types: context-free, -sensitive...

> with the
> specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics,
> for now.)
[...]

On that aspect I can not contribute anything, sorry :-(


> One complication I see, is that syntax complexity as seen by people,
> might be different from it as seen by programs.

How do programs recognize syntax complexity?
Number of rules and exceptions?

DoDi

[toc] | [prev] | [next] | [standalone]

#3389

From	gah4 <gah4@u.washington.edu>
Date	2023-02-16 11:33 -0800
Message-ID	<23-02-050@comp.compilers>
In reply to	#3387

On Thursday, February 16, 2023 at 9:57:53 AM UTC-8, Hans-Peter Diettrich wrote:
> On 2/16/23 12:08 AM, gah4 wrote:

(snip)

> > One complication I see, is that syntax complexity as seen by people,
> > might be different from it as seen by programs.

> How do programs recognize syntax complexity?
> Number of rules and exceptions?

When I wrote that one, I was thinking about the places where Fortran uses
special characters and PL/I uses words.

     DO I=1,10,3

     DO I = 1 TO 10 BY 3;

I think about them in a different way, such that the thought complexity is different.

A compiler doesn't "think" in that way.

I suppose I agree with the above, the length of the standard, with some
assumptions on how it is written, or the length of the front end.
[Having written a couple of Fortran parsers, I can say that while the hacks
to deal with ignored spaces were ugly, they weren't that hard.  PL/I has a
separate issue that the same token might be a keyword or a variable depending
on context, and the kinds of parsers you build with bison et al don't deal
very well with that. -John]

[toc] | [prev] | [next] | [standalone]

#3390

From	gah4 <gah4@u.washington.edu>
Date	2023-02-16 16:08 -0800
Message-ID	<23-02-051@comp.compilers>
In reply to	#3389

On Thursday, February 16, 2023 at 2:46:25 PM UTC-8, gah4 wrote:

(snip)

> When I wrote that one, I was thinking about the places where Fortran uses
> special characters and PL/I uses words.

> DO I=1,10,3

> DO I = 1 TO 10 BY 3;

> I think about them in a different way, such that the thought complexity is different.

> A compiler doesn't "think" in that way.

> I suppose I agree with the above, the length of the standard, with some
> assumptions on how it is written, or the length of the front end.

(and our moderator wrote)
> [Having written a couple of Fortran parsers, I can say that while the hacks
> to deal with ignored spaces were ugly, they weren't that hard. PL/I has a
> separate issue that the same token might be a keyword or a variable depending
> on context, and the kinds of parsers you build with bison et al don't deal
> very well with that. -John]

Well there is that, but so far I was only thinking about the difference
between commas and keywords.

      WRITE(2,*) A, B, C
      PUT  FILE(TWO) SKIP LIST(A, B, C);

or

      READ(3'K) X
      READ FILE(THREE) INTO(X) KEY(K);

the funny IBM use of a single apostrophe for direct access files.

I had a part of a summer undergrad project working with the STEP
macro processor, which I wrote about some time ago.
I was writing the parser for IBM Fortran (not including leaving
out spaces between words), but there is no way to match a single
apostrophe!   Parsing string constants was done at a very low level.
[Like I said the hacks were ugly.  For example, this statement
contains a Hollerith constant:

123   FORMAT(4HELLO)

but this one does not:

      REAL*4HELLO

I was doing F77 which didn't do that strange quote thing but it's easy
enough to tell from context, since a quoted string has to follow a
punctuation mark that is not a close paren. -John]

[toc] | [prev] | [next] | [standalone]

#3391

From	Roger L Costello <costello@mitre.org>
Date	2023-02-20 15:09 +0000
Message-ID	<23-02-052@comp.compilers>
In reply to	#3389

Hello Compiler Experts!

Scenario: you have a language that has a BNF. You write a statement in the language. It is a relatively simple, basic statement. The statement conforms to the BNF. To show its conformance, you write the derivation of the statement. Surprisingly, deriving the statement takes many, many rules. Does that signify that the language's syntax is too complex?

/Roger

[toc] | [prev] | [next] | [standalone]

#3392

From	gah4 <gah4@u.washington.edu>
Date	2023-02-20 09:57 -0800
Message-ID	<23-02-053@comp.compilers>
In reply to	#3391

On Monday, February 20, 2023 at 9:46:04 AM UTC-8, Roger L Costello wrote:

> Scenario: you have a language that has a BNF. You write a statement in the language.
> It is a relatively simple, basic statement. The statement conforms to the BNF.
> To show its conformance, you write the derivation of the statement.
> Surprisingly, deriving the statement takes many, many rules.
> Does that signify that the language's syntax is too complex?

One suggestion above based it on the size of the standard. That isn't
quite as good as it could be, as some standard writers are wordier
than others, but maybe not so bad.

Another choice is on the size of the BNF.

I don't think I would rate it on one statement, but the whole BNF of
the language. One statement, such as an assignment statement, could
use a lot of rules. Whole language syntax complexity should be on the
whole BNF.

One complication, depending on how you write BNF. Some statements,
such as in PL/I, have items that can be in any order, but you only
have at most one of each. That is complicated to write in BNF, but I
don't believe complicates the syntax, as seen by people.
[There are definitely things that are hard to say in BNF, even though
they're intuitively simple.  Another is floating point numbers with
optional "." and "E" but you need at least one of them. -John]

[toc] | [prev] | [next] | [standalone]

#3394

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2023-02-21 08:14 +0000
Message-ID	<23-02-055@comp.compilers>
In reply to	#3392

Our moderator writes:

>[There are definitely things that are hard to say in BNF, even though
>they're intuitively simple.

An example is the solution to the dangling-else problem that we
discussed some time ago.  At least one of the language standards I
looked at during this discussion specified an ambiguous grammar for
the IF-statement, with the disambiguation coming from prose, rather
than specifying an unambiguous grammar.

Of course, better than either solution is to design the language to
require that an IF-statement is terminated with, e.g., fi (Algol 68) or END
(Modula-2).

>Another is floating point numbers with
>optional "." and "E" but you need at least one of them. -John]

Regular expression syntax is missing an operator that signifies the
intersection of the sets recognized by the operand regexps.  Let's
call this operator "&".  Then this requirement for an FP number can be
expressed as:

([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*

"[0-9.]*" specifies a lenient form of the mantissa part; ".*[0-9].*"
specifies a string that has at least one digit; so
"([0-9.]*&.*[0-9].*)" says that the mantissa part can contain digits
and "." and must contain one digit.  "(E[0-9]+)?" specifies the
optional exponent part.  So "([0-9.]*&.*[0-9].*)(E[0-9]+)" is a
lenient form of an FP number that may miss both "." and "E".
".*[.E].*" specifies the requirement stated above: The number must
contain a "." or an "E".  Again the "&" combines these requirements.

You can translate regexps with & to a DFA, and don't lose any
performance there.  For an NFA-based implementation, the only way I
see is that you process both sub-NFAs and check if they both accept
the string, so it's somewhat slower; I guess that this is a major
reason why such an operator has not been added to widely-used regexp
syntaxes.

Has anybody implemented such an operator at all?

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[Wouldn't that pattern allow 1.2.3 ? -John]

[toc] | [prev] | [next] | [standalone]

#3398

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2023-02-21 18:39 +0000
Message-ID	<23-02-059@comp.compilers>
In reply to	#3394

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Regular expression syntax is missing an operator that signifies the
>intersection of the sets recognized by the operand regexps.  Let's
>call this operator "&".  Then this requirement for an FP number can be
>expressed as:
>
>([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*
...
>[Wouldn't that pattern allow 1.2.3 ? -John]

Good point.  The & operator makes it easy to fix this bug (once you
have found it):

([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*&[^.]*[.]?[^.]*

But it suggests that I tried to put too many requirements into the
first part, so let's try again:

At most one "." in front of at most one "E":  [0-9]*[.]?[0-9]*E?[0-9]*
At least one of "." or "E":                   .*[.E].*
At least one digit in front of and after "E": .*[0-9].*E[0-9]+

In total:

[0-9]*[.]?[0-9]*E?[0-9]*&.*[.E].*&.*[0-9].*E[0-9]+

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

[toc] | [prev] | [next] | [standalone]

#3400 — Re: ireegular expressions, syntax complexity

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2023-02-22 10:55 +0000
Subject	Re: ireegular expressions, syntax complexity
Message-ID	<23-02-061@comp.compilers>
In reply to	#3398

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>At most one "." in front of at most one "E":  [0-9]*[.]?[0-9]*E?[0-9]*
>At least one of "." or "E":                   .*[.E].*
>At least one digit in front of and after "E": .*[0-9].*E[0-9]+

There's still a bug here, because that makes "E" non-optional.  Let's
make it optional:

At least one digit, when using E one before and one after: .*[0-9].*(E[0-9]+)?

In total:

[0-9]*[.]?[0-9]*E?[0-9]*&.*[.E].*&.*[0-9].*(E[0-9]+)?

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[This doesn't seem all that much simpler than the usual approach with
alternation, give or take bugs:
  ([0-9]+\.[0-9]*|\.[0-9]+)(E[0-9]+)?|[0-9]+E[0-9]+
-John]

[toc] | [prev] | [next] | [standalone]

#3399 — Re: irregular expressions, syntax complexity

From	arnold@freefriends.org (Aharon Robbins)
Date	2023-02-22 08:53 +0000
Subject	Re: irregular expressions, syntax complexity
Message-ID	<23-02-060@comp.compilers>
In reply to	#3394

In article <23-02-055@comp.compilers>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>Regular expression syntax is missing an operator that signifies the
>intersection of the sets recognized by the operand regexps.  Let's
>call this operator "&". ....
>
>Has anybody implemented such an operator at all?

Doug McIlroy did, if I understand things correctly. See
https://github.com/arnoldrobbins/mcilroy-regex/blob/master/README
for more info.

Arnold

[toc] | [prev] | [next] | [standalone]

#3401 — Re: irregular expressions, syntax complexity

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-02-23 00:34 +0000
Subject	Re: irregular expressions, syntax complexity
Message-ID	<23-02-062@comp.compilers>
In reply to	#3394

On 2023-02-21, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> You can translate regexps with & to a DFA, and don't lose any
> performance there.  For an NFA-based implementation, the only way I
> see is that you process both sub-NFAs and check if they both accept
> the string, so it's somewhat slower; I guess that this is a major
> reason why such an operator has not been added to widely-used regexp
> syntaxes.
>
> Has anybody implemented such an operator at all?

I implemented & in the TXR language.

It's based on Brzozowski regex derivatives (described first in 1964,
IIRC).

Derivatives allow a regex syntax with & to be interpreted directly,
and without backtracking or trying sub-cases; effectively, the approach
is equivalent to an NFA.

(Derivatives can be used for compiling also. The problem you have to
solve is deciding when two regexes are equivalent, so that you can
close loops in the state machine graph. The better you can do this, the
tighter the graph will be,)

The idea behind a "derivative" is to take an input symbol S and a regex
R, and calculate a new expression R' which matches the rest of the input
after S.

E .g. the derivativre of "abc" with respect to "a" would be "bc".
Likewise with respect to "." (match any single character).

The derivative of "abc" with respect to d would be ∅, the nomatch regex
which describes/matches nothing: empty set of strings. (Not to be
confused with the empty regex that matches the empty string, denoting
the set { "" }).

To match a string of symbols, you simply take successive derivatives. If
you hit ∅, you can short-circuit to the conclusion that the input
doesn't match.

When you run out of input symbols, the remaining derivative must be
nullable: nullable means capable of matching the empty string. If it
isn't, it means the regex wants more characters, so you don't have
a match.

The empty set ∅ regex isn't nullable, needless to say; a regex is
nullable if the language (set of strings) contains the empty string.

Anyway, you can represent regexes in Lisp syntax and then the derivative
operation is just a functional syntax -> syntax calculation (which
generates a lot of garbage as we match strings).

The & operator is easily supported because the derivative operation
readily distributes into the & operatror according to an easy rule.

  d(S, R1 & R2)  -> d(S, R1) & d(S, R2)

Basically you take the derivatives of the branches, and combine them
with & again.

The more familiar | operator distributes the same way.

IN an actual implementation, there are tricks you can pull, like
short-circuit evaluation. If you take the d(S, R1) derivative of the
left side of R1&R2, and that turns out to be ∅, you can short-circuit
to ∅, If you have way to detect that a regular expression A matches all
strings (like .*) then  you know that A&R is equiavlent to R; you can
drop that branch, similarly A|R is A.

You can see that this is amenable to compilation, similar to the subset
construction; you can calculate a derivative with respect to all
possible input symbols and build a graph in which the regex syntax
itself (all the derivatives) are the states. To close loops in the
graph, you need to detect states you have seen before, which requires
equivalence test between two syntax fragments. The fewer false negatives
you have, the tighter the graph.

TXR's regex engine has a front end which checks whether the syntax
contains the "exotic operators" and chooses either the derivative-based
back end or compilation to NFA and simulation.

I haven't bothered implementing better compilation for either back end;
if a collaborator came along to explore those possibilities, that would
be great. Most of that work was done over a decade ago. There is now a
powerful Lisp dialect that would make easy work of writing a
derivative-based regex compiler.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [next] | [standalone]

#3393

From	George Neuner <gneuner2@comcast.net>
Date	2023-02-20 13:49 -0500
Message-ID	<23-02-054@comp.compilers>
In reply to	#3391

On Mon, 20 Feb 2023 15:09:18 +0000, Roger L Costello
<costello@mitre.org> wrote:

>Hello Compiler Experts!
>
>Scenario: you have a language that has a BNF. You write a statement in
>the language. It is a relatively simple, basic statement. The
>statement conforms to the BNF. To show its conformance, you write the
>derivation of the statement. Surprisingly, deriving the statement
>takes many, many rules. Does that signify that the language's syntax
>is too complex?
>
>/Roger

Not necessarily.

What makes a language complicated (not "complex") is ambiguity, not
the number of grammar rules needed to recognize some particular
expression.

If you are restricted to BNF ... i.e. your tool does not allow
specifying precedence ... then recognizing even relatively simple
arithmetic expressions can (perhaps recursively) involve several
rules.

George

[toc] | [prev] | [next] | [standalone]

#3395

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Date	2023-02-21 20:54 +0200
Message-ID	<23-02-056@comp.compilers>
In reply to	#3389

I want to expand on what
George Neuner <gneuner2@comcast.net> wrote:

> If you are restricted to BNF ... i.e. your tool does not allow
> specifying precedence ... then recognizing even relatively simple
> arithmetic expressions can (perhaps recursively) involve several
> rules.

Without precedence specifications, strict BNF does require nested
rules to be created and named to implement the "natural" hierarchy of
expression rules, e.g. that one does the multiplications before the
additions. This was already recognized in 1973, when Steve Johnson et
al, described how such specifications are implemented in yacc. See
"Deterministic Parsing of Ambiguous Grammars".
(https://dl.acm.org/doi/epdf/10.1145/360933.360969) Of course, the
precedence parsing methods predate that paper, but it showed how one
could easily annotate a BNF grammar to incorporate the idea.

You can take the idea farther if you use numeric preferences for
operators. That gives one a simple way of expressing the ordering of
the operators and one "shifts" operators with higher or equal
precedence and "reduces" when seeing an operator of lower precedence.
(Note you can specify higher by either larger or smaller numbers. E.g.
on a scale of 1 to 10, 1 can be highest, 1st place, or lowest, last
place. One simply has to decide which way one wants it.)

Similarly, in more recent times, the idea of ordering productions to
specify precedence is used in Terence Parr's ANTLR4 to good effect.
Again, the idea is not new. LEX used rule ordering to disambiguate
which regular expression to use (and I suspect it was used before
that).  The same idea also is used in most Parsing Expression
Grammar (PEG) implementations.

Next, there was a paper in the 1980s or 90s whose title and author
I have unfortunately forgotten, although I know it was published in
TOPLAS, that suggested using examples to express preferences.
And, that clearly goes back to the operators used in precedence
parsing, e.g. ".<." which expressed which grouping had higher
precedence and thus who should shift and who should reduce.

But, no matter how one specifies it, the idea is the same.
--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------"

[toc] | [prev] | [standalone]

csiph-web

syntax complexity

Contents

#3385 — syntax complexity

#3386

#3387

#3389

#3390

#3391

#3392

#3394

#3398

#3400 — Re: ireegular expressions, syntax complexity

#3399 — Re: irregular expressions, syntax complexity

#3401 — Re: irregular expressions, syntax complexity

#3393

#3395