Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #3385 > unrolled thread
| Started by | gah4 <gah4@u.washington.edu> |
|---|---|
| First post | 2023-02-15 15:08 -0800 |
| Last post | 2023-02-21 20:54 +0200 |
| Articles | 14 — 9 participants |
Back to article view | Back to comp.compilers
syntax complexity gah4 <gah4@u.washington.edu> - 2023-02-15 15:08 -0800
Re: syntax complexity Thomas Koenig <tkoenig@netcologne.de> - 2023-02-16 06:32 +0000
Re: syntax complexity Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2023-02-16 12:03 +0100
Re: syntax complexity gah4 <gah4@u.washington.edu> - 2023-02-16 11:33 -0800
Re: syntax complexity gah4 <gah4@u.washington.edu> - 2023-02-16 16:08 -0800
Re: syntax complexity Roger L Costello <costello@mitre.org> - 2023-02-20 15:09 +0000
Re: syntax complexity gah4 <gah4@u.washington.edu> - 2023-02-20 09:57 -0800
Re: syntax complexity anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2023-02-21 08:14 +0000
Re: syntax complexity anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2023-02-21 18:39 +0000
Re: ireegular expressions, syntax complexity anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2023-02-22 10:55 +0000
Re: irregular expressions, syntax complexity arnold@freefriends.org (Aharon Robbins) - 2023-02-22 08:53 +0000
Re: irregular expressions, syntax complexity Kaz Kylheku <864-117-4973@kylheku.com> - 2023-02-23 00:34 +0000
Re: syntax complexity George Neuner <gneuner2@comcast.net> - 2023-02-20 13:49 -0500
syntax complexity Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2023-02-21 20:54 +0200
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2023-02-15 15:08 -0800 |
| Subject | syntax complexity |
| Message-ID | <23-02-045@comp.compilers> |
I started this in another thread, but I think it deserves its own. The question is, how does one measure syntax complexity, with the specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics, for now.) Even back to the 1970's, I always found that Fortran had strange and (seemingly) arbitrary rules on its syntax. Some of them made sense on the small, early, computers, but didn't go away later. Even worse, as new features were added, and some of the rules were relaxed, new ones appeared. On the other hand, PL/I from the beginning had more general rules. And, it seems to me, more general rules lead to simpler syntax. That is especially true for expressions, where PL/I allows pretty much the same expression syntax everywhere it allows expressions. PL/I had internal procedures from the beginning, though only added to Fortran much later. But when they were, Fortran didn't allow them to be nested. Only one level deep! That might have changed more recently, or maybe now two levels. One complication I see, is that syntax complexity as seen by people, might be different from it as seen by programs.
[toc] | [next] | [standalone]
| From | Thomas Koenig <tkoenig@netcologne.de> |
|---|---|
| Date | 2023-02-16 06:32 +0000 |
| Message-ID | <23-02-046@comp.compilers> |
| In reply to | #3385 |
gah4 <gah4@u.washington.edu> schrieb: > I started this in another thread, but I think it deserves its own. > > The question is, how does one measure syntax complexity, with the > specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics, > for now.) A reasonable zeroth-order approximation is the length of the language standard (which would also include semantics). Fortran 2018: 599 pages, without the index. C++ 2020: 1663 pages, without cross-references and indices. C (2020 draft): 393 pages without annexes, 572 pages including the non-normative annexes. It is probably not fair to compare document sizes of early standards like the very first programming language standard, Fortran 66. That was very short at 36 pages, but people were still learning how to write language standards at the time. Another very rough measure would be the size of a front end in the same compiler. Using the even rougher estimate of text lines for gcc, I get $ cat fortran/*.cc fortran/*.h | wc -l 226682 $ cat c/*.cc c/*.h | wc -l 60548 $ cat d/*.cc | wc -l 24850 $ cat ada/*.adb ada/*.ads ada/*.c ada/*.h | wc -l 731635 (Right now, I'm not sure how to identify all the files in the C++ front end. Ada may be misrepresented because the compiler is mostly written in Ada instead of C++, and it is a more verbose language than C++. The Fortran front end is nominally C++, but due to its history mostly sticks to the common subset of C and C++. This is text lines, not statements. I'm not sure if I included the Ada runtime library in that estimate or not. Like I wrote above, a rough estimate).
[toc] | [prev] | [next] | [standalone]
| From | Hans-Peter Diettrich <DrDiettrich1@netscape.net> |
|---|---|
| Date | 2023-02-16 12:03 +0100 |
| Message-ID | <23-02-047@comp.compilers> |
| In reply to | #3385 |
On 2/16/23 12:08 AM, gah4 wrote: > I started this in another thread, but I think it deserves its own. > > The question is, how does one measure syntax complexity, In the original thread I was missing more specific distinctions: - tools: flex/bison, ANTLR, CoCo/R... - parser types: LL, LR, LF, LALR... - or(?): recursive-descent, shift-reduce... - grammar syntax: BNF, EBNF... - grammar types: context-free, -sensitive... > with the > specific case of Fortran vs. PL/I. (And ignoring syntax vs. semantics, > for now.) [...] On that aspect I can not contribute anything, sorry :-( > One complication I see, is that syntax complexity as seen by people, > might be different from it as seen by programs. How do programs recognize syntax complexity? Number of rules and exceptions? DoDi
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2023-02-16 11:33 -0800 |
| Message-ID | <23-02-050@comp.compilers> |
| In reply to | #3387 |
On Thursday, February 16, 2023 at 9:57:53 AM UTC-8, Hans-Peter Diettrich wrote:
> On 2/16/23 12:08 AM, gah4 wrote:
(snip)
> > One complication I see, is that syntax complexity as seen by people,
> > might be different from it as seen by programs.
> How do programs recognize syntax complexity?
> Number of rules and exceptions?
When I wrote that one, I was thinking about the places where Fortran uses
special characters and PL/I uses words.
DO I=1,10,3
DO I = 1 TO 10 BY 3;
I think about them in a different way, such that the thought complexity is different.
A compiler doesn't "think" in that way.
I suppose I agree with the above, the length of the standard, with some
assumptions on how it is written, or the length of the front end.
[Having written a couple of Fortran parsers, I can say that while the hacks
to deal with ignored spaces were ugly, they weren't that hard. PL/I has a
separate issue that the same token might be a keyword or a variable depending
on context, and the kinds of parsers you build with bison et al don't deal
very well with that. -John]
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2023-02-16 16:08 -0800 |
| Message-ID | <23-02-051@comp.compilers> |
| In reply to | #3389 |
On Thursday, February 16, 2023 at 2:46:25 PM UTC-8, gah4 wrote:
(snip)
> When I wrote that one, I was thinking about the places where Fortran uses
> special characters and PL/I uses words.
> DO I=1,10,3
> DO I = 1 TO 10 BY 3;
> I think about them in a different way, such that the thought complexity is different.
> A compiler doesn't "think" in that way.
> I suppose I agree with the above, the length of the standard, with some
> assumptions on how it is written, or the length of the front end.
(and our moderator wrote)
> [Having written a couple of Fortran parsers, I can say that while the hacks
> to deal with ignored spaces were ugly, they weren't that hard. PL/I has a
> separate issue that the same token might be a keyword or a variable depending
> on context, and the kinds of parsers you build with bison et al don't deal
> very well with that. -John]
Well there is that, but so far I was only thinking about the difference
between commas and keywords.
WRITE(2,*) A, B, C
PUT FILE(TWO) SKIP LIST(A, B, C);
or
READ(3'K) X
READ FILE(THREE) INTO(X) KEY(K);
the funny IBM use of a single apostrophe for direct access files.
I had a part of a summer undergrad project working with the STEP
macro processor, which I wrote about some time ago.
I was writing the parser for IBM Fortran (not including leaving
out spaces between words), but there is no way to match a single
apostrophe! Parsing string constants was done at a very low level.
[Like I said the hacks were ugly. For example, this statement
contains a Hollerith constant:
123 FORMAT(4HELLO)
but this one does not:
REAL*4HELLO
I was doing F77 which didn't do that strange quote thing but it's easy
enough to tell from context, since a quoted string has to follow a
punctuation mark that is not a close paren. -John]
[toc] | [prev] | [next] | [standalone]
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2023-02-20 15:09 +0000 |
| Message-ID | <23-02-052@comp.compilers> |
| In reply to | #3389 |
Hello Compiler Experts! Scenario: you have a language that has a BNF. You write a statement in the language. It is a relatively simple, basic statement. The statement conforms to the BNF. To show its conformance, you write the derivation of the statement. Surprisingly, deriving the statement takes many, many rules. Does that signify that the language's syntax is too complex? /Roger
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2023-02-20 09:57 -0800 |
| Message-ID | <23-02-053@comp.compilers> |
| In reply to | #3391 |
On Monday, February 20, 2023 at 9:46:04 AM UTC-8, Roger L Costello wrote: > Scenario: you have a language that has a BNF. You write a statement in the language. > It is a relatively simple, basic statement. The statement conforms to the BNF. > To show its conformance, you write the derivation of the statement. > Surprisingly, deriving the statement takes many, many rules. > Does that signify that the language's syntax is too complex? One suggestion above based it on the size of the standard. That isn't quite as good as it could be, as some standard writers are wordier than others, but maybe not so bad. Another choice is on the size of the BNF. I don't think I would rate it on one statement, but the whole BNF of the language. One statement, such as an assignment statement, could use a lot of rules. Whole language syntax complexity should be on the whole BNF. One complication, depending on how you write BNF. Some statements, such as in PL/I, have items that can be in any order, but you only have at most one of each. That is complicated to write in BNF, but I don't believe complicates the syntax, as seen by people. [There are definitely things that are hard to say in BNF, even though they're intuitively simple. Another is floating point numbers with optional "." and "E" but you need at least one of them. -John]
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2023-02-21 08:14 +0000 |
| Message-ID | <23-02-055@comp.compilers> |
| In reply to | #3392 |
Our moderator writes: >[There are definitely things that are hard to say in BNF, even though >they're intuitively simple. An example is the solution to the dangling-else problem that we discussed some time ago. At least one of the language standards I looked at during this discussion specified an ambiguous grammar for the IF-statement, with the disambiguation coming from prose, rather than specifying an unambiguous grammar. Of course, better than either solution is to design the language to require that an IF-statement is terminated with, e.g., fi (Algol 68) or END (Modula-2). >Another is floating point numbers with >optional "." and "E" but you need at least one of them. -John] Regular expression syntax is missing an operator that signifies the intersection of the sets recognized by the operand regexps. Let's call this operator "&". Then this requirement for an FP number can be expressed as: ([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].* "[0-9.]*" specifies a lenient form of the mantissa part; ".*[0-9].*" specifies a string that has at least one digit; so "([0-9.]*&.*[0-9].*)" says that the mantissa part can contain digits and "." and must contain one digit. "(E[0-9]+)?" specifies the optional exponent part. So "([0-9.]*&.*[0-9].*)(E[0-9]+)" is a lenient form of an FP number that may miss both "." and "E". ".*[.E].*" specifies the requirement stated above: The number must contain a "." or an "E". Again the "&" combines these requirements. You can translate regexps with & to a DFA, and don't lose any performance there. For an NFA-based implementation, the only way I see is that you process both sub-NFAs and check if they both accept the string, so it's somewhat slower; I guess that this is a major reason why such an operator has not been added to widely-used regexp syntaxes. Has anybody implemented such an operator at all? - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/ [Wouldn't that pattern allow 1.2.3 ? -John]
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2023-02-21 18:39 +0000 |
| Message-ID | <23-02-059@comp.compilers> |
| In reply to | #3394 |
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes: >Regular expression syntax is missing an operator that signifies the >intersection of the sets recognized by the operand regexps. Let's >call this operator "&". Then this requirement for an FP number can be >expressed as: > >([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].* ... >[Wouldn't that pattern allow 1.2.3 ? -John] Good point. The & operator makes it easy to fix this bug (once you have found it): ([0-9.]*&.*[0-9].*)(E[0-9]+)?&.*[.E].*&[^.]*[.]?[^.]* But it suggests that I tried to put too many requirements into the first part, so let's try again: At most one "." in front of at most one "E": [0-9]*[.]?[0-9]*E?[0-9]* At least one of "." or "E": .*[.E].* At least one digit in front of and after "E": .*[0-9].*E[0-9]+ In total: [0-9]*[.]?[0-9]*E?[0-9]*&.*[.E].*&.*[0-9].*E[0-9]+ - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2023-02-22 10:55 +0000 |
| Subject | Re: ireegular expressions, syntax complexity |
| Message-ID | <23-02-061@comp.compilers> |
| In reply to | #3398 |
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes: >anton@mips.complang.tuwien.ac.at (Anton Ertl) writes: >At most one "." in front of at most one "E": [0-9]*[.]?[0-9]*E?[0-9]* >At least one of "." or "E": .*[.E].* >At least one digit in front of and after "E": .*[0-9].*E[0-9]+ There's still a bug here, because that makes "E" non-optional. Let's make it optional: At least one digit, when using E one before and one after: .*[0-9].*(E[0-9]+)? In total: [0-9]*[.]?[0-9]*E?[0-9]*&.*[.E].*&.*[0-9].*(E[0-9]+)? - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/ [This doesn't seem all that much simpler than the usual approach with alternation, give or take bugs: ([0-9]+\.[0-9]*|\.[0-9]+)(E[0-9]+)?|[0-9]+E[0-9]+ -John]
[toc] | [prev] | [next] | [standalone]
| From | arnold@freefriends.org (Aharon Robbins) |
|---|---|
| Date | 2023-02-22 08:53 +0000 |
| Subject | Re: irregular expressions, syntax complexity |
| Message-ID | <23-02-060@comp.compilers> |
| In reply to | #3394 |
In article <23-02-055@comp.compilers>, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote: >Regular expression syntax is missing an operator that signifies the >intersection of the sets recognized by the operand regexps. Let's >call this operator "&". .... > >Has anybody implemented such an operator at all? Doug McIlroy did, if I understand things correctly. See https://github.com/arnoldrobbins/mcilroy-regex/blob/master/README for more info. Arnold
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <864-117-4973@kylheku.com> |
|---|---|
| Date | 2023-02-23 00:34 +0000 |
| Subject | Re: irregular expressions, syntax complexity |
| Message-ID | <23-02-062@comp.compilers> |
| In reply to | #3394 |
On 2023-02-21, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> You can translate regexps with & to a DFA, and don't lose any
> performance there. For an NFA-based implementation, the only way I
> see is that you process both sub-NFAs and check if they both accept
> the string, so it's somewhat slower; I guess that this is a major
> reason why such an operator has not been added to widely-used regexp
> syntaxes.
>
> Has anybody implemented such an operator at all?
I implemented & in the TXR language.
It's based on Brzozowski regex derivatives (described first in 1964,
IIRC).
Derivatives allow a regex syntax with & to be interpreted directly,
and without backtracking or trying sub-cases; effectively, the approach
is equivalent to an NFA.
(Derivatives can be used for compiling also. The problem you have to
solve is deciding when two regexes are equivalent, so that you can
close loops in the state machine graph. The better you can do this, the
tighter the graph will be,)
The idea behind a "derivative" is to take an input symbol S and a regex
R, and calculate a new expression R' which matches the rest of the input
after S.
E .g. the derivativre of "abc" with respect to "a" would be "bc".
Likewise with respect to "." (match any single character).
The derivative of "abc" with respect to d would be ∅, the nomatch regex
which describes/matches nothing: empty set of strings. (Not to be
confused with the empty regex that matches the empty string, denoting
the set { "" }).
To match a string of symbols, you simply take successive derivatives. If
you hit ∅, you can short-circuit to the conclusion that the input
doesn't match.
When you run out of input symbols, the remaining derivative must be
nullable: nullable means capable of matching the empty string. If it
isn't, it means the regex wants more characters, so you don't have
a match.
The empty set ∅ regex isn't nullable, needless to say; a regex is
nullable if the language (set of strings) contains the empty string.
Anyway, you can represent regexes in Lisp syntax and then the derivative
operation is just a functional syntax -> syntax calculation (which
generates a lot of garbage as we match strings).
The & operator is easily supported because the derivative operation
readily distributes into the & operatror according to an easy rule.
d(S, R1 & R2) -> d(S, R1) & d(S, R2)
Basically you take the derivatives of the branches, and combine them
with & again.
The more familiar | operator distributes the same way.
IN an actual implementation, there are tricks you can pull, like
short-circuit evaluation. If you take the d(S, R1) derivative of the
left side of R1&R2, and that turns out to be ∅, you can short-circuit
to ∅, If you have way to detect that a regular expression A matches all
strings (like .*) then you know that A&R is equiavlent to R; you can
drop that branch, similarly A|R is A.
You can see that this is amenable to compilation, similar to the subset
construction; you can calculate a derivative with respect to all
possible input symbols and build a graph in which the regex syntax
itself (all the derivatives) are the states. To close loops in the
graph, you need to detect states you have seen before, which requires
equivalence test between two syntax fragments. The fewer false negatives
you have, the tighter the graph.
TXR's regex engine has a front end which checks whether the syntax
contains the "exotic operators" and chooses either the derivative-based
back end or compilation to NFA and simulation.
I haven't bothered implementing better compilation for either back end;
if a collaborator came along to explore those possibilities, that would
be great. Most of that work was done over a decade ago. There is now a
powerful Lisp dialect that would make easy work of writing a
derivative-based regex compiler.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
[toc] | [prev] | [next] | [standalone]
| From | George Neuner <gneuner2@comcast.net> |
|---|---|
| Date | 2023-02-20 13:49 -0500 |
| Message-ID | <23-02-054@comp.compilers> |
| In reply to | #3391 |
On Mon, 20 Feb 2023 15:09:18 +0000, Roger L Costello <costello@mitre.org> wrote: >Hello Compiler Experts! > >Scenario: you have a language that has a BNF. You write a statement in >the language. It is a relatively simple, basic statement. The >statement conforms to the BNF. To show its conformance, you write the >derivation of the statement. Surprisingly, deriving the statement >takes many, many rules. Does that signify that the language's syntax >is too complex? > >/Roger Not necessarily. What makes a language complicated (not "complex") is ambiguity, not the number of grammar rules needed to recognize some particular expression. If you are restricted to BNF ... i.e. your tool does not allow specifying precedence ... then recognizing even relatively simple arithmetic expressions can (perhaps recursively) involve several rules. George
[toc] | [prev] | [next] | [standalone]
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Date | 2023-02-21 20:54 +0200 |
| Message-ID | <23-02-056@comp.compilers> |
| In reply to | #3389 |
I want to expand on what George Neuner <gneuner2@comcast.net> wrote: > If you are restricted to BNF ... i.e. your tool does not allow > specifying precedence ... then recognizing even relatively simple > arithmetic expressions can (perhaps recursively) involve several > rules. Without precedence specifications, strict BNF does require nested rules to be created and named to implement the "natural" hierarchy of expression rules, e.g. that one does the multiplications before the additions. This was already recognized in 1973, when Steve Johnson et al, described how such specifications are implemented in yacc. See "Deterministic Parsing of Ambiguous Grammars". (https://dl.acm.org/doi/epdf/10.1145/360933.360969) Of course, the precedence parsing methods predate that paper, but it showed how one could easily annotate a BNF grammar to incorporate the idea. You can take the idea farther if you use numeric preferences for operators. That gives one a simple way of expressing the ordering of the operators and one "shifts" operators with higher or equal precedence and "reduces" when seeing an operator of lower precedence. (Note you can specify higher by either larger or smaller numbers. E.g. on a scale of 1 to 10, 1 can be highest, 1st place, or lowest, last place. One simply has to decide which way one wants it.) Similarly, in more recent times, the idea of ordering productions to specify precedence is used in Terence Parr's ANTLR4 to good effect. Again, the idea is not new. LEX used rule ordering to disambiguate which regular expression to use (and I suspect it was used before that). The same idea also is used in most Parsing Expression Grammar (PEG) implementations. Next, there was a paper in the 1980s or 90s whose title and author I have unfortunately forgotten, although I know it was published in TOPLAS, that suggested using examples to express preferences. And, that clearly goes back to the operators used in precedence parsing, e.g. ".<." which expressed which grouping had higher precedence and thus who should shift and who should reduce. But, no matter how one specifies it, the idea is the same. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------"
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web