Path: csiph.com!goblin3!goblin.stu.neva.ru!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: Reachability of DFA part
Date: Mon, 23 Dec 2019 05:52:50 -0500
Organization: Compilers Central
Lines: 126
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <19-12-022@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="58894"; mail-complaints-to="abuse@iecc.com"
Keywords: parse
Posted-Date: 23 Dec 2019 21:52:40 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2403

Andy borucki.andrzej@gmail.com wrote:

> Good definition would be {[^}]*}
> Complexity of problem increases when comment ends with string
> len >1, for example C: */ or Pascal *)
> if we renaming : /->a *->b other->c
> then bad definition will ab(a|b|b)*ba and good definition is
> complicated:
> ab(b|(a|c)*b*)*a (if I not make mistake)

> Commments should maybe be defined in other way, especially
> comments can be nested in Object Pascal.
> Comment nesting can using stack or simply counter. I
> see, in Pascal is using counter.
> Difference: Pascal has two types of multiline comments { } and (* *)

I'm not sure what kind of language you are trying to lex/parse, but
the problems you are mentioning all have known solutions.

You are generally right that {.*} is not good definition of comments.
Your [^x] construct is a typical solution for single character
terminators of comments.  However, It does get more complicated if you
have multiple character terminators.  You end up having to make
several [^x], [^y], [^xz], etc sets.  Making them by hand is tedious,
challenging, and error prone.

We adopted a better solution in Yacc++, we have an @ symbol, that
works something like “.”, but doesn’t match characters that will
otherwise be matched in the state.  Here are some examples of how it
works.

Comment: “//” @* “\n”;    // @ here is simply [^\n]

 Except if you are in something like Unicode, and \n is a set like
[\0^j^m] you can terminate a line with any of the three characters a
null byte, a linefeed, or a carriage return.  In which case, @ is
[^\0^j^m], anything not in that set.  Note the actual Unicode
definition is even more complex but this is sufficient for this
example.

How about simple C style comments:

Comment: “/*” (@*  “*”)+ “/”;

 Here we find @ used in two states:  In the first state, it matches
[^*]. In the second it matches [^*/].

 What if you want to count the newlines in your comment?

 Comment: “/*” ((“\n” { ++newline} | @)*  “*”)+ “/”;

Now @ matches [^\n*] and [^\n*/].

Or maybe you want to treat \* specially

Comment: “/*” ((“\\*” | @)*  “*”)+ “/”;

You have to double backslashes as our grammar treats them like
C/C++/C# does, but that’s just our notation, not something you have to
follow.

In this case @ matches [^\*] and [^\*/].

Oh, but maybe you want \\ special too.

Comment: “/*” ((“\\*” \ “\\\\” | @)*  “*”)+ “/”;

Finally, nested comments:

Comment: “/*” (Comment | @)*  “*”)+ “/”;

In this case @ matches [^*/] in both states, but there are still 2 states.
Now, nested comments aren’t actually a regular language and we “parse”
them in our lexer using an LR stack.

And, of course, you can mix all those things.  Most importantly,
whatever you do, because of the semantics of @ (match any “character”
not otherwise matched in the lexer state) you always get out the right
definition.  You don’t have to figure out the sets you need.  The
generator figures them out for you.

In my not so humble opinion, having the generator do that for you is
the right answer.  It is the kind of thing where automation is better
than humans.

----------

Similarly, your question about sets of characters (what you call
fragments) is a known problem with multiple solutions.   It is
basically a packing problem and how you want your tables for your
lexer compressed.  It shouldn’t matter whether the sets are disjoint
or sparse.    Although, the techniques by which you compress them may
vary depending upon those characteristics.

If you have a “nice” set of fragments, you can use an upfront table to
compress them.  Map the incoming characters into the sets you need and
use those sets as indexes into your lexing tables.

 If the differing states tend to need different mapping, you may want
to have the “current lexer state” as an argument into the mapping
table.

 You can also use the “yacc” technique of having a list of special
cases and a default case when none of the special cases apply. There
are several ways to use that.  An actual list you search (or maybe
binary search).  A bit map where one bits indicate a special case
applies.  Etc.

Note, that you can make this work either with simple characters (i.e.
all characters are the same width, whatever that width maybe) or
multi-byte style characters (ala UTF-8, where some characters are one
length and others different lengths).    All of that can be done
before you hit the lexer, or in the lexer itself.

 Again, you shouldn’t be trying to do this by hand.  It should be
something that you have software (the generator) taking care of.

--
*****************************************************************************
*
Chris Clark                  email:
christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
-----------------------------------------------------------------------------