Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Subject: RE: How do you create a grammar for a multi-language language?
Date: Sun, 6 Mar 2022 15:37:13 +0200
Organization: Compilers Central
Lines: 90
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-010@comp.compilers>
References: <22-03-004@comp.compilers> <22-03-006@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="89298"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, syntax
Posted-Date: 06 Mar 2022 12:06:31 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2915

At first glance, I misread the question and thought about using
inheritance to do so.  We used that in Yacc++ (the one from Compiler
Resources, not the one that comes with your C++ compiler) to solve
certain multi-language problems.

However, you are actually solving a "simpler" problem.  And, the
standard approach to that is to embed the second language as a
"string" in the outer language.  Many languages (and their
compilers/interpreters) do that.  That's exactly what your XSLT case
does.  The XPath code is simply a string in the XSLT language, and the
XSLT language doesn't attempt to parse it.  It simply hands the code
off to an XPath parser when it knows the string is XPath code.
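
Here is a minimal sketch of that hand-off in Python, using only the
standard library.  The outer parser (xml.etree.ElementTree) sees the
select attribute as an opaque string; only afterwards is that string
given to something that understands paths.  ElementTree's findall
accepts just a limited XPath subset, so it merely stands in for a real
XPath parser here.  The stylesheet, the data, and the helper name
select_strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet: the outer (XML) parser never looks inside select="...".
STYLESHEET = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:for-each select="./book/title"/>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical input document for the path expression to run against.
DATA = (
    "<library>"
    "<book><title>Dragon Book</title></book>"
    "<book><title>Tiger Book</title></book>"
    "</library>"
)

def select_strings(stylesheet_text):
    # Outer parse only: collect every select attribute as an uninterpreted string.
    root = ET.fromstring(stylesheet_text)
    return [el.attrib["select"] for el in root.iter() if "select" in el.attrib]

paths = select_strings(STYLESHEET)

# Inner parse: only now is the string handed to code that understands paths.
library = ET.fromstring(DATA)
titles = [t.text for t in library.findall(paths[0])]
print(paths, titles)
```

The point of the split is that select_strings would work unchanged no
matter what syntax the inner language used.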

Now, the only problem with that has to do with nested strings.  If
your inner language has strings, that probably gives you the
quoting/backslash problem.  Many shells have this issue where to nest
strings one needs to add backslashes (and double existing backslashes)
resulting in a terrible counting problem as to when one has enough
(and not too many) of them.

foo = "a \"nested\" string, worse one with an \\\"extra\\\" level of
nesting, imagine doing that \\\\\"more than once\\\\\" and getting
them to line up, recursive \\s are a pain"
letsPlay = "whose newline \\n is it \n" // a game where the rules are
made up and the formatting doesn't matter (or does it)
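
To make the counting problem concrete, here is a small Python sketch.
It assumes the common convention that quoting a string means doubling
each backslash and putting a backslash in front of each double quote;
the function names are mine and purely illustrative.

```python
def quote(text):
    # Escape backslashes first, then quotes, then wrap in double quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def nest(text, levels):
    # Quote the text `levels` times, as if it passed through that many layers.
    for _ in range(levels):
        text = quote(text)
    return text

def backslashes_before_inner_quote(literal):
    # Count the run of backslashes directly in front of the innermost quote,
    # i.e. the quote character that immediately precedes the payload "x".
    quote_pos = literal.index("x") - 1
    count = 0
    while quote_pos - count - 1 >= 0 and literal[quote_pos - count - 1] == "\\":
        count += 1
    return count

counts = [backslashes_before_inner_quote(nest("x", n)) for n in range(1, 5)]
print(counts)
```

Under that convention the innermost quote needs 0, 1, 3 and then 7
backslashes for one to four levels, that is 2^(n-1) - 1, which is why
nobody gets it right by eye.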

An alternative to that is often "raw strings", where the outer language
has a syntax whose opening delimiter says: within this string, ignore
quotes, backslashes, etc., and accept anything up to the matching
closing delimiter.  The best form of these allows one to "tag" the
delimiter with a key that the closing delimiter must include and
match.  That way you can use the key to keep the closing delimiter
from being "reserved" in the inner text.  Something like this:

foo = """xyzzy
flk" \n \" \\\ js // none of these are special and used verbatim
sdflkjs
  ssfs """ xlkjxlvj  // not a closing """ because it doesn't have
the key xyzzy
sflsfj
"""xyzzy; // now we closed the string

Note that matching the outer delimiter plus the tag might be beyond
the capacity of your lexer, especially if it only does regular
expressions.
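
A hand-written scanner for this is short, though.  The Python sketch
below assumes the string opens with """ followed by a tag that runs to
the end of the line, and closes at the first """ followed by the same
tag.  The function name and the exact delimiter rules are my
assumptions for illustration, not any particular language's syntax.

```python
def scan_tagged_raw_string(src, start=0):
    # Return (body, index just past the closing delimiter).
    opener = '"""'
    if not src.startswith(opener, start):
        raise ValueError("no raw string at this position")
    tag_start = start + len(opener)
    tag_end = src.find("\n", tag_start)
    if tag_end == -1:
        raise ValueError("missing newline after the tag")
    tag = src[tag_start:tag_end]
    # The closing delimiter is built at run time from the tag just read,
    # which is the part a fixed regular expression cannot express.
    closer = opener + tag
    close_at = src.find(closer, tag_end + 1)
    if close_at == -1:
        raise ValueError("unterminated raw string")
    return src[tag_end + 1:close_at], close_at + len(closer)

SAMPLE = (
    '"""xyzzy\n'
    r'flk" \n \" \\\ js' "\n"   # backslashes and quotes are kept verbatim
    '  ssfs """ xlkjxlvj\n'     # a bare """ without the tag does not close
    '"""xyzzy; rest'
)

body, end = scan_tagged_raw_string(SAMPLE)
print(repr(body))
print(repr(SAMPLE[end:]))
```

Nothing inside the body is interpreted; the scanner only searches for
the one closing sequence it computed from the opening line.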

A different alternative is to make "strings" that are special in your
language and switch grammars.  For example, I like / as a string
delimiter for regular expressions.  Thus /regex/ is a "string" in the
outer language.  The outer language doesn't parse its contents, but it
does know that the string is a regular expression and should be
handled as one.  Here is an example of that.

foo = /ab.c*([abc]|xb+)*/;
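
As a rough Python sketch of that idea: the toy lexer below copies the
text between the slashes verbatim, labels the token REGEX, and leaves
it to Python's re module to parse the contents.  The token names and
lex_assignment are invented for the example, and it assumes / is not
also used as a division operator in the outer language (a language
that has both needs context to tell them apart).

```python
import re

def lex_assignment(src):
    # Tokenize something like `name = /regex/;` into (kind, text) pairs.
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == "/":
            # Scan to the closing slash; a backslash protects the next character.
            j = i + 1
            while j < len(src) and src[j] != "/":
                j += 2 if src[j] == "\\" else 1
            if j >= len(src):
                raise ValueError("unterminated regex literal")
            tokens.append(("REGEX", src[i + 1:j]))  # contents are not parsed here
            i = j + 1
        elif ch.isalnum() or ch == "_":
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(("NAME", src[i:j]))
            i = j
        else:
            tokens.append(("PUNCT", ch))
            i += 1
    return tokens

tokens = lex_assignment("foo = /ab.c*([abc]|xb+)*/;")

# The token kind tells us which inner parser gets the text.
pattern = re.compile(next(text for kind, text in tokens if kind == "REGEX"))
print(tokens, bool(pattern.fullmatch("abzaxbb")))
```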

We actually use a variant of that in Yacc++.  We treat {code} as a
string in Yacc++.  We don't parse the code.  We send it to the C++ or
C# (et al) compiler.  However, we have modified the lexer (using an LR
rule rather than just a regular expression) so that we do a little bit
of parsing to allow nested { } pairs in the code and comments.  That
takes a little work to do.  In Yacc++, we made that easy by merging
regular expressions and LR rules into a unified model, but most lexer
and parser generators don't do that.  (A shame in my opinion, but
that's just *my* *opinion*.)

foo : bar { if (x == y) { a = b } /* a C style comment with a /*
inside, C comments don't nest per spec */
   while (a < b) { a = a << 1; // C++ comment with a } inside
       } // end while
} /* close code */ bletch;
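
For generators without that unified model, the usual fallback is a
small hand-written scanner.  The Python sketch below is my own
illustration, not the Yacc++ implementation: it counts braces and
skips C comments, C++ comments, and quoted literals so that braces
inside them are ignored.  The name scan_code_block is hypothetical.

```python
def scan_code_block(src, start):
    # Return the index just past the } that matches the { at src[start].
    if src[start] != "{":
        raise ValueError("expected {")
    depth, i, n = 0, start, len(src)
    while i < n:
        if src.startswith("/*", i):
            # C comments do not nest: the first */ ends the comment.
            end = src.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated comment")
            i = end + 2
        elif src.startswith("//", i):
            # C++ comment: skip to the end of the line.
            end = src.find("\n", i)
            i = n if end == -1 else end + 1
        elif src[i] in "\"'":
            # String or character literal: skip it, honoring backslash escapes.
            quote = src[i]
            i += 1
            while i < n and src[i] != quote:
                i += 2 if src[i] == "\\" else 1
            i += 1
        elif src[i] == "{":
            depth += 1
            i += 1
        elif src[i] == "}":
            depth -= 1
            i += 1
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unbalanced braces")

RULE = """foo : bar { if (x == y) { a = b } /* a C style comment with a /*
inside, C comments don't nest per spec */
   while (a < b) { a = a << 1; // C++ comment with a } inside
       } // end while
} /* close code */ bletch;
"""

begin = RULE.index("{")
end = scan_code_block(RULE, begin)
code = RULE[begin:end]
print(repr(RULE[end:]))
```

Everything in code would go to the target-language compiler untouched;
the outer parser resumes at the text after it.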

The disadvantage of doing that in the lexer is that we cannot
conveniently use { } pairs in the "outer language" syntax.  Not
without lexer states or the equivalent.

But notice that in all these methods we don't actually do (much)
parsing of the string.  We don't try to make "one" grammar out of the
two languages, because what's really hard to do (unless the languages
are quite similar, and especially have the same tokenizing rules) is
to embed the second language as something other than some kind of
string.  It can be done, but it can be a lot of work, and the result
can be fragile.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------