Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Christopher F Clark
Newsgroups: comp.compilers
Subject: RE: How do you create a grammar for a multi-language language?
Date: Sun, 6 Mar 2022 15:37:13 +0200
Organization: Compilers Central
Lines: 90
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-010@comp.compilers>
References: <22-03-004@comp.compilers> <22-03-006@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="89298"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, syntax
Posted-Date: 06 Mar 2022 12:06:31 EST
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2915

At first glance, I misread the question and thought about using
inheritance to solve it. We used that in Yacc++ (the one from Compiler
Resources, not the one that comes with your C++ compiler) to solve
certain multi-language problems.

However, you are actually solving a "simpler" problem, and the standard
approach to that is to embed the second language as a "string" in the
outer language. Many languages (and their compilers/interpreters) do
that.

That's exactly what your XSLT case does. The XPath code is simply a
string in the XSLT language, and the XSLT language doesn't attempt to
parse it. It simply hands the code off to an XPath parser when it knows
the string is XPath code.

Now, the only problem with that has to do with nested strings. If your
inner language has strings, that probably gives you the
quoting/backslash problem. Many shells have this issue: to nest strings
one needs to add backslashes (and double the existing backslashes),
resulting in a terrible counting problem as to when one has enough (and
not too many) of them.
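
To put a number on that counting problem, here is a small Python sketch.
It is only an illustration under one assumed convention (each level
escapes both backslashes and double quotes with a backslash); the helper
names quote, nest, and longest_backslash_run are made up for this
example and do not come from any shell or tool.

```python
import re


def quote(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes and quotes.

    Assumed convention: a backslash escapes both '\\' and '"'.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def nest(text: str, levels: int) -> str:
    """Quote text `levels` times, as if embedding it in that many strings."""
    for _ in range(levels):
        text = quote(text)
    return text


def longest_backslash_run(text: str) -> int:
    """Length of the longest run of consecutive backslashes in text."""
    return max((len(run) for run in re.findall(r"\\+", text)), default=0)


if __name__ == "__main__":
    # Show how the backslashes pile up as the nesting gets deeper.
    for levels in range(1, 5):
        nested = nest('say "hi"', levels)
        print(levels, longest_backslash_run(nested), nested)
```

Under that convention a run of r backslashes in front of a quote becomes
2r + 1 at the next level, so the longest run grows as 1, 3, 7, 15, which
is why hand-counting them goes wrong so quickly.
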

    foo = "a \"nested\" string, worse one with an \\\"extra\\\" level of
    nesting, imagine doing that \\\\\\\"more than once\\\\\\\" and getting
    them to line up, recursive \\s are a pain"
    letsPlay = "whose newline \\n is it \n" // a game where the rules are
    made up and the formatting doesn't matter (or does it)

A common alternative is "raw strings", where the outer language has a
syntax whose opening delimiter says: within this string, ignore quotes,
backslashes, etc., and accept anything up to the matching closing
delimiter. The best form of these allows one to "tag" the delimiter
with a key that the closing delimiter must include and match. That way
you can use the key to keep the closing delimiter from being "reserved"
in the inner text. Something like this:

    foo = """xyzzy flk" \n \" \\\ js // none of these are special and used verbatim
    sdflkjs ssfs """ xlkjxlvj // not a closing """ because it doesn't have the key xyzzy
    sflsfj """xyzzy; // now we closed the string

Note that matching the outer delimiter plus the tag might be beyond the
capacity of your lexer, especially if it only does regular expressions.

A different alternative is to make "strings" that are special in your
language and switch grammars. For example, I like / as a string
delimiter for regular expressions. Thus /regex/ is a "string" in the
outer language: the outer language doesn't parse its contents, but it
knows that the string is a regular expression and should be handled as
one. Here is an example of that.

    foo = /ab.c*([abc]|xb+)*/;

We actually use a variant of that in Yacc++. We treat {code} as a
string in Yacc++. We don't parse the code. We send it to the C++ or C#
(et al.) compiler. However, we have modified the lexer (using an LR
rule rather than just a regular expression) so that we do a little bit
of parsing to allow nested { } pairs in the code and comments. That
takes a little work to do.
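
For readers without such a tool, the "little bit of parsing" can be
sketched by hand. The Python below is not how Yacc++ does it (that uses
an LR rule in the lexer); it is a hypothetical hand-written scanner,
with the made-up name scan_code_block, that finds the end of a {code}
block while ignoring braces inside C-style comments and simple string
or character literals.

```python
def scan_code_block(src: str, start: int) -> int:
    """Return the index just past the '}' matching the '{' at src[start].

    Braces inside comments and string/char literals are not counted.
    This is a sketch: it knows only basic C-style lexical rules.
    """
    if src[start] != "{":
        raise ValueError("expected '{' at start")
    depth = 0
    i = start
    n = len(src)
    while i < n:
        ch = src[i]
        if src.startswith("//", i):        # line comment: skip to end of line
            end = src.find("\n", i)
            i = n if end == -1 else end + 1
        elif src.startswith("/*", i):      # block comment: does not nest
            end = src.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated comment")
            i = end + 2
        elif ch in "\"'":                  # string or character literal
            i += 1
            while i < n and src[i] != ch:
                i += 2 if src[i] == "\\" else 1   # step over escaped chars
            if i >= n:
                raise ValueError("unterminated literal")
            i += 1                         # consume the closing quote
        elif ch == "{":
            depth += 1
            i += 1
        elif ch == "}":
            depth -= 1
            i += 1
            if depth == 0:
                return i                   # matched the opening brace
        else:
            i += 1
    raise ValueError("unbalanced braces")


if __name__ == "__main__":
    rule = "foo : bar { if (x) { y(); } /* a } here */ } bletch;"
    begin = rule.index("{")
    finish = scan_code_block(rule, begin)
    print(rule[begin:finish])   # the whole action, handed off unparsed
```

The outer lexer would return the slice from the opening brace to the
returned index as one "string" token and pass it along untouched.
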
In Yacc++, we made that easy by merging regular expressions and LR
rules into a unified model, but most lexer and parser generators don't
do that. (A shame in my opinion, but that's just *my* *opinion*.)

    foo : bar {
        if (x == y) { a = b; } /* a C style comment with a /* inside,
                                  C comments don't nest per spec */
        while (a < b) {
            a = a << 1; // C++ comment with a } inside
        } // end while
    } /* close code */
    bletch;

The disadvantage of doing that in the lexer is that we cannot
conveniently use { } pairs in the "outer language" syntax, at least not
without lexer states or the equivalent.

But notice that in all these methods we don't actually do (much)
parsing of the string. We don't try to make "one" grammar out of the two
languages, because what's really hard to do (unless the languages are
quite similar, and especially unless they have the same tokenizing
rules) is to embed the second language as something other than some
kind of string. It can be done, but it can be a lot of work and
fragile.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------