From: luser droog <mijoryx@yahoo.com.dmarc.email>
Newsgroups: comp.compilers
Subject: Re: Supporting multiple input syntaxes
Date: Sun, 23 Aug 2020 19:35:00 -0700 (PDT)
Organization: Compilers Central
Lines: 81
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <20-08-014@comp.compilers>
References: <20-08-002@comp.compilers> <20-08-009@comp.compilers> <20-08-010@comp.compilers> <20-08-011@comp.compilers> <20-08-012@comp.compilers>
Keywords: C, parse, comment
Posted-Date: 23 Aug 2020 23:15:04 EDT

On Sunday, August 23, 2020 at 1:39:30 PM UTC-5, luser droog wrote:
> On Sunday, August 16, 2020 at 10:53:24 AM UTC-5, davidl...@gmail.com wrote:
> > My friend, reporting the furthest position examined by the parser I have [found]
> > useful in error cases as a simple stop gap when using a combinator approach.
> >
> > Thinking about it you kind of want to see the furthest failed position and the
> > stack of rules above it. Such requires meta information when the code is
> > written in the most natural way. For this reason and others I believe it is
> > good to represent your grammar in data structures which is further in the
> > direction of a compiler compiler tool (or compiler interpreter tool).
>
> Thanks. I've done some further investigating. I built my parsers following
> two papers. Hutton and Meijer, Monadic Parser Combinators
>   https://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf
> and Hutton, Higher-Order Functions for Parsing
>   https://pdfs.semanticscholar.org/6669/f223fba59edaeed7fabe02b667809a5744d9.pdf
>
> The first adds error reporting using Monad Transformers. [...]
> But the second paper does it differently, and maybe something I can do
> more easily. It redefines the parsers to no longer produce a list of results,
> so there's no longer support for ambiguity. Then it defines them to
> return a Maybe,
>
>   maybe * ::= Fail [char] | Error [char] | OK *
> .
> where the OK branch has the parse tree, and Fail or Error both contain an error
> message. It describes how a Fail can be transformed into an Error. But it isn't
> entirely clear where the messages get injected.
>
> Still need to do some thinking on it, but I think I can rewrite the parsers
> to follow this model, and then decorate my grammar with possible errors
> at each node.

I've made some progress. I wrote a new prototype following Hutton and
modified it to add position information to the character stream.
Then I rewrote the parsers to produce the maybe structure and
to collect rudimentary error messages.

For these two test cases in PostScript, one positive and one negative,

0 0 (abcd\ne) input-pos
(abc) str exec
pc
0 0 (abcd\ne) input-pos
(abd) str nofail exec
pq

I get this output:

$ gsnd -q -dNOSAFER pc11.ps
stack:
[/OK [[(a) [(b) [(c) []]]] [[(d) [0 3]] [[(\n) [0 4]] [[(e) [1 0]] null]]]]]
stack:
[/Error [[(after) (a)] [[(after) (b)] [[{(d) eq} (not satisfied)] [[(c) [0 2]] [[(d) [0 3]] [[(\n) [0 4]] [[(e) [1 0]] null]]]]]]]]

So, this indicates that I *can* modify my C parsers to produce error
messages. The remaining input list carries the position where the error
occurred as its first element, [(c) [0 2]], i.e. line 0, column 2
(both zero-based).
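
The three-way result and the Fail-to-Error conversion can be sketched in
Python. This is a minimal illustration and not the PostScript prototype or
the C code: the names ok, fail, error, sat, char, seq, alt and string are
hypothetical, results are modelled as (tag, payload, remaining input)
tuples, and the "after ..." message format only imitates the output above.
It also assumes, following the usual reading of Hutton's paper, that an
alternative recovers from Fail but not from Error.

```python
# Results are tuples: (tag, payload, remaining_input).
# Input is a list of (char, (line, col)) pairs.
def ok(value, rest):
    return ("OK", value, rest)

def fail(messages, rest):
    return ("Fail", messages, rest)

def sat(pred, desc):
    """Accept one character for which pred holds; otherwise Fail."""
    def parse(inp):
        if inp and pred(inp[0][0]):
            return ok(inp[0][0], inp[1:])
        return fail([desc + " not satisfied"], inp)
    return parse

def char(c):
    return sat(lambda x: x == c, repr(c))

def seq(p, q):
    """Run p then q; on a failure in q, record what was parsed before it."""
    def parse(inp):
        r = p(inp)
        if r[0] != "OK":
            return r
        r2 = q(r[2])
        if r2[0] != "OK":
            return (r2[0], ["after " + repr(r[1])] + r2[1], r2[2])
        return ok([r[1], r2[1]], r2[2])
    return parse

def alt(p, q):
    """Try q only when p gives a recoverable Fail; an Error propagates."""
    def parse(inp):
        r = p(inp)
        return q(inp) if r[0] == "Fail" else r
    return parse

def nofail(p):
    """Turn a recoverable Fail from p into a committed Error."""
    def parse(inp):
        r = p(inp)
        return ("Error",) + r[1:] if r[0] == "Fail" else r
    return parse

def string(s):
    """Match the non-empty string s one character at a time."""
    parser = char(s[-1])
    for c in reversed(s[:-1]):
        parser = seq(char(c), parser)
    return parser

inp = [(ch, (0, col)) for col, ch in enumerate("abcd")]
print(string("abc")(inp))
print(nofail(string("abd"))(inp))
```

In this sketch the failing result keeps the unconsumed input, so the
position of the first remaining element is the error position, just as
with [(c) [0 2]] above.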

Following the prototype, I modified the input functions to add positions
for each character and modified the base parser item() to detect and
remove the position stuff before it passes into the rest of the machinery.
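
A rough Python sketch of that arrangement, under stated assumptions: the
function names mirror input-pos and item() from the text, but the
signatures, the tuple layout, and the error message are made up for
illustration and are not taken from the C or PostScript sources.

```python
def input_pos(text, line=0, col=0):
    """Pair each character with its zero-based (line, col) position."""
    out = []
    for ch in text:
        out.append((ch, (line, col)))
        if ch == "\n":
            line, col = line + 1, 0   # next character starts a new line
        else:
            col += 1
    return out

def item(inp):
    """Base parser: consume one element and hand on only the bare character.

    The position stays attached to the remaining input, so the rest of the
    parsing machinery never has to know about it.
    """
    if not inp:
        return ("Fail", ["unexpected end of input"], inp)
    ch, _pos = inp[0]
    return ("OK", ch, inp[1:])

def position(inp):
    """Position of the next unconsumed character, or None at end of input."""
    return inp[0][1] if inp else None

stream = input_pos("abcd\ne")
print(item(stream))
```

Only item() looks inside the (character, position) pairs; parsers built on
top of it see plain characters, and position() is consulted only when an
error has to be reported.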

The next hurdle was making the extra position information work with
the Unicode filters ucs4_from_utf8() and utf8_from_ucs4(). And that
all appears to be working now.
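
One way a decoding filter can carry positions through, sketched in Python.
How the C filters actually do this is not shown here, so everything below
is an assumption: positions are attached to bytes before decoding, each
decoded code point keeps the position of its lead byte, and the name
bytes_pos is hypothetical. The decoder checks sequence shape only; it does
not reject overlong encodings or surrogates.

```python
def bytes_pos(data):
    """Pair each byte with its zero-based (line, byte column) position."""
    out, line, col = [], 0, 0
    for b in data:
        out.append((b, (line, col)))
        if b == 0x0A:
            line, col = line + 1, 0
        else:
            col += 1
    return out

def ucs4_from_utf8(items):
    """Decode (byte, pos) pairs into (code_point, pos) pairs."""
    out = []
    i = 0
    while i < len(items):
        lead, pos = items[i]
        if lead < 0x80:                 # 1-byte sequence (ASCII)
            n, cp = 0, lead
        elif lead >> 5 == 0b110:        # 2-byte sequence
            n, cp = 1, lead & 0x1F
        elif lead >> 4 == 0b1110:       # 3-byte sequence
            n, cp = 2, lead & 0x0F
        elif lead >> 3 == 0b11110:      # 4-byte sequence
            n, cp = 3, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at {pos}")
        tail = items[i + 1 : i + 1 + n]
        if len(tail) != n or any(b >> 6 != 0b10 for b, _ in tail):
            raise ValueError(f"malformed sequence at {pos}")
        for b, _ in tail:
            cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits per byte
        out.append((cp, pos))            # keep the lead byte's position
        i += 1 + n
    return out

decoded = ucs4_from_utf8(bytes_pos("a\u00e9\u20ac".encode("utf-8")))
print(decoded)
```

With this layout the columns count bytes rather than characters; counting
characters instead would mean assigning positions after decoding.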

But that's probably the end of the story for now. Got to gear up for
Operating Systems and Advanced Web Dev with Java.

Thanks to everyone for the help, esp. Kaz with the brilliant suggestion
to pass a language id token between tokenizer and parser.


P.S. The prototype is written in PostScript extended with function syntax.
https://github.com/luser-dr00g/pcomb/blob/master/ps/pc11.ps
https://codereview.stackexchange.com/questions/193520/an-enhanced-syntax-for-defining-functions-in-postscript

--
l droog
[Why PostScript?  I realize it's Turing complete, but it seems odd to run one's parser on a printer. -John]