From: Christopher F Clark
Newsgroups: comp.compilers
Subject: Re: What stage should entities be resolved?
Date: Sat, 12 Mar 2022 14:11:21 +0200
Organization: Compilers Central
Message-ID: <22-03-028@comp.compilers>
References: <22-03-019@comp.compilers> <22-03-025@comp.compilers>
Keywords: parse, design

Contrary to what one might assume from my previous posting on this topic, I agree with Dodi. Sometimes the right answer is another phase.

To keep your lexer simple, it can be useful to have a separate phase that deals with "character" issues, whether that is transforming UTF-8 multi-byte sequences into unique code points (or into actual characters representing glyphs, possibly accented, i.e. resolving the combining code points into canonical versions) or turning sequences like &amp; or \n or whatever into single tokens (or characters). That *can* make the whole process simpler and faster.

For example, years ago, when I was working on a C compiler for Honeywell and the first ANSI standard was still new, the standard had 8 stages (if I recall correctly) that described the lexing process. We decided that the best way to assure faithfulness to the standard was to implement the 8 stages exactly as specified, at least in the first version. That way we had a reliable model of the desired behavior that we could trace back to the standard.
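To make the idea of a separate "character" phase concrete, here is a minimal sketch in Python. It is not the Honeywell compiler's design, only an illustration: the entity table, the function names, and the toy lexer are all hypothetical. Each phase is a plain function from text to text, and the lexer only ever sees the output of the phases that were run.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny entity table (an assumption for this sketch).
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}


def compose_characters(text):
    """Character phase: fold combining code points into canonical (NFC) form."""
    return unicodedata.normalize("NFC", text)


def resolve_entities(text):
    """Character phase: replace known &name; sequences with one character.

    Unknown names are left untouched so a later phase can report them.
    """
    return re.sub(
        r"&([A-Za-z]+);",
        lambda m: ENTITIES.get(m.group(1), m.group(0)),
        text,
    )


# The phases are separate pieces of code, applied in order.
CHARACTER_PHASES = [compose_characters, resolve_entities]


def run_character_phases(text, phases):
    """Apply each phase in turn; an empty list leaves the text unchanged."""
    for phase in phases:
        text = phase(text)
    return text


def lex(text):
    """Toy lexer: words and single punctuation characters, nothing else."""
    return re.findall(r"\w+|[^\w\s]", text)


def scan(source, phases=CHARACTER_PHASES):
    """Run the character phases first, then hand the result to the lexer."""
    return lex(run_character_phases(source, phases))
```

With both phases enabled, scan("a &lt; cafe\u0301") hands the lexer the text "a < café" with the accent already composed, so the lexer needs no knowledge of entities or combining marks. Passing a shorter phases list skips a phase without touching the lexer.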
Moreover, by having the stages as separate pieces of code, it was easy to turn them off (e.g. trigraphs in C were an ANSI invention, and some C programs used ??? not as a trigraph but as a way of emphasis). Similarly, some pre-ANSI C dialects supported nested comments, and you might want to change that phase.

While you do want each phase to generally build larger and larger structures (i.e. you don't want your parser dealing with strings as individual characters very often), the exact number of phases and the content of each phase can vary slightly. One size rarely fits all.

-- 
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA        twitter: @intel_chris
------------------------------------------------------------------------------