Path: csiph.com!goblin3!goblin.stu.neva.ru!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Christopher F Clark Newsgroups: comp.compilers Subject: Reachability of DFA part Date: Mon, 23 Dec 2019 05:52:50 -0500 Organization: Compilers Central Lines: 126 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <19-12-022@comp.compilers> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="58894"; mail-complaints-to="abuse@iecc.com" Keywords: parse Posted-Date: 23 Dec 2019 21:52:40 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2403 Andy borucki.andrzej@gmail.com wrote: > Good definition would be {[^}]*} > Complexity of problem increases when comment ends with string > len >1, for example C: */ or Pascal *) > if we renaming : /->a *->b other->c > then bad definition will ab(a|b|b)*ba and good definition is > complicated: > ab(b|(a|c)*b*)*a (if I not make mistake) > Commments should maybe be defined in other way, especially > comments can be nested in Object Pascal. > Comment nesting can using stack or simply counter. I > see, in Pascal is using counter. > Difference: Pascal has two types of multiline comments { } and (* *) I'm not sure what kind of language you are trying to lex/parse, but the problems you are mentioning all have known solutions. You are generally right that {.*} is not good definition of comments. Your [^x] construct is a typical solution for single character terminators of comments. However, It does get more complicated if you have multiple character terminators. You end up having to make several [^x], [^y], [^xz], etc sets. Making them by hand is tedious, challenging, and error prone. We adopted a better solution in Yacc++, we have an @ symbol, that works something like “.”, but doesn’t match characters that will otherwise be matched in the state. Here are some examples of how it works. Comment: “//” @* “\n”; // @ here is simply [^\n] Except if you are in something like Unicode, and \n is a set like [\0^j^m] you can terminate a line with any of the three characters a null byte, a linefeed, or a carriage return. In which case, @ is [^\0^j^m], anything not in that set. Note the actual Unicode definition is even more complex but this is sufficient for this example. How about simple C style comments: Comment: “/*” (@* “*”)+ “/”; Here we find @ used in two states: In the first state, it matches [^*]. In the second it matches [^*/]. What if you want to count the newlines in your comment? Comment: “/*” ((“\n” { ++newline} | @)* “*”)+ “/”; Now @ matches [^\n*] and [^\n*/]. Or maybe you want to treat \* specially Comment: “/*” ((“\\*” | @)* “*”)+ “/”; You have to double backslashes as our grammar treats them like C/C++/C# does, but that’s just our notation, not something you have to follow. In this case @ matches [^\*] and [^\*/]. Oh, but maybe you want \\ special too. Comment: “/*” ((“\\*” \ “\\\\” | @)* “*”)+ “/”; Finally, nested comments: Comment: “/*” (Comment | @)* “*”)+ “/”; In this case @ matches [^*/] in both states, but there are still 2 states. Now, nested comments aren’t actually a regular language and we “parse” them in our lexer using an LR stack. And, of course, you can mix all those things. Most importantly, whatever you do, because of the semantics of @ (match any “character” not otherwise matched in the lexer state) you always get out the right definition. You don’t have to figure out the sets you need. The generator figures them out for you. In my not so humble opinion, having the generator do that for you is the right answer. It is the kind of thing where automation is better than humans. ---------- Similarly, your question about sets of characters (what you call fragments) is a known problem with multiple solutions. It is basically a packing problem and how you want your tables for your lexer compressed. It shouldn’t matter whether the sets are disjoint or sparse. Although, the techniques by which you compress them may vary depending upon those characteristics. If you have a “nice” set of fragments, you can use an upfront table to compress them. Map the incoming characters into the sets you need and use those sets as indexes into your lexing tables. If the differing states tend to need different mapping, you may want to have the “current lexer state” as an argument into the mapping table. You can also use the “yacc” technique of having a list of special cases and a default case when none of the special cases apply. There are several ways to use that. An actual list you search (or maybe binary search). A bit map where one bits indicate a special case applies. Etc. Note, that you can make this work either with simple characters (i.e. all characters are the same width, whatever that width maybe) or multi-byte style characters (ala UTF-8, where some characters are one length and others different lengths). All of that can be done before you hit the lexer, or in the lexer itself. Again, you shouldn’t be trying to do this by hand. It should be something that you have software (the generator) taking care of. -- ***************************************************************************** * Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris -----------------------------------------------------------------------------