From: Hans-Peter Diettrich
Newsgroups: comp.compilers
Subject: Re: Reachability of DFA part
Date: Sat, 21 Dec 2019 19:58:27 +0100
Organization: Compilers Central
Message-ID: <19-12-014@comp.compilers>
References: <19-12-008@comp.compilers> <19-12-009@comp.compilers> <19-12-012@comp.compilers>
Keywords: lex, Pascal

On 21.12.2019 at 10:15, Andy wrote:
> On Friday, 20 December 2019 at 17:41:04 UTC+1, Kaz Kylheku wrote:

> The machine ends up accepting all strings that begin with "ab", but
> "ba" is never used. This is similar to the definition of a comment in
> Pascal: a comment begins at { and ends at }. A careless definition is
> {*}, which marks the rest of the file as a comment.

"Pascal" is too general a term; it implies no particular implementation.
A Pascal lexer is typically handcrafted or generated by Coco/R, not
built with lex/yacc or regular expressions.

> A good definition would be {[^}]*}
> The complexity of the problem increases when a comment ends with a
> string of length > 1, for example C: */ or Pascal: *)

Digraphs are a problem in many old languages, including C. They are
intended as direct replacements for other characters, with the
conversion performed before or during tokenization.
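As a rough check of the quoted comment definitions, here is a minimal Python sketch. It assumes the quoted {*} means a greedy "any characters" match, written {.*} in Python's regex syntax. The pattern for the two-character terminator */ is a commonly used one and is not taken from this thread. The names CARELESS, GOOD, C_COMMENT and first_match are made up for illustration.

```python
import re

# Assumed reading of the quoted "{*}": greedy match of any characters.
CARELESS = re.compile(r"\{.*\}", re.DOTALL)
# The quoted "good" definition: no closing brace inside the comment body.
GOOD = re.compile(r"\{[^}]*\}")
# A commonly used pattern for the terminator "*/" (an assumption, not
# from the thread): runs of '*' may only be followed by '/' at the end.
C_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")


def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


SOURCE = "{ one } x := 1; { two }"
C_SOURCE = "/* a **/ b /* c */"

if __name__ == "__main__":
    print(first_match(CARELESS, SOURCE))   # swallows the code between comments
    print(first_match(GOOD, SOURCE))       # stops at the first closing brace
    print(first_match(C_COMMENT, C_SOURCE))
```

The careless pattern runs to the last } in the input, while the other two stop at the first terminator. A handcrafted lexer would get the same result with a simple loop.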
> If we rename: / -> a, * -> b, other -> c,
> then the bad definition will be ab(a|b|b)*ba and the good definition
> is complicated:
> ab(b|(a|c)*b*)*a (if I have not made a mistake)
>
> Comments should maybe be defined in another way, especially since
> comments can be nested in Object Pascal. Comment nesting can use a
> stack or simply a counter. I see that Pascal uses a counter.
> Difference: Pascal has two types of multiline comments, { } and (* *)

For digraphs see above. Nested comments are problematic only with
single-pass lexer generators. With multiple stages for character
substitution, whitespace etc., no problems are known with Pascal tokens.

> If we use a stack, the closing comment type must equal the last
> opened comment type; with a counter, only comments of the type that
> was opened first are counted, for example
> { { (* } *) }

This again is implementation specific. If you mean scannerless parsers
with embedded regular expressions, they have several problems with
traditional languages.

Traditional tokenization has minor known problems with whitespace
(including comments), numbers and identifiers, which have long been
discussed and solved. All other tokens are literals, which need no
sophisticated lexer. In Pascal/Delphi even the dot tokens '.', '..',
'...' don't cause problems, unlike C, where '..' is missing.

IMO a language should be constructed for easy compilation, with simple
terminal definitions for handcrafted lexers, or with a fully specified
conflict-free formal token grammar, the latter type either with a
separate or an embedded lexer. The first type is also human readable,
while the second type tends to result in write-only languages or
describes non-verbal grammars like those for DNA. What's a formal token
definition worth if it cannot be proven error free?

DoDi
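For reference, the counter-versus-stack distinction in the quoted example { { (* } *) } can be sketched in Python as below. This is only an illustration under assumptions: the function names are made up, a mismatched closer in the stack variant is treated as an error (returning None), and no particular Pascal compiler is claimed to work this way.

```python
# Map each comment opener to its closer.
OPENERS = {"{": "}", "(*": "*)"}


def tokens(text):
    """Yield (delimiter, position_after_it) for each comment delimiter."""
    pos = 0
    while pos < len(text):
        two = text[pos:pos + 2]
        if two in ("(*", "*)"):
            yield two, pos + 2
            pos += 2
        elif text[pos] in "{}":
            yield text[pos], pos + 1
            pos += 1
        else:
            pos += 1


def end_with_counter(text):
    """Counter strategy: only delimiters of the first opener's kind count."""
    depth = 0
    opener = closer = None
    for tok, nxt in tokens(text):
        if opener is None:
            if tok not in OPENERS:
                return None  # first delimiter found is a closer
            opener, closer = tok, OPENERS[tok]
            depth = 1
        elif tok == opener:
            depth += 1
        elif tok == closer:
            depth -= 1
            if depth == 0:
                return nxt  # index just past the end of the comment
        # delimiters of the other kind are ignored
    return None  # unterminated comment


def end_with_stack(text):
    """Stack strategy: each closer must match the most recent opener."""
    stack = []
    for tok, nxt in tokens(text):
        if tok in OPENERS:
            stack.append(OPENERS[tok])
        elif not stack or stack[-1] != tok:
            return None  # closer does not match the last opener
        else:
            stack.pop()
            if not stack:
                return nxt
    return None  # unterminated comment


EXAMPLE = "{ { (* } *) }"

if __name__ == "__main__":
    print(end_with_counter(EXAMPLE))  # counter accepts the whole example
    print(end_with_stack(EXAMPLE))    # stack rejects the crossed delimiters
```

With the counter, the (* and *) inside the braces are ordinary comment text, so the example is one well-formed comment. With the stack, the } that follows (* is a mismatch.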