Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2402
| Path | csiph.com!3.us.feeder.erje.net!feeder.erje.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end |
|---|---|
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
| Newsgroups | comp.compilers |
| Subject | How make multifinished DFA for merged regexps? |
| Date | Mon, 23 Dec 2019 05:40:43 -0500 |
| Organization | Compilers Central |
| Lines | 59 |
| Sender | news@iecc.com |
| Approved | comp.compilers@iecc.com |
| Message-ID | <19-12-021@comp.compilers> (permalink) |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset="UTF-8" |
| Content-Transfer-Encoding | 8bit |
| Injection-Info | gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="26633"; mail-complaints-to="abuse@iecc.com" |
| Keywords | DFA, parse |
| Posted-Date | 23 Dec 2019 18:29:13 EST |
| X-submission-address | compilers@iecc.com |
| X-moderator-address | compilers-request@iecc.com |
| X-FAQ-and-archives | http://compilers.iecc.com |
| Xref | csiph.com comp.compilers:2402 |
Show key headers only | View raw
Andy borucki.andrzej@gmail.com wrote: > Greedy algorithms match longest regexp. For example operators "+" and "++", > int numbers "123" and float numbers "123.456e3". > On '.' will finish state of number, but we will inside automata for float > number. But can be errors: after '.' will 'a'. We must backtrack to last > finished state? I want avoid backtracking. Maybe after backtracking we must > read chars from auxiliary token buffer instead of stream up to previous > position? But this complicated parsing. 1) Backtracking is one solution. Using a character buffer (what you called a token buffer) rather than reading the raw input can be used with or without backtracking. Using a character buffer also tends to be efficient as you only call the I/O routines infrequently in that case. Most lexers (such as flex) have used character buffers to speed up their lexing significantly. The only problem is if you mix lexing with reading from the input stream directly, but that is a bad idea in itself. 2) Yet, another solution is to find the cases where errors are caused and use them to make rules that return multiple tokens. E.g. for the case “123.a” make a rule for “123.” That returns two tokens “123” as an int and “.” as a dot (presuming that is a token in its own right). This works well as long as there aren’t infinite strings of characters that toggle back and forth as to what is legal and you can make a finite set of rules that disambiguate what you are lexing. This is not an easy problem to solve. A fictitious language with tokens “a”, “ab”, “b”, and “ba” with longest match rule would find the string “aba” ambiguous as to whether you meant “ab” “a” or “a” “ba” as your two tokens and it is possible only the parser will know which you wanted, or even worse it can be ambiguous even at the parsing level. Generally, people give up on solving the problem. Moreover, in practice, because such language fragments are problematic. They apply the left-most longest rule and declare things like “123.a” to be an error. It is actually a pretty rare problem, because usually the token sequence for “123” “.” and “a” as 3 tokens isn’t legal to the parser. 3) Finally, you can define your tokens to be only unique substrings and put them together in the parser. E.g. “123” is always an “int” token and “.” Is always a “dot” token and there is no “float” token, just a float non-terminal that is “int dot int etc.”. Now, that is also not a panacea. Especially if you have your lexer doing things like throwing away whitespace and comments. On the other hand, I have seen languages that allowed whitespace (and even occasionally comments) inside of tokens. That has more to do with your language specification/design. -- ***************************************************************************** * Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris
Back to comp.compilers | Previous | Next | Find similar
How make multifinished DFA for merged regexps? Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2019-12-23 05:40 -0500
csiph-web