| From | Kaz Kylheku <864-117-4973@kylheku.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Re: A simpler way to tokenize and parse? |
| Date | 2023-03-26 01:17 +0000 |
| Organization | A noiseless patient Spider |
| Message-ID | <23-03-023@comp.compilers> |
| References | <23-03-011@comp.compilers> |
On 2023-03-24, Roger L Costello <costello@mitre.org> wrote:
> Example of tokenizing/parsing using read:
>
> (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

You've not quite hit upon how it works, and I'd encourage you to keep exploring.

Read takes the seven characters (+ 3 4) and returns an object which stands for the same thing. When Lisp programmers discuss that object, they refer to it using the same notation (+ 3 4).

Actual copy-paste from a Lisp session:

  [1]> (read-from-string "(+ 3 4)")
  (+ 3 4) ;
  7

The second return value of read-from-string, 7, isn't the value of the expression; it's the position of the first character of the string which was not read. Our expression is seven characters long.

>
> The first expression (+ 3 4) is the concrete syntax.
> The middle expression (list `+ 3 4) is an s-expression. It is an intermediate
> representation.

"S-expression" actually refers to the character syntax. The object in memory is just an expression.

The reader in Lisps like Scheme and Common Lisp perpetrates no such embellishment. The symbol "list" and the quotation around the + will not appear from reading "(+ 3 4)". You get a three-element list, made up out of three cons cells (pair-like objects), whose elements are strictly those that are implied by the read syntax: the + symbol and the two numbers.

> The last expression (add (num 3) (num 4)) is the abstract syntax.

No such thing is user-visible in any mainstream Lisp. Lisp interpreters directly evaluate the (+ 3 4) object. Lisp compilers potentially build some annotated syntax tree, but this is not a documented feature of any Lisp that I know; it will be an internal matter. Compiling the raw (+ 3 4) form is perfectly possible.

>
> The book says: read is one of the great ideas of computer science. It helps
> decompose a fundamentally difficult process - generalized parsing of the input
> stream - into two simple processes:
>
> (1) reading the input stream into an intermediate representation
> (2) parsing that intermediate representation

The bigger idea in Lisp is actually "print-read consistency": that objects have a printed notation that the machine can produce, which the machine can read to reproduce a similar object.

Not all objects have print-read consistency in Lisp, but things are usually strict in the mature Lisp dialects. If something doesn't have print-read consistency, it will print in an unreadable form that generates an error when read back. In Common Lisp, the character sequence #< (sharpsign less-than), in the standard readtable, signals an error when the reader encounters it. Objects which don't have a printed notation that can be read can use that syntax, e.g. #<socket-handle 10.1.2.3:8080>.

> I've read several compiler books and none of them talked about this. They talk
> about creating a lexer to generate a stream of tokens and a parser that
> receives the tokens and arranges them into a tree data structure. Why no
> mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of
> the great ideas of computer science"?

It's because we are not in a branch of the parallel universe in which a lot of people know about and program in Lisp. The Lisp microcosm has a lot to say on many topics, but computing is largely ignorant of it.

> I have done some work with Flex and Bison and recently I've done some work
> with building parsers using read. My experience is the latter is much easier.
> Why isn't read more widely discussed and used in the compiler community?
> Surely the concept that read embodies is not specific to Lisp and Scheme,
> right?

S-expressions do crop up outside of Lisp. The IMAP4 protocol uses them. The GNU C compiler uses a form of S-expression internally. Look up RTL:

https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL

The Rational Rose object design tool stores files in an S-expression format called Petal.

-- 
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
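
For readers who want to try the reader idea outside Lisp, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any Lisp implements read: the names Symbol, read_from_string, print_to_string, evaluate and OPS are made up for this example, and only integers, symbols and parenthesized lists are handled. It tries to show three points from the post: the reader returns a plain nested list plus the position where it stopped, that list can be evaluated directly with no separate abstract syntax step, and printing the object yields text that reads back as an equal object.

```python
import operator
from functools import reduce


class Symbol(str):
    """Hypothetical stand-in for a Lisp symbol; it prints as its bare name."""


def _skip_whitespace(text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def read_from_string(text, start=0):
    """Read one expression; return (object, position of first unread char)."""
    pos = _skip_whitespace(text, start)
    if pos >= len(text):
        raise EOFError("nothing to read")
    if text[pos] == ")":
        raise SyntaxError("unexpected ')'")
    if text[pos] == "(":
        items = []
        pos += 1
        while True:
            pos = _skip_whitespace(text, pos)
            if pos >= len(text):
                raise EOFError("unterminated list")
            if text[pos] == ")":
                return items, pos + 1
            item, pos = read_from_string(text, pos)
            items.append(item)
    # Atom: runs until whitespace or a parenthesis.
    end = pos
    while end < len(text) and not text[end].isspace() and text[end] not in "()":
        end += 1
    token = text[pos:end]
    try:
        return int(token), end
    except ValueError:
        return Symbol(token), end


def print_to_string(obj):
    """Produce the printed notation that read_from_string accepts."""
    if isinstance(obj, list):
        return "(" + " ".join(print_to_string(x) for x in obj) + ")"
    return str(obj)


# Assumed toy operator table; a real Lisp looks up function bindings instead.
OPS = {"+": operator.add, "*": operator.mul}


def evaluate(obj):
    """Evaluate the object the reader returned, with no intermediate AST."""
    if isinstance(obj, int):
        return obj
    if isinstance(obj, list):
        op, *args = obj
        return reduce(OPS[op], (evaluate(a) for a in args))
    raise TypeError(f"cannot evaluate {obj!r}")


if __name__ == "__main__":
    obj, pos = read_from_string("(+ 3 4)")
    print(print_to_string(obj), pos)   # (+ 3 4) 7
    print(evaluate(obj))               # 7
    # Print-read round trip: the printed form reads back as an equal object.
    print(read_from_string(print_to_string(obj))[0] == obj)   # True
```

The second element of the returned pair plays the role of the second return value in the Lisp session above: for the seven-character input it is 7. Subclassing str for Symbol is only a shortcut to keep the sketch short; it means a symbol compares equal to a string with the same spelling, which a real Lisp symbol does not.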