Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #3413 > unrolled thread
| Started by | Roger L Costello <costello@mitre.org> |
|---|---|
| First post | 2023-03-24 14:45 +0000 |
| Last post | 2023-03-26 01:17 +0000 |
| Articles | 10 — 6 participants |
Back to article view | Back to comp.compilers
A simpler way to tokenize and parse? Roger L Costello <costello@mitre.org> - 2023-03-24 14:45 +0000
Re: Lisp syntax, was A simpler way to tokenize and parse? Spiros Bousbouras <spibou@gmail.com> - 2023-03-25 11:55 +0000
Re: Lisp syntax, was A simpler way to tokenize and parse? gah4 <gah4@u.washington.edu> - 2023-03-25 14:32 -0700
Re: Lisp syntax, was A simpler way to tokenize and parse? anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2023-03-25 13:14 +0000
Re: Lisp syntax, was A simpler way to tokenize and parse? Kaz Kylheku <864-117-4973@kylheku.com> - 2023-03-26 00:46 +0000
Re: A simpler way to tokenize and parse? Lieven Marchand <mal@wyrd.be> - 2023-03-25 19:58 +0100
Re: A simpler way to tokenize and parse? Spiros Bousbouras <spibou@gmail.com> - 2023-03-26 14:10 +0000
Re: A simpler way to tokenize and parse? Kaz Kylheku <864-117-4973@kylheku.com> - 2023-03-26 18:19 +0000
Re: Lisp syntax, A simpler way to tokenize and parse? Lieven Marchand <mal@wyrd.be> - 2023-03-27 23:15 +0200
Re: A simpler way to tokenize and parse? Kaz Kylheku <864-117-4973@kylheku.com> - 2023-03-26 01:17 +0000
| From | Roger L Costello <costello@mitre.org> |
|---|---|
| Date | 2023-03-24 14:45 +0000 |
| Subject | A simpler way to tokenize and parse? |
| Message-ID | <23-03-011@comp.compilers> |
Hello Compiler Experts! I am reading the book, "Programming Languages, Application and Interpretation" by Shriram Krishnamurthi. The book says that Lisp and Scheme have a primitive called "read". The book says, "The read primitive is a crown jewel of Lisp and Scheme." Some of my notes from reading the book: - Read does tokenizing and reading. - Read returns a value known as an s-expression. - The s-expression is an intermediate representation. - The output of read is either a number or a list. That's it! Example of tokenizing/parsing using read: (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4)) The first expression (+ 3 4) is the concrete syntax. The middle expression (list `+ 3 4) is an s-expression. It is an intermediate representation. The last expression (add (num 3) (num 4)) is the abstract syntax. The book says: read is one of the great ideas of computer science. It helps decompose a fundamentally difficult process - generalized parsing of the input stream - into two simple processes: (1) reading the input stream into an intermediate representation (2) parsing that intermediate representation I've read several compiler books and none of them talked about this. They talk about creating a lexer to generate a stream of tokens and a parser that receives the tokens and arranges them into a tree data structure. Why no mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of the great ideas of computer science"? I have done some work with Flex and Bison and recently I've done some work with building parsers using read. My experience is the latter is much easier. Why isn't read more widely discussed and used in the compiler community? Surely the concept that read embodies is not specific to Lisp and Scheme, right? /Roger [Yes, it's specific to Lisp and Scheme. They have an extremely simple symtax called S expressions of nested parenthesized lists of space separated tokens with some quoting. The original plan was that Lisp 2 would have M expressions that looked more like a normal language but it's over 50 years later and they still haven't gotten around to it. -John]
[toc] | [next] | [standalone]
| From | Spiros Bousbouras <spibou@gmail.com> |
|---|---|
| Date | 2023-03-25 11:55 +0000 |
| Subject | Re: Lisp syntax, was A simpler way to tokenize and parse? |
| Message-ID | <23-03-016@comp.compilers> |
| In reply to | #3413 |
On Fri, 24 Mar 2023 14:45:40 +0000 Roger L Costello <costello@mitre.org> wrote: > Hello Compiler Experts! > > I am reading the book, "Programming Languages, Application and Interpretation" > by Shriram Krishnamurthi. > > The book says that Lisp and Scheme have a primitive called "read". > > The book says, "The read primitive is a crown jewel of Lisp and Scheme." > > Some of my notes from reading the book: > > - Read does tokenizing and reading. > - Read returns a value known as an s-expression. > - The s-expression is an intermediate representation. > - The output of read is either a number or a list. That's it! For educational examples perhaps it's only a number or a list. But for real world usage it has to be more. In Common Lisp it can be a string or a symbol (an identifier more or less) or a vector or a number of other Common Lisp types. If the programmer defines his own classes (which then count as new types) then automatically syntax is created to be able to read and return such objects too. [...] > I've read several compiler books and none of them talked about this. They talk > about creating a lexer to generate a stream of tokens and a parser that > receives the tokens and arranges them into a tree data structure. Why no > mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of > the great ideas of computer science"? > > I have done some work with Flex and Bison and recently I've done some work > with building parsers using read. My experience is the latter is much easier. > Why isn't read more widely discussed and used in the compiler community? Probably because there really isn't much to say. It's straightforward to parse so if it works for your needs then you don't need to read any compiler books about it. > Surely the concept that read embodies is not specific to Lisp and Scheme, > right? It is specific to when you have a very simple and uniform syntax and experience suggests that this isn't to most people's taste. Whether it is a result of "mental wiring" or tradition (including mathematical tradition) to which one gets exposed from a young age , I don't know. What I mean by this is that most people seem to find it easier to read a + b * c as opposed to (+ a (* b c)) and I don't know if this is just the result of early exposure or an inherent part of how most humans' brains function. Another issue is that sometimes people have to turn mathematical notation in computer programmes and it is an extra step to transform a + b * c to (+ a (* b c)) regardless of which one finds easier to read in isolation. In mathematical logic the formal syntax also specifies a uniform and simple notation , usually based on parentheses , but even there authors immediately introduce conventions about operator precedence so that you don't have to read (and they don't have to type) so many parentheses. So what in formal syntax would be for example ((A ∧ B) → C) becomes A ∧ B → C where ∧ is specified to have higher precedence than → . I note that Forth also has a very simple and uniform syntax and again Forth isn't very popular. Moderator wrote: > /Roger > [Yes, it's specific to Lisp and Scheme. They have an extremely simple > symtax called S expressions of nested parenthesized lists of space > separated tokens with some quoting. The original plan was that Lisp 2 > would have M expressions that looked more like a normal language but > it's over 50 years later and they still haven't gotten around to it. Actually Common Lisp has gotten around to it. I have seen Common Lisp libraries which create a more conventional syntax for Common Lisp and even claim to retain the power of writing macros. But I've never paid much attention because the usual Common Lisp syntax works fine for me. So I can't provide links , perhaps someone on comp.lang.lisp will know. I don't think that such libraries have seen much if any use. From what I recall , even their authors did not claim that they prefer the different syntax but they were simply hoping that with a more conventional syntactic wrapper Common Lisp (or some Lisp) would become more popular ; or perhaps they saw it as an interesting intellectual exercise. -- vlaho.ninja/prog
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2023-03-25 14:32 -0700 |
| Subject | Re: Lisp syntax, was A simpler way to tokenize and parse? |
| Message-ID | <23-03-020@comp.compilers> |
| In reply to | #3418 |
On Saturday, March 25, 2023 at 7:48:45 AM UTC-7, Spiros Bousbouras wrote: (snip) > It is specific to when you have a very simple and uniform syntax and experience > suggests that this isn't to most people's taste. Whether it is a result of > "mental wiring" or tradition (including mathematical tradition) to which one > gets exposed from a young age , I don't know. What I mean by this is that most > people seem to find it easier to read > a + b * c > as opposed to > (+ a (* b c)) > > and I don't know if this is just the result of early exposure or an inherent > part of how most humans' brains function. Years ago, when there was actual competition between TI and HP for calculator sales, there was much discussion on the advantages of HPs RPN (postfix notation) vs. TI's algebraic (infix notation). That seems to have gone away now, and is not discussed much. What it always seemed to me, was it was easier to think in postfix terms, but easier to write in infix notation. You won't find algebra or calculus books writing expressions in prefix or postfix form. No idea about mental wiring or being exposed at a young age. I wonder about studies of young(er) kids learning to use calculators or in teaching math to younger kids.
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2023-03-25 13:14 +0000 |
| Subject | Re: Lisp syntax, was A simpler way to tokenize and parse? |
| Message-ID | <23-03-018@comp.compilers> |
| In reply to | #3413 |
>[[...] The original plan was that Lisp 2 >would have M expressions that looked more like a normal language but >it's over 50 years later and they still haven't gotten around to it. >-John] Actually they have. Some HOPL paper (or several of them) discuss this: There were several attempts at an Algol-like syntax, but Lisp proprammers found that they preferred programming in S-Expressions over the Algol-like syntax, whether it's M-Expressions, Dylan syntax, or several other attempts. The only language which might be considered a success at having an Algol-like syntax in something similar to Lisp is JavaScript. Maybe this is just because JavaScript is far enough from Lisp not just in syntax, and there is no S-expression syntax for JavaScript, is there? - anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <864-117-4973@kylheku.com> |
|---|---|
| Date | 2023-03-26 00:46 +0000 |
| Subject | Re: Lisp syntax, was A simpler way to tokenize and parse? |
| Message-ID | <23-03-021@comp.compilers> |
| In reply to | #3420 |
On 2023-03-25, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote: >>[[...] The original plan was that Lisp 2 >>would have M expressions that looked more like a normal language but >>it's over 50 years later and they still haven't gotten around to it. >>-John] > > Actually they have. Some HOPL paper (or several of them) discuss > this: There were several attempts at an Algol-like syntax, but Lisp > proprammers found that they preferred programming in S-Expressions > over the Algol-like syntax, whether it's M-Expressions, Dylan syntax, > or several other attempts. The situation is more nuanced. Common Lisp has programmable read tables which let you have any surface syntax, and this is used. It's just not the predominant mode of writing the bulk of the code. For instance, the cl-interpol library provides string syntax with interpolation. There exists an open source module for Common Lisp called named-readtables which provides disciplined registration for managing multiple read-tables. If a developer wants to mix multiple read-table-based syntaxes in the same source, they can clash. Racket is a popular language based on Scheme, which also has programmable syntax. A Racket source file can begin with a #lang directive which indicates which language module is being used; that syntax then applies to the rest of the file. I have the impression that this is farily widely used in the Racket world. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal Mastodon: @Kazinator@mstdn.ca
[toc] | [prev] | [next] | [standalone]
| From | Lieven Marchand <mal@wyrd.be> |
|---|---|
| Date | 2023-03-25 19:58 +0100 |
| Message-ID | <23-03-019@comp.compilers> |
| In reply to | #3413 |
Roger L Costello <costello@mitre.org> writes: > I have done some work with Flex and Bison and recently I've done some work > with building parsers using read. My experience is the latter is much easier. > Why isn't read more widely discussed and used in the compiler community? > Surely the concept that read embodies is not specific to Lisp and Scheme, > right? Apart from the already mentioned problem that it forces you into a syntax that a lot of people don't like, there's also the problem that you have to deal with hostile input. Where you expect "(+ 2 3)" someone will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". A lot of security problems in real world settings come from not correctly validating inputs and by the time you have worked around all these problems read isn't all that easy anymore. C for example has a somewhat similar facility scanf that tries to pattern match input and is also considered unsafe. A good rule of thumb for production ready software is to define a grammar for valid input and provide a validating parser. -- Laat hulle almal sterf. Ek is tevrede om die wêreld te sien brand en die vallende konings te spot. Ek en my aasdier sal loop op die as van die verwoeste aarde.
[toc] | [prev] | [next] | [standalone]
| From | Spiros Bousbouras <spibou@gmail.com> |
|---|---|
| Date | 2023-03-26 14:10 +0000 |
| Message-ID | <23-03-024@comp.compilers> |
| In reply to | #3421 |
On Sat, 25 Mar 2023 19:58:58 +0100 Lieven Marchand <mal@wyrd.be> wrote: > Apart from the already mentioned problem that it forces you into a > syntax that a lot of people don't like, there's also the problem that > you have to deal with hostile input. Where you expect "(+ 2 3)" someone > will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". The ability to read S-expressions does not imply that you must also have a "read and evaluate" facility. In any case , in Common Lisp you can trivially disable it by setting *READ-EVAL* to NIL .
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <864-117-4973@kylheku.com> |
|---|---|
| Date | 2023-03-26 18:19 +0000 |
| Message-ID | <23-03-026@comp.compilers> |
| In reply to | #3421 |
On 2023-03-25, Lieven Marchand <mal@wyrd.be> wrote: > Roger L Costello <costello@mitre.org> writes: > >> I have done some work with Flex and Bison and recently I've done some work >> with building parsers using read. My experience is the latter is much easier. >> Why isn't read more widely discussed and used in the compiler community? >> Surely the concept that read embodies is not specific to Lisp and Scheme, >> right? > > Apart from the already mentioned problem that it forces you into a > syntax that a lot of people don't like, there's also the problem that > you have to deal with hostile input. Where you expect "(+ 2 3)" someone > will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". A lot of security Not every Lisp dialect has hash-dot read-time evaluation; that's a feature of Common Lisp, disabled by setting/binding *read-eval* to nil. I don't seem to recall that Scheme has it. I deliberately kept it out of TXR Lisp. However, that doesn't disable compile-time evaluation in macros, which kicks in if you feed the read code to the compile function. compile must be regarded the same as eval from a security POV. We are seeing compile-time evaluation in newer languages, though. It's not a bona-fide security issue, except in applications that dynamically compile untrusted input. Since the aim is almost always to execute it, whether the malice happens at compile time or run time. Both have to be sandboxed. When you're building an open-source program, it's a given that you're running its code: shell scripts, make files or what have you. It doesn't need read-time evaluation to perpetrate malice. > problems in real world settings come from not correctly validating > inputs and by the time you have worked around all these problems read > isn't all that easy anymore. C for example has a somewhat similar > facility scanf that tries to pattern match input and is also considered > unsafe. scanf is unsafe, but not in the way that hash-dot read-time evaluation is unsafe. The situations are not comparable. scanf doesn't feature a documented, reliable scan-time programing language that the *input* can use to extend the program which calls scanf, and which can be turned off by a flag. You have to exploit a buffer overflow or whatever, enabled by careless use of scanf. All they have in common is that calling read on untrusted input without disabling *read-eval* is a kind of careless use of read. But setting *read-eval* to nil is s something you may be able to do in just one place in the entire application. (Anything which needs *read-eval* can opt-in using (let ((*read-eval t)) ...) around its calls to read. > A good rule of thumb for production ready software is to define > a grammar for valid input and provide a validating parser. Sure, if you want to waste your time defining grammars and writing validating parsers. This is no longer done that much outside of the Lisp world. People use XML, JSON, Yaml, ..., whose grammar they definitely didn't design or implement, and validate the content/shape of the object that comes out. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal Mastodon: @Kazinator@mstdn.ca
[toc] | [prev] | [next] | [standalone]
| From | Lieven Marchand <mal@wyrd.be> |
|---|---|
| Date | 2023-03-27 23:15 +0200 |
| Subject | Re: Lisp syntax, A simpler way to tokenize and parse? |
| Message-ID | <23-03-028@comp.compilers> |
| In reply to | #3428 |
Kaz Kylheku <864-117-4973@kylheku.com> writes: > Not every Lisp dialect has hash-dot read-time evaluation; that's > a feature of Common Lisp, disabled by setting/binding *read-eval* > to nil. I don't seem to recall that Scheme has it. I deliberately > kept it out of TXR Lisp. You also should wrap a use of read in a with-standard-io-syntax and even then there can still be surprises in CL. One implementation I used by default still also had the #, that has disappeared from the CLtL2 era of the standard. >> A good rule of thumb for production ready software is to define >> a grammar for valid input and provide a validating parser. > > Sure, if you want to waste your time defining grammars and > writing validating parsers. > > This is no longer done that much outside of the Lisp world. People use XML, > JSON, Yaml, ..., whose grammar they definitely didn't design or > implement, and validate the content/shape of the object that comes out. The Haskell and the Rust world have partially gone to applicative and monadic parsing. In such a paradigm it's not that hard to write grammars. -- Laat hulle almal sterf. Ek is tevrede om die wêreld te sien brand en die vallende konings te spot. Ek en my aasdier sal loop op die as van die verwoeste aarde.
[toc] | [prev] | [next] | [standalone]
| From | Kaz Kylheku <864-117-4973@kylheku.com> |
|---|---|
| Date | 2023-03-26 01:17 +0000 |
| Message-ID | <23-03-023@comp.compilers> |
| In reply to | #3413 |
On 2023-03-24, Roger L Costello <costello@mitre.org> wrote: > Example of tokenizing/parsing using read: > > (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4)) You've not quite hit upon how it works, and I'd encourage you to keep exploring. Read takes the seven characters (+ 3 4) and returns an object which stands for the same thinig. When Lisp programmers discuss that object, they refer to it using the same notation (+ 3 4). Actual copy-paste from a Lisp session: [1]> (read-from-string "(+ 3 4)") (+ 3 4) ; 7 The second return value of read-from-string, 7, isn't the value of the expression; it's the position of the first character of the string which was not read. Our expression is seven characters long. > > The first expression (+ 3 4) is the concrete syntax. > The middle expression (list `+ 3 4) is an s-expression. It is an intermediate > representation. "S-expression" actually refers to the character syntax. The object in memory is just an expression. The reader in Lisps like Scheme and Common Lisp perpetrates no such embellishment. The symbol "list" and quotation around the + will not appear from reading "(+ 3 7)". You get a three-element list, made up out of three cons cells (pair-like objects), whose elements are strictly those that are implied by the read syntax: the + symbol and the two numbers. > The last expression (add (num 3) (num 4)) is the abstract syntax. No such thing is user-visible in any mainstream Lisp. Lisp interpreters directly evaluate the (+ 3 4) object. Lisp compilers potentially build some annotated syntax tree, but this is not a documented feature of any Lisp that I know; it will be an internal matter. Compiling the raw (+ 3 4) form is perfectly possible. > > The book says: read is one of the great ideas of computer science. It helps > decompose a fundamentally difficult process - generalized parsing of the input > stream - into two simple processes: > > (1) reading the input stream into an intermediate representation > (2) parsing that intermediate representation The bigger idea in Lisp is actually "print-read consistency": that objects have a printed notation that the machine can produce, which the machine can read to reproduce a similar object. Not all objects have print-read consistency in Lisp, but things are usualy strict int he mature Lisp dialects. If something doesn't have print-read consistency, it will print in an unreadable form that generates an error. In Common Lisp, the character sequence #< (sharpsign less-than), in the standard read-table, signals an error. Objects which don't have a printed notation that can be read can use that syntax, e.g. #<socket-handle 10.1.2.3:8080>. > I've read several compiler books and none of them talked about this. They talk > about creating a lexer to generate a stream of tokens and a parser that > receives the tokens and arranges them into a tree data structure. Why no > mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of > the great ideas of computer science"? It's because we are not in a branch of the parallel universe in which a lot of people know about and program in Lisp. The Lisp microcosm has a lot to say on many topics, but computing is largely ignorant of it. > I have done some work with Flex and Bison and recently I've done some work > with building parsers using read. My experience is the latter is much easier. > Why isn't read more widely discussed and used in the compiler community? > Surely the concept that read embodies is not specific to Lisp and Scheme, > right? S-expressions do crop up outside of Lisp. The IMAP4 protocol uses them. The GNU C compiler uses a form of S-expression internally. Look up RTL: https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL The Rational Rose object design tool stores files in a S-expression format called Petal. -- TXR Programming Language: http://nongnu.org/txr Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal Mastodon: @Kazinator@mstdn.ca
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web