Groups > comp.compilers > #3413 > unrolled thread

A simpler way to tokenize and parse?

Started by	Roger L Costello <costello@mitre.org>
First post	2023-03-24 14:45 +0000
Last post	2023-03-26 01:17 +0000
Articles	10 — 6 participants

Back to article view | Back to comp.compilers

  A simpler way to tokenize and parse? Roger L Costello <costello@mitre.org> - 2023-03-24 14:45 +0000
    Re: Lisp syntax, was A simpler way to tokenize and parse? Spiros Bousbouras <spibou@gmail.com> - 2023-03-25 11:55 +0000
      Re: Lisp syntax, was A simpler way to tokenize and parse? gah4 <gah4@u.washington.edu> - 2023-03-25 14:32 -0700
    Re: Lisp syntax, was A simpler way to tokenize and parse? anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2023-03-25 13:14 +0000
      Re: Lisp syntax, was A simpler way to tokenize and parse? Kaz Kylheku <864-117-4973@kylheku.com> - 2023-03-26 00:46 +0000
    Re: A simpler way to tokenize and parse? Lieven Marchand <mal@wyrd.be> - 2023-03-25 19:58 +0100
      Re: A simpler way to tokenize and parse? Spiros Bousbouras <spibou@gmail.com> - 2023-03-26 14:10 +0000
      Re: A simpler way to tokenize and parse? Kaz Kylheku <864-117-4973@kylheku.com> - 2023-03-26 18:19 +0000
        Re: Lisp syntax, A simpler way to tokenize and parse? Lieven Marchand <mal@wyrd.be> - 2023-03-27 23:15 +0200
    Re: A simpler way to tokenize and parse? Kaz Kylheku <864-117-4973@kylheku.com> - 2023-03-26 01:17 +0000

#3413 — A simpler way to tokenize and parse?

From	Roger L Costello <costello@mitre.org>
Date	2023-03-24 14:45 +0000
Subject	A simpler way to tokenize and parse?
Message-ID	<23-03-011@comp.compilers>

Hello Compiler Experts!

I am reading the book, "Programming Languages, Application and Interpretation"
by Shriram Krishnamurthi.

The book says that Lisp and Scheme have a primitive called "read".

The book says, "The read primitive is a crown jewel of Lisp and Scheme."

Some of my notes from reading the book:

- Read does tokenizing and reading.
- Read returns a value known as an s-expression.
- The s-expression is an intermediate representation.
- The output of read is either a number or a list. That's it!

Example of tokenizing/parsing using read:

(+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

The first expression (+ 3 4) is the concrete syntax.
The middle expression (list `+ 3 4) is an s-expression. It is an intermediate
representation.
The last expression (add (num 3) (num 4)) is the abstract syntax.

The book says: read is one of the great ideas of computer science. It helps
decompose a fundamentally difficult process - generalized parsing of the input
stream - into two simple processes:

(1) reading the input stream into an intermediate representation
(2) parsing that intermediate representation

I've read several compiler books and none of them talked about this. They talk
about creating a lexer to generate a stream of tokens and a parser that
receives the tokens and arranges them into a tree data structure. Why no
mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of
the great ideas of computer science"?

I have done some work with Flex and Bison and recently I've done some work
with building parsers using read. My experience is the latter is much easier.
Why isn't read more widely discussed and used in the compiler community?
Surely the concept that read embodies is not specific to Lisp and Scheme,
right?

/Roger
[Yes, it's specific to Lisp and Scheme. They have an extremely simple
symtax called S expressions of nested parenthesized lists of space
separated tokens with some quoting. The original plan was that Lisp 2
would have M expressions that looked more like a normal language but
it's over 50 years later and they still haven't gotten around to it.
-John]

[toc] | [next] | [standalone]

#3418 — Re: Lisp syntax, was A simpler way to tokenize and parse?

From	Spiros Bousbouras <spibou@gmail.com>
Date	2023-03-25 11:55 +0000
Subject	Re: Lisp syntax, was A simpler way to tokenize and parse?
Message-ID	<23-03-016@comp.compilers>
In reply to	#3413

On Fri, 24 Mar 2023 14:45:40 +0000
Roger L Costello <costello@mitre.org> wrote:
> Hello Compiler Experts!
>
> I am reading the book, "Programming Languages, Application and Interpretation"
> by Shriram Krishnamurthi.
>
> The book says that Lisp and Scheme have a primitive called "read".
>
> The book says, "The read primitive is a crown jewel of Lisp and Scheme."
>
> Some of my notes from reading the book:
>
> - Read does tokenizing and reading.
> - Read returns a value known as an s-expression.
> - The s-expression is an intermediate representation.
> - The output of read is either a number or a list. That's it!

For educational examples perhaps it's only a number or a list. But for real
world usage it has to be more. In Common Lisp it can be a string or a symbol
(an identifier more or less) or a vector or a number of other Common Lisp
types. If the programmer defines his own classes (which then count as new
types) then automatically syntax is created to be able to read and return
such objects too.

[...]

> I've read several compiler books and none of them talked about this. They talk
> about creating a lexer to generate a stream of tokens and a parser that
> receives the tokens and arranges them into a tree data structure. Why no
> mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of
> the great ideas of computer science"?
>
> I have done some work with Flex and Bison and recently I've done some work
> with building parsers using read. My experience is the latter is much easier.
> Why isn't read more widely discussed and used in the compiler community?

Probably because there really isn't much to say. It's straightforward to parse
so if it works for your needs then you don't need to read any compiler books
about it.

> Surely the concept that read embodies is not specific to Lisp and Scheme,
> right?

It is specific to when you have a very simple and uniform syntax and experience
suggests that this isn't to most people's taste. Whether it is a result of
"mental wiring" or tradition (including mathematical tradition) to which one
gets exposed from a young age , I don't know. What I mean by this is that most
people seem to find it easier to read
   a + b * c
as opposed to
   (+ a (* b c))

and I don't know if this is just the result of early exposure or an inherent
part of how most humans' brains function.

Another issue is that sometimes people have to turn mathematical notation in
computer programmes and it is an extra step to transform
a + b * c   to   (+ a (* b c))   regardless of which one finds easier to read
in isolation.

In mathematical logic the formal syntax also specifies a uniform and simple
notation , usually based on parentheses , but even there authors immediately
introduce conventions about operator precedence so that you don't have to
read (and they don't have to type) so many parentheses. So what in formal
syntax would be for example
((A ∧ B) → C)  becomes   A ∧ B → C   where  ∧  is specified to have higher
precedence than  →  .

I note that Forth also has a very simple and uniform syntax and again Forth
isn't very popular.

Moderator wrote:
> /Roger
> [Yes, it's specific to Lisp and Scheme. They have an extremely simple
> symtax called S expressions of nested parenthesized lists of space
> separated tokens with some quoting. The original plan was that Lisp 2
> would have M expressions that looked more like a normal language but
> it's over 50 years later and they still haven't gotten around to it.

Actually Common Lisp has gotten around to it. I have seen Common Lisp
libraries which create a more conventional syntax for Common Lisp and even
claim to retain the power of writing macros. But I've never paid much
attention because the usual Common Lisp syntax works fine for me. So I can't
provide links , perhaps someone on  comp.lang.lisp  will know. I don't think
that such libraries have seen much if any use. From what I recall , even
their authors did not claim that they prefer the different syntax but they
were simply hoping that with a more conventional syntactic wrapper Common
Lisp (or some Lisp) would become more popular ; or perhaps they saw it as an
interesting intellectual exercise.

--
vlaho.ninja/prog

[toc] | [prev] | [next] | [standalone]

#3422 — Re: Lisp syntax, was A simpler way to tokenize and parse?

From	gah4 <gah4@u.washington.edu>
Date	2023-03-25 14:32 -0700
Subject	Re: Lisp syntax, was A simpler way to tokenize and parse?
Message-ID	<23-03-020@comp.compilers>
In reply to	#3418

On Saturday, March 25, 2023 at 7:48:45 AM UTC-7, Spiros Bousbouras wrote:

(snip)

> It is specific to when you have a very simple and uniform syntax and
experience
> suggests that this isn't to most people's taste. Whether it is a result of
> "mental wiring" or tradition (including mathematical tradition) to which one
> gets exposed from a young age , I don't know. What I mean by this is that
most
> people seem to find it easier to read
> a + b * c
> as opposed to
> (+ a (* b c))
>
> and I don't know if this is just the result of early exposure or an inherent
> part of how most humans' brains function.

Years ago, when there was actual competition between TI and HP
for calculator sales, there was much discussion on the advantages
of HPs RPN (postfix notation) vs. TI's algebraic (infix notation).

That seems to have gone away now, and is not discussed much.

What it always seemed to me, was it was easier to think in
postfix terms, but easier to write in infix notation.  You won't
find algebra or calculus books writing expressions in prefix or
postfix form.

No idea about mental wiring or being exposed at a young age.

I wonder about studies of young(er) kids learning to use
calculators or in teaching math to younger kids.

[toc] | [prev] | [next] | [standalone]

#3420 — Re: Lisp syntax, was A simpler way to tokenize and parse?

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2023-03-25 13:14 +0000
Subject	Re: Lisp syntax, was A simpler way to tokenize and parse?
Message-ID	<23-03-018@comp.compilers>
In reply to	#3413

>[[...] The original plan was that Lisp 2
>would have M expressions that looked more like a normal language but
>it's over 50 years later and they still haven't gotten around to it.
>-John]

Actually they have.  Some HOPL paper (or several of them) discuss
this: There were several attempts at an Algol-like syntax, but Lisp
proprammers found that they preferred programming in S-Expressions
over the Algol-like syntax, whether it's M-Expressions, Dylan syntax,
or several other attempts.

The only language which might be considered a success at having an
Algol-like syntax in something similar to Lisp is JavaScript.  Maybe
this is just because JavaScript is far enough from Lisp not just in
syntax, and there is no S-expression syntax for JavaScript, is there?

- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

[toc] | [prev] | [next] | [standalone]

#3423 — Re: Lisp syntax, was A simpler way to tokenize and parse?

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-03-26 00:46 +0000
Subject	Re: Lisp syntax, was A simpler way to tokenize and parse?
Message-ID	<23-03-021@comp.compilers>
In reply to	#3420

On 2023-03-25, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>>[[...] The original plan was that Lisp 2
>>would have M expressions that looked more like a normal language but
>>it's over 50 years later and they still haven't gotten around to it.
>>-John]
>
> Actually they have.  Some HOPL paper (or several of them) discuss
> this: There were several attempts at an Algol-like syntax, but Lisp
> proprammers found that they preferred programming in S-Expressions
> over the Algol-like syntax, whether it's M-Expressions, Dylan syntax,
> or several other attempts.

The situation is more nuanced. Common Lisp has programmable read tables
which let you have any surface syntax, and this is used. It's just not
the predominant mode of writing the bulk of the code.  For instance, the
cl-interpol library provides string syntax with interpolation.

There exists an open source module for Common Lisp called
named-readtables which provides disciplined registration for managing
multiple read-tables. If a developer wants to mix multiple
read-table-based syntaxes in the same source, they can clash.

Racket is a popular language based on Scheme, which also has
programmable syntax. A Racket source file can begin with a #lang
directive which indicates which language module is being used;
that syntax then applies to the rest of the file. I have the
impression that this is farily widely used in the Racket world.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [next] | [standalone]

#3421

From	Lieven Marchand <mal@wyrd.be>
Date	2023-03-25 19:58 +0100
Message-ID	<23-03-019@comp.compilers>
In reply to	#3413

Roger L Costello <costello@mitre.org> writes:

> I have done some work with Flex and Bison and recently I've done some work
> with building parsers using read. My experience is the latter is much easier.
> Why isn't read more widely discussed and used in the compiler community?
> Surely the concept that read embodies is not specific to Lisp and Scheme,
> right?

Apart from the already mentioned problem that it forces you into a
syntax that a lot of people don't like, there's also the problem that
you have to deal with hostile input. Where you expect "(+ 2 3)" someone
will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". A lot of security
problems in real world settings come from not correctly validating
inputs and by the time you have worked around all these problems read
isn't all that easy anymore. C for example has a somewhat similar
facility scanf that tries to pattern match input and is also considered
unsafe. A good rule of thumb for production ready software is to define
a grammar for valid input and provide a validating parser.

--
Laat hulle almal sterf.  Ek is tevrede om die wêreld te sien brand en die vallende
konings te spot.  Ek en my aasdier sal loop op die as van die verwoeste aarde.

[toc] | [prev] | [next] | [standalone]

#3426

From	Spiros Bousbouras <spibou@gmail.com>
Date	2023-03-26 14:10 +0000
Message-ID	<23-03-024@comp.compilers>
In reply to	#3421

On Sat, 25 Mar 2023 19:58:58 +0100
Lieven Marchand <mal@wyrd.be> wrote:
> Apart from the already mentioned problem that it forces you into a
> syntax that a lot of people don't like, there's also the problem that
> you have to deal with hostile input. Where you expect "(+ 2 3)" someone
> will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))".

The ability to read S-expressions does not imply that you must also have
a "read and evaluate" facility. In any case , in Common Lisp you can
trivially disable it by setting *READ-EVAL* to NIL .

[toc] | [prev] | [next] | [standalone]

#3428

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-03-26 18:19 +0000
Message-ID	<23-03-026@comp.compilers>
In reply to	#3421

On 2023-03-25, Lieven Marchand <mal@wyrd.be> wrote:
> Roger L Costello <costello@mitre.org> writes:
>
>> I have done some work with Flex and Bison and recently I've done some work
>> with building parsers using read. My experience is the latter is much easier.
>> Why isn't read more widely discussed and used in the compiler community?
>> Surely the concept that read embodies is not specific to Lisp and Scheme,
>> right?
>
> Apart from the already mentioned problem that it forces you into a
> syntax that a lot of people don't like, there's also the problem that
> you have to deal with hostile input. Where you expect "(+ 2 3)" someone
> will enter "(+ 2 3 #.(progn (launch-the-nukes) 4))". A lot of security

Not every Lisp dialect has hash-dot read-time evaluation; that's
a feature of Common Lisp, disabled by setting/binding *read-eval*
to nil. I don't seem to recall that Scheme has it. I deliberately
kept it out of TXR Lisp.

However, that doesn't disable compile-time evaluation in macros,
which kicks in if you feed the read code to the compile function.
compile must be regarded the same as eval from a security POV.
We are seeing compile-time evaluation in newer languages,
though.

It's not a bona-fide security issue, except in applications that
dynamically compile untrusted input. Since the aim is almost
always to execute it, whether the malice happens at compile
time or run time. Both have to be sandboxed.

When you're building an open-source program, it's a given that
you're running its code: shell scripts, make files or what
have you. It doesn't need read-time evaluation to perpetrate
malice.

> problems in real world settings come from not correctly validating
> inputs and by the time you have worked around all these problems read
> isn't all that easy anymore. C for example has a somewhat similar
> facility scanf that tries to pattern match input and is also considered
> unsafe.

scanf is unsafe, but not in the way that hash-dot read-time evaluation
is unsafe. The situations are not comparable.

scanf doesn't feature a documented, reliable scan-time programing language
that the *input* can use to extend the program which calls scanf,
and which can be turned off by a flag.

You have to exploit a buffer overflow or whatever, enabled by careless
use of scanf.

All they have in common is that calling read on untrusted input without
disabling *read-eval* is a kind of careless use of read.

But setting *read-eval* to nil is s something you may be able to do in
just one place in the entire application. (Anything which needs
*read-eval* can opt-in using (let ((*read-eval t)) ...) around its
calls to read.

> A good rule of thumb for production ready software is to define
> a grammar for valid input and provide a validating parser.

Sure, if you want to waste your time defining grammars and
writing validating parsers.

This is no longer done that much outside of the Lisp world.  People use XML,
JSON, Yaml, ..., whose grammar they definitely didn't design or
implement, and validate the content/shape of the object that comes out.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [next] | [standalone]

#3430 — Re: Lisp syntax, A simpler way to tokenize and parse?

From	Lieven Marchand <mal@wyrd.be>
Date	2023-03-27 23:15 +0200
Subject	Re: Lisp syntax, A simpler way to tokenize and parse?
Message-ID	<23-03-028@comp.compilers>
In reply to	#3428

Kaz Kylheku <864-117-4973@kylheku.com> writes:

> Not every Lisp dialect has hash-dot read-time evaluation; that's
> a feature of Common Lisp, disabled by setting/binding *read-eval*
> to nil. I don't seem to recall that Scheme has it. I deliberately
> kept it out of TXR Lisp.

You also should wrap a use of read in a with-standard-io-syntax and even
then there can still be surprises in CL. One implementation I used by
default still also had the #, that has disappeared from the CLtL2 era of
the standard.

>> A good rule of thumb for production ready software is to define
>> a grammar for valid input and provide a validating parser.
>
> Sure, if you want to waste your time defining grammars and
> writing validating parsers.
>
> This is no longer done that much outside of the Lisp world.  People use XML,
> JSON, Yaml, ..., whose grammar they definitely didn't design or
> implement, and validate the content/shape of the object that comes out.

The Haskell and the Rust world have partially gone to applicative and
monadic parsing. In such a paradigm it's not that hard to write
grammars.

--
Laat hulle almal sterf.  Ek is tevrede om die wêreld te sien brand en die vallende
konings te spot.  Ek en my aasdier sal loop op die as van die verwoeste aarde.

[toc] | [prev] | [next] | [standalone]

#3425

From	Kaz Kylheku <864-117-4973@kylheku.com>
Date	2023-03-26 01:17 +0000
Message-ID	<23-03-023@comp.compilers>
In reply to	#3413

On 2023-03-24, Roger L Costello <costello@mitre.org> wrote:
> Example of tokenizing/parsing using read:
>
> (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

You've not quite hit upon how it works, and I'd encourage you to keep
exploring.

Read takes the seven characters (+ 3 4) and returns an object
which stands for the same thinig. When Lisp programmers discuss
that object, they refer to it using the same notation (+ 3 4).

Actual copy-paste from a Lisp session:

  [1]> (read-from-string "(+ 3 4)")
  (+ 3 4) ;
  7

The second return value of read-from-string, 7, isn't the
value of the expression; it's the position of the first
character of the string which was not read. Our expression
is seven characters long.
>
> The first expression (+ 3 4) is the concrete syntax.
> The middle expression (list `+ 3 4) is an s-expression. It is an intermediate
> representation.

"S-expression" actually refers to the character syntax. The object
in memory is just an expression.

The reader in Lisps like Scheme and Common Lisp perpetrates no such
embellishment. The symbol "list" and quotation around the + will not
appear from reading "(+ 3 7)". You get a three-element list, made up out
of three cons cells (pair-like objects), whose elements are strictly
those that are implied by the read syntax: the + symbol and the two
numbers.

> The last expression (add (num 3) (num 4)) is the abstract syntax.

No such thing is user-visible in any mainstream Lisp. Lisp interpreters
directly evaluate the (+ 3 4) object.

Lisp compilers potentially build some annotated syntax tree, but
this is not a documented feature of any Lisp that I know; it will be
an internal matter.

Compiling the raw (+ 3 4) form is perfectly possible.
>
> The book says: read is one of the great ideas of computer science. It helps
> decompose a fundamentally difficult process - generalized parsing of the input
> stream - into two simple processes:
>
> (1) reading the input stream into an intermediate representation
> (2) parsing that intermediate representation

The bigger idea in Lisp is actually "print-read consistency": that
objects have a printed notation that the machine can produce, which the
machine can read to reproduce a similar object.

Not all objects have print-read consistency in Lisp, but things are
usualy strict int he mature Lisp dialects. If something doesn't have
print-read consistency, it will print in an unreadable form that
generates an error.

In Common Lisp, the character sequence #< (sharpsign less-than),
in the standard read-table, signals an error. Objects which
don't have a printed notation that can be read can use that
syntax, e.g. #<socket-handle 10.1.2.3:8080>.

> I've read several compiler books and none of them talked about this. They talk
> about creating a lexer to generate a stream of tokens and a parser that
> receives the tokens and arranges them into a tree data structure. Why no
> mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of
> the great ideas of computer science"?

It's because we are not in a branch of the parallel universe in which a
lot of people know about and program in Lisp.

The Lisp microcosm has a lot to say on many topics, but computing
is largely ignorant of it.

> I have done some work with Flex and Bison and recently I've done some work
> with building parsers using read. My experience is the latter is much easier.
> Why isn't read more widely discussed and used in the compiler community?
> Surely the concept that read embodies is not specific to Lisp and Scheme,
> right?

S-expressions do crop up outside of Lisp.

The IMAP4 protocol uses them.

The GNU C compiler uses a form of S-expression internally.
Look up RTL:

https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL

The Rational Rose object design tool stores files in a S-expression
format called Petal.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

[toc] | [prev] | [standalone]

csiph-web

A simpler way to tokenize and parse?

Contents

#3413 — A simpler way to tokenize and parse?

#3418 — Re: Lisp syntax, was A simpler way to tokenize and parse?

#3422 — Re: Lisp syntax, was A simpler way to tokenize and parse?

#3420 — Re: Lisp syntax, was A simpler way to tokenize and parse?

#3423 — Re: Lisp syntax, was A simpler way to tokenize and parse?

#3421

#3426

#3428

#3430 — Re: Lisp syntax, A simpler way to tokenize and parse?

#3425