Re: A simpler way to tokenize and parse?

From	Kaz Kylheku <864-117-4973@kylheku.com>
Newsgroups	comp.compilers
Subject	Re: A simpler way to tokenize and parse?
Date	2023-03-26 01:17 +0000
Organization	A noiseless patient Spider
Message-ID	<23-03-023@comp.compilers>
References	<23-03-011@comp.compilers>

On 2023-03-24, Roger L Costello <costello@mitre.org> wrote:
> Example of tokenizing/parsing using read:
>
> (+ 3 4) --> read --> (list `+ 3 4) --> parse --> (add (num 3) (num 4))

You've not quite hit upon how it works, and I'd encourage you to keep
exploring.

Read takes the seven characters (+ 3 4) and returns an object
which stands for the same thing. When Lisp programmers discuss
that object, they refer to it using the same notation (+ 3 4).

Actual copy-paste from a Lisp session:

  [1]> (read-from-string "(+ 3 4)")
  (+ 3 4) ;
  7

The second return value of read-from-string, 7, isn't the
value of the expression; it's the position of the first
character of the string which was not read. Our expression
is seven characters long.
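
For readers who don't have a Lisp at hand, here is a rough sketch in
Python of the same contract: take a string, return the object plus the
position of the first unread character. It is an illustration only, not
how any real Lisp reader is implemented. The name read_from_string is
borrowed from Lisp, and representing symbols as plain Python strings and
lists as Python lists is an assumption of the sketch.

```python
def read_from_string(text, pos=0):
    """Read one expression from text; return (object, next_position)."""
    # Skip leading whitespace.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        raise EOFError("no expression to read")
    if text[pos] == ")":
        raise SyntaxError("unexpected )")
    if text[pos] == "(":
        pos += 1
        items = []
        while True:
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos >= len(text):
                raise EOFError("unterminated list")
            if text[pos] == ")":
                # Position just past the closing parenthesis.
                return items, pos + 1
            item, pos = read_from_string(text, pos)
            items.append(item)
    # Atom: runs up to whitespace or a parenthesis.
    start = pos
    while pos < len(text) and not text[pos].isspace() and text[pos] not in "()":
        pos += 1
    token = text[start:pos]
    try:
        return int(token), pos
    except ValueError:
        # Assumption: symbols are modeled as plain strings.
        return token, pos


obj, end = read_from_string("(+ 3 4)")
print(obj, end)
```

Here obj is the three-element list and end is 7, the same two values
as in the session above.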
>
> The first expression (+ 3 4) is the concrete syntax.
> The middle expression (list `+ 3 4) is an s-expression. It is an intermediate
> representation.

"S-expression" actually refers to the character syntax. The object
in memory is just an expression.

The reader in Lisps like Scheme and Common Lisp perpetrates no such
embellishment. The symbol "list" and quotation around the + will not
appear from reading "(+ 3 4)". You get a three-element list, made up out
of three cons cells (pair-like objects), whose elements are strictly
those that are implied by the read syntax: the + symbol and the two
numbers.
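
To make the "three cons cells" concrete, here is a small sketch in
Python. The Cons class and the helpers make_list and count_cells are
hypothetical names invented for this illustration, and None stands in
for the empty list; a real Lisp has its own cell representation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Cons:
    car: Any  # the element held by this cell
    cdr: Any  # the next cell, or None standing in for the empty list


def make_list(*items):
    """Chain one cons cell per item, built from the tail backwards."""
    result = None
    for item in reversed(items):
        result = Cons(item, result)
    return result


def count_cells(cell):
    """Walk the cdr chain and count the cells."""
    n = 0
    while cell is not None:
        n += 1
        cell = cell.cdr
    return n


form = make_list("+", 3, 4)
print(count_cells(form))
```

There is no "list" symbol and no quote anywhere in that structure: the
first cell holds the + symbol, the second holds 3, the third holds 4.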

> The last expression (add (num 3) (num 4)) is the abstract syntax.

No such thing is user-visible in any mainstream Lisp. Lisp interpreters
directly evaluate the (+ 3 4) object.
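
As a sketch of what "directly evaluate" means, here is a toy evaluator
in Python that walks the list object itself, with no (add (num 3)
(num 4)) stage in between. The evaluate function and the OPERATORS
table are assumptions of this illustration, it handles only integers
and three arithmetic operators, and it expects at least one argument
per operator.

```python
import operator

# Assumption: only these three operators exist in the toy language.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(form):
    """Evaluate a nested-list expression such as ["+", 3, 4]."""
    if isinstance(form, int):
        return form  # numbers evaluate to themselves
    op, *args = form
    values = [evaluate(arg) for arg in args]
    # Fold the operator over the argument values, left to right.
    result = values[0]
    for value in values[1:]:
        result = OPERATORS[op](result, value)
    return result


print(evaluate(["+", 3, 4]))
```

The dispatch is on the shape of the data that the reader produced:
a number, or a list whose first element names the operator.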

Lisp compilers potentially build some annotated syntax tree, but
this is not a documented feature of any Lisp that I know; it will be
an internal matter.

Compiling the raw (+ 3 4) form is perfectly possible.
>
> The book says: read is one of the great ideas of computer science. It helps
> decompose a fundamentally difficult process - generalized parsing of the input
> stream - into two simple processes:
>
> (1) reading the input stream into an intermediate representation
> (2) parsing that intermediate representation

The bigger idea in Lisp is actually "print-read consistency": that
objects have a printed notation that the machine can produce, which the
machine can read to reproduce a similar object.

Not all objects have print-read consistency in Lisp, but things are
usually strict in the mature Lisp dialects. If something doesn't have
print-read consistency, it will print in an unreadable form that
generates an error when an attempt is made to read it back.

In Common Lisp, the character sequence #< (sharpsign less-than),
in the standard readtable, signals an error when read. Objects which
don't have a printed notation that can be read can use that
syntax, e.g. #<socket-handle 10.1.2.3:8080>.
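
A loose analogy outside Lisp, offered only as a sketch: Python's repr
and ast.literal_eval behave similarly for plain data, while objects
that print in angle brackets fail to read back. The round_trips helper
and the SocketHandle class are hypothetical names made up for this
illustration.

```python
import ast


def round_trips(obj):
    """True if reading obj's printed form yields an equal object."""
    try:
        return ast.literal_eval(repr(obj)) == obj
    except (ValueError, SyntaxError):
        # The printed form is not readable notation.
        return False


class SocketHandle:
    """Stand-in for an object with no readable printed notation."""

    def __repr__(self):
        return "<socket-handle 10.1.2.3:8080>"


print(round_trips(["+", 3, 4]))
print(round_trips(SocketHandle()))
```

The first call succeeds because a list of strings and integers prints
in a notation that reads back to an equal list; the second fails
because the angle-bracket notation is rejected on input.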

> I've read several compiler books and none of them talked about this. They talk
> about creating a lexer to generate a stream of tokens and a parser that
> receives the tokens and arranges them into a tree data structure. Why no
> mention of the "crown jewel" of tokenizing/parsing? Why no mention of "one of
> the great ideas of computer science"?

It's because we are not in a branch of the parallel universe in which a
lot of people know about and program in Lisp.

The Lisp microcosm has a lot to say on many topics, but computing
is largely ignorant of it.

> I have done some work with Flex and Bison and recently I've done some work
> with building parsers using read. My experience is the latter is much easier.
> Why isn't read more widely discussed and used in the compiler community?
> Surely the concept that read embodies is not specific to Lisp and Scheme,
> right?

S-expressions do crop up outside of Lisp.

The IMAP4 protocol uses them.

The GNU C compiler uses a form of S-expression internally.
Look up RTL:

https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL

The Rational Rose object design tool stores files in an S-expression
format called Petal.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
