Groups > comp.lang.forth > #24933 > unrolled thread

Scanning versus Parsing

Started by	Ian van Breda <igvb@btopenworld.com>
First post	2013-08-02 13:54 +0100
Last post	2013-08-15 11:07 +0100
Articles	9 — 5 participants

Back to article view | Back to comp.lang.forth

  Scanning versus Parsing Ian van Breda <igvb@btopenworld.com> - 2013-08-02 13:54 +0100
    Re: Scanning versus Parsing Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-08-02 09:14 -0500
      Re: Scanning versus Parsing Ian van Breda <igvb@btopenworld.com> - 2013-08-06 09:24 +0100
    Re: Scanning versus Parsing anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2013-08-02 15:25 +0000
    Re: Scanning versus Parsing albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-08-06 10:28 +0000
      Re: Scanning versus Parsing Ian van Breda <igvb@btopenworld.com> - 2013-08-06 14:31 +0100
        Re: Scanning versus Parsing albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-08-06 17:47 +0000
    Re: Scanning versus Parsing Hans Bezemer <the.beez.speaks@gmail.com> - 2013-08-11 19:20 +0200
      Re: Scanning versus Parsing Ian van Breda <igvb@btopenworld.com> - 2013-08-15 11:07 +0100

#24933 — Scanning versus Parsing

From	Ian van Breda <igvb@btopenworld.com>
Date	2013-08-02 13:54 +0100
Subject	Scanning versus Parsing
Message-ID	<CE216AB0.36B5%igvb@btopenworld.com>

The terms 'parse' and 'parsing' are widely use in the documentation on
both in the original ANS Standard of 1994, [1], and in the draft
Proposal, [2].
 
However, the term 'parsing', in both the treatment of natural language
and the theory of computer languages, implies analysis of the structure
of a sentence or language construct. In computing it refers to checking
that the tokens extracted from the source code of a program follow the
grammatical rules of the language [3] and [4]. There are many examples
in the literature.
 
For example, parsing applies to the 'productions' that form the grammar
of a language, such as in a while statement in Pascal
 
   <while_statement>  ->  while <boolean_expression> do <statement> ;
 
which follows the rule that a while statement must begin with a 'while'
token, followed by a 'boolean expression'.  This in turn is followed by
a 'do' token, followed by a 'statement', which itself may be a compound
statement.
 
The universally accepted term for extracting language tokens from the
input source code, as elsewhere in computing, is 'scanning'.  In Forth
this is done by extracting words delimited by spaces, tabs and line
endings.
 
This suggests that the vast majority of the descriptive text in the
proposed Standard needs to be changed from variants of 'parse' to their
'scan' equivalents
 
We cannot change PARSE itself, as it is now cast in stone (the penalty
for inaccuracy in setting up the ANS Standard). However, it is possible
to adopt SCAN-NAME in place of PARSE-NAME. It would also be useful to have
a multi-line SCAN which bypasses comments.
 
The argument that it has always been done this way is surely not valid
in the face of the fundamental usage of the term, parsing, elsewhere:
Forth cannot ignore the outside world and, at least in this case, it is
irrefutably common practice to use scanning for this type of process.
Forth seems to be unique in using the term parsing instead of scanning.
 
This problem is highlighted when we try to use Forth to generate parsers
for other languages, for which Forth works extremely well.
 
 
[1] ANS Forth (ANS X3.215-1994 Information Systems  Programming Languages
   FORTH), 1994.
[2] Forth Standards Committee. Forth 200x Draft 11.1. 29th February 2012.
[3] C. N. Fischer, R. K. Cytron and R. J. LeBlanc. Crafting a Compiler.
   Pearson Education, Inc, 2010.
[4] I.G. van Breda. Building an LR parser for Pascal using Forth.
   EuroForth 2012.

Ian van Breda

[toc] | [next] | [standalone]

#24936

From	Andrew Haley <andrew29@littlepinkcloud.invalid>
Date	2013-08-02 09:14 -0500
Message-ID	<8IWdnZJHcowtI2bMnZ2dnUVZ_h2dnZ2d@supernews.com>
In reply to	#24933

Ian van Breda <igvb@btopenworld.com> wrote:

> The terms 'parse' and 'parsing' are widely use in the documentation
> on both in the original ANS Standard of 1994, [1], and in the draft
> Proposal, [2].
> 
> However, the term 'parsing', in both the treatment of natural
> language and the theory of computer languages, implies analysis of
> the structure of a sentence or language construct. In computing it
> refers to checking that the tokens extracted from the source code of
> a program follow the grammatical rules of the language [3] and
> [4]. There are many examples in the literature.
> 
> For example, parsing applies to the 'productions' that form the
> grammar of a language, such as in a while statement in Pascal
> 
>   <while_statement>  ->  while <boolean_expression> do <statement> ;
> 
> which follows the rule that a while statement must begin with a
> 'while' token, followed by a 'boolean expression'.  This in turn is
> followed by a 'do' token, followed by a 'statement', which itself
> may be a compound statement.
> 
> The universally accepted term for extracting language tokens from
> the input source code, as elsewhere in computing, is 'scanning'.  In
> Forth this is done by extracting words delimited by spaces, tabs and
> line endings.
> 
> This suggests that the vast majority of the descriptive text in the
> proposed Standard needs to be changed from variants of 'parse' to
> their 'scan' equivalents

No, it doesn't.  The term is defined in Section 2.1, Definitions of
terms, and is consistent with historic Forth usage.  There are many
terms used in Forth which are not consistent with the rest of computer
science: "word" is one such.

> The argument that it has always been done this way is surely not
> valid in the face of the fundamental usage of the term, parsing,
> elsewhere: Forth cannot ignore the outside world and, at least in
> this case, it is irrefutably common practice to use scanning for
> this type of process. 

I take your point, but it's hardly the job of the committee to
determine what language Forthers use.  We could change to match
commonplace CS use, at the cost of being incosistent with all the
Forth literature.  That's not a good trade-off.

> Forth seems to be unique in using the term parsing instead of
> scanning.

That's true, but Forth uses its own terminology for many things, and I
don't think we should break with forty years of usage.  Vive la
difference!

Andrew.

[toc] | [prev] | [next] | [standalone]

#25015

From	Ian van Breda <igvb@btopenworld.com>
Date	2013-08-06 09:24 +0100
Message-ID	<CE267166.3731%igvb@btopenworld.com>
In reply to	#24936

Andrew Haley wrote on 02/08/2013 15:14:

> Ian van Breda <igvb@btopenworld.com> wrote:
> 
>> The universally accepted term for extracting language tokens from
>> the input source code, as elsewhere in computing, is 'scanning'.  In
>> Forth this is done by extracting words delimited by spaces, tabs and
>> line endings.
>> 
>> This suggests that the vast majority of the descriptive text in the
>> proposed Standard needs to be changed from variants of 'parse' to
>> their 'scan' equivalents
> 
> No, it doesn't.  The term is defined in Section 2.1, Definitions of
> terms, and is consistent with historic Forth usage.  There are many
> terms used in Forth which are not consistent with the rest of computer
> science: "word" is one such.

Irrespective of what the standard may say, my dictionary describes parsing
as: 'Describe (word) grammatically, stating inflexion, relation to sentence,
etc.; resolve (sentence) into component parts of speech and describe them
grammatically'.

'Word' does indeed have many meanings in common use, so is not relevant in
this context.

>> The argument that it has always been done this way is surely not
>> valid in the face of the fundamental usage of the term, parsing,
>> elsewhere: Forth cannot ignore the outside world and, at least in
>> this case, it is irrefutably common practice to use scanning for
>> this type of process.
> 
> I take your point, but it's hardly the job of the committee to
> determine what language Forthers use.  We could change to match
> commonplace CS use, at the cost of being incosistent with all the
> Forth literature.  That's not a good trade-off.
> 
>> Forth seems to be unique in using the term parsing instead of
>> scanning.
> 
> That's true, but Forth uses its own terminology for many things, and I
> don't think we should break with forty years of usage.  Vive la
> difference!
> 

Unfortunately, the term 'parsing', as used in Forth (no grammatical
context), conflicts with the term as used in other languages and also in the
theory of natural languages.  I have a number of books on compiling, all
starting with scanning followed by parsing of the grammar as a separate
activity: Gries, Fischer et al, Wait et al, Appel, Jensen at al., typically
use a BNF form of grammar, or something similar.  This is the only conflict
of this type in either the ANS Standard or in the proposed standard that I
can see.

It was clear to me, in the context of large telescope systems and in the
laboratory, that it was necessary to be able accommodate a variety of
languages in the Forth environment, particularly, in respect of imaging
systems, where there is a variety of libraries and numerical algorithms
available.

One of the main complaints made by astronomers was that Forth cannot in be
easily integrated with other languages.  However, there is a problem with
*other* languages in that they generally us a single stack for both data and
return addresses.  Forth has a considerable advantage in using separate data
and return stacks, particularly allowing compound definitions to be
implemented.  However, this is not a problem *if* the other language uses a
separate data stack, in which case, we can integrate the two approaches.
This is a failing in the *implementation* of other languages.

Another advantage of Forth over other operating systems, is in it's use of
multi-tasking by its very nature: the original Forth came with multi-tasking
as part and parcel of the system, both terminal and background tasks.  In
this context, it was easy to implement time-slicing.  This allows the system
to respond to interrupts very efficiently for both low level and high level
events.  By using a task-specific user-table means that the response to
interrupts is as good as it gets.

The problem here was to build a compiler for other languages which could be
integrated into Forth.  The cornerstones in generating such a compiler are
scanning and parsing.

For the first part, scanning, is particularly easy in Forth, where the
source text is done (mostly) by splitting it into 'words', but in languages
such as C or Pascal, a rather more complicated process is needed, e.g.
   x=y+z;
will need to be separated into six tokens, but all juxtaposed in this case.

The second part is to treat the grammar as a BNF file that can be simply
INCLUDEd.  Each 'production' is similar to a series of named colon-style
definitions, one for each type of 'nonterminal' (left-hand side).  It is a
bit more complicated than a series of colon definitions in that there may
have more that one 'production' for a given definition and two or more
productions may begin with the the same phrase.

This can be used for to generate parse tables in Forth for an given language
that uses a BNF-style of grammar.  The resulting tables can both be used in
Forth itself or on any other platform and can be used to generates compilers
for different languages.

The result is both LL and LR compatible - usually a simplified version of LR
compatibility is used but Forth generates fully LR compatible tables.

The problem comes with the use of parsing instead of scanning where the
Forth standard meets the other languages head-on.

Of course, it sounds better if you use 'parsing' instead of the rather more
mundane 'scanning'.

[toc] | [prev] | [next] | [standalone]

#24941

From	anton@mips.complang.tuwien.ac.at (Anton Ertl)
Date	2013-08-02 15:25 +0000
Message-ID	<2013Aug2.172531@mips.complang.tuwien.ac.at>
In reply to	#24933

Ian van Breda <igvb@btopenworld.com> writes:
>The argument that it has always been done this way is surely not valid
>in the face of the fundamental usage of the term, parsing, elsewhere:
>Forth cannot ignore the outside world and, at least in this case, it is
>irrefutably common practice to use scanning for this type of process.
>Forth seems to be unique in using the term parsing instead of scanning.
> 
>This problem is highlighted when we try to use Forth to generate parsers
>for other languages, for which Forth works extremely well.

I have written (and documented) a parser generator in Forth
<http://www.complang.tuwien.ac.at/forth/gray.zip>, and did not
encounter any problems.  Also, the feedback I got back from users did
not show any confusion between compiler terminology and Forth
terminology.

More generally, I don't remember any confusion between Forth
terminology with other terminology on scanning/parsing in other
contexts (and parsing words have been discussed repeatedly).

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2013: http://www.euroforth.org/ef13/

[toc] | [prev] | [next] | [standalone]

#25017

From	albert@spenarnc.xs4all.nl (Albert van der Horst)
Date	2013-08-06 10:28 +0000
Message-ID	<5200cfb4$0$612$e4fe514c@dreader34.news.xs4all.nl>
In reply to	#24933

In article <CE216AB0.36B5%igvb@btopenworld.com>,
Ian van Breda  <igvb@btopenworld.com> wrote:
>The terms 'parse' and 'parsing' are widely use in the documentation on
>both in the original ANS Standard of 1994, [1], and in the draft
>Proposal, [2].
>
>However, the term 'parsing', in both the treatment of natural language
>and the theory of computer languages, implies analysis of the structure
>of a sentence or language construct. In computing it refers to checking
>that the tokens extracted from the source code of a program follow the
>grammatical rules of the language [3] and [4]. There are many examples
>in the literature.

According to this definition there is no parsing in Forth.
Blame us for using the term parsing for the nearest thing.

>Ian van Breda

Groetjes Albert
>
>
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

[toc] | [prev] | [next] | [standalone]

#25030

From	Ian van Breda <igvb@btopenworld.com>
Date	2013-08-06 14:31 +0100
Message-ID	<CE26B94B.3874%igvb@btopenworld.com>
In reply to	#25017

in article Albert van der Horst at wrote on 06/08/2013 11:28:

> In article <CE216AB0.36B5%igvb@btopenworld.com>,
> Ian van Breda  <igvb@btopenworld.com> wrote:
>> The terms 'parse' and 'parsing' are widely use in the documentation on
>> both in the original ANS Standard of 1994, [1], and in the draft
>> Proposal, [2].
>> 
>> However, the term 'parsing', in both the treatment of natural language
>> and the theory of computer languages, implies analysis of the structure
>> of a sentence or language construct. In computing it refers to checking
>> that the tokens extracted from the source code of a program follow the
>> grammatical rules of the language [3] and [4]. There are many examples
>> in the literature.
> 
> According to this definition there is no parsing in Forth.
> Blame us for using the term parsing for the nearest thing.
>> 
>> 
It would have been less confusing if the standards committee had used
scanning instead of parsing.

[toc] | [prev] | [next] | [standalone]

#25037

From	albert@spenarnc.xs4all.nl (Albert van der Horst)
Date	2013-08-06 17:47 +0000
Message-ID	<520136b5$0$3184$e4fe514c@dreader36.news.xs4all.nl>
In reply to	#25030

In article <CE26B94B.3874%igvb@btopenworld.com>,
Ian van Breda  <igvb@btopenworld.com> wrote:
>in article Albert van der Horst at wrote on 06/08/2013 11:28:
>
>> In article <CE216AB0.36B5%igvb@btopenworld.com>,
>> Ian van Breda  <igvb@btopenworld.com> wrote:
>>> The terms 'parse' and 'parsing' are widely use in the documentation on
>>> both in the original ANS Standard of 1994, [1], and in the draft
>>> Proposal, [2].
>>>
>>> However, the term 'parsing', in both the treatment of natural language
>>> and the theory of computer languages, implies analysis of the structure
>>> of a sentence or language construct. In computing it refers to checking
>>> that the tokens extracted from the source code of a program follow the
>>> grammatical rules of the language [3] and [4]. There are many examples
>>> in the literature.
>>
>> According to this definition there is no parsing in Forth.
>> Blame us for using the term parsing for the nearest thing.
>>>
>>>
>It would have been less confusing if the standards committee had used
>scanning instead of parsing.

I doubt it. The standards committee used decennia-old Forth terminology.

I've redefined a non-quite->IN to PP (parse-pointer) in yourforth.
It points to the next character in the input stream.
Should that be named SP (scan-pointer)?

Groetjes Albert
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

[toc] | [prev] | [next] | [standalone]

#25103

From	Hans Bezemer <the.beez.speaks@gmail.com>
Date	2013-08-11 19:20 +0200
Message-ID	<5207c6fb$0$15869$e4fe514c@news2.news.xs4all.nl>
In reply to	#24933

Ian van Breda wrote:
> The universally accepted term for extracting language tokens from the
> input source code, as elsewhere in computing, is 'scanning'.  In Forth
> this is done by extracting words delimited by spaces, tabs and line
> endings.
I don't agree: I know the process as "tokenizing" (see:
http://en.wikipedia.org/wiki/Tokenization). I know that very well, because
tokenizing is a separate process in my 4tH compiler (it not only
eliminates "double" delimiters and the various whitespaces, it also takes
care of strings and such) which also classifies tokens to type while doing
it.

After that, if parses the whole bunch AND compiles it at the same time by
simply executing its compile time behavior.

In Forth similar stuff is done: a whitespace is scanned and then words are
executed - which IMHO is "analysing a string of symbols, either in natural
language or in computer languages, according to the rules of a formal
grammar". In as much Forth HAS a formal grammer.

E.g. IF is parsed by scanning a whitespace, executing its compile time
behavior and compiling a "0-BRANCH" instruction. Many things are happening
at the same time, which causes all this confusion.

Calling it "scanning" is as much right or wrong as calling it "parsing" (in
its classic meaning), "execution" or "compilation".

Note we also say we "compile" a Forth source while true (wordwise)
compilation is just a small part of the process.

So, in short: don't try to be a purist when the whole vocabulary concerning
the process is blurred already. You won't fix anything, at best you're
confusing things even more.

Hans Bezemer

[toc] | [prev] | [next] | [standalone]

#25216

From	Ian van Breda <igvb@btopenworld.com>
Date	2013-08-15 11:07 +0100
Message-ID	<CE3266E0.396C%igvb@btopenworld.com>
In reply to	#25103

Hans Bezemer wrote on 11/08/2013 18:20:

> Ian van Breda wrote:
>> The universally accepted term for extracting language tokens from the
>> input source code, as elsewhere in computing, is 'scanning'.  In Forth
>> this is done by extracting words delimited by spaces, tabs and line
>> endings.
> I don't agree: I know the process as "tokenizing" (see:
> http://en.wikipedia.org/wiki/Tokenization). I know that very well, because
> tokenizing is a separate process in my 4tH compiler (it not only
> eliminates "double" delimiters and the various whitespaces, it also takes
> care of strings and such) which also classifies tokens to type while doing
> it.
> 
> After that, if parses the whole bunch AND compiles it at the same time by
> simply executing its compile time behavior.
> 
> In Forth similar stuff is done: a whitespace is scanned and then words are
> executed - which IMHO is "analysing a string of symbols, either in natural
> language or in computer languages, according to the rules of a formal
> grammar". In as much Forth HAS a formal grammer.
> 
> E.g. IF is parsed by scanning a whitespace, executing its compile time
> behavior and compiling a "0-BRANCH" instruction. Many things are happening
> at the same time, which causes all this confusion.
> 
> Calling it "scanning" is as much right or wrong as calling it "parsing" (in
> its classic meaning), "execution" or "compilation".
> 
> Note we also say we "compile" a Forth source while true (wordwise)
> compilation is just a small part of the process.
> 
> So, in short: don't try to be a purist when the whole vocabulary concerning
> the process is blurred already. You won't fix anything, at best you're
> confusing things even more.
> 
> Hans Bezemer
> 
Wikipedia indeed refers to tokenisation but it is more commonly known as
scanning, see books by Gries, Fischer et al, Waite et al.  Appel refers to
Lexical Analysis which amounts to the same thing.

All of the above refer to parsing of the formal grammar.  See, for example,
Fischer et al, 'Crafting a Compiler', Chapter 3 'Theory and Practice of
Scanning' and Chapter 4 'Formal Grammars and Parsing', which separate the
scanning/tokenisation from the parsing process.

Forth does not have a formal grammar (e.g. BNF) in the sense that C and
Pascal do.  However, we still need to use, e.g., IF ... ELSE ... THEN, which
must be complete within a single definition.  Many Forth's include some form
of syntax checking, to ensure that the IF/ELSE/THEN is complete and also to
check that the return stack is balanced.  For example, we can use
IF  ... >R  ELSE  ... >R  THEN  ... R> and still balance the return stack.
But this follows both the scanning/tokenisation and parsing processes.

IF and '0-Brach' are something of a red herring in this context, in that
they come later the compiling process.  In particular, it comes under the
general heading of code generation.

It is this lack of a formal grammar (in the BNF sense) that gives Forth its
exceptional power.  For example, a Pascal LL/LR parser can be built on the
BNF formal grammar by simply INCLUDEing a file containing the 'productions'
of the grammar. e.g.
   <program_parameter_list>  ->  '(' <file_variable_list> ')' ;
where *all* of the words are simply Forth definitions held in the
dictionary, a unique feature of Forth.  This can be used to build a series
if parse tables for a variety of languages that are independent of the
platform on which the compiler is run.

In using the terms 'scanning' and, particularly, 'parsing', I am using the
terms as used elsewhere, both in compiler technology and in the theory of
natural languages.  The definition is also as described in The Oxford
Illustrated Dictionary and certainly suffers no 'blurring' or confusion in
its meaning.

This is nothing to do with being a 'purist'.  I am not 'trying to fix'
anything either.  Instead, I am seeking a unification and clarification
between Forth and other languages and for which the meaning is very clear
and unequivocal.  Of course, if you think that this is a waste of time, then
we shall just have to disagree.

Ian van Breda

[toc] | [prev] | [standalone]

csiph-web

Scanning versus Parsing

Contents

#24933 — Scanning versus Parsing

#24936

#25015

#24941

#25017

#25030

#25037

#25103

#25216