Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.forth > #24933 > unrolled thread
| Started by | Ian van Breda <igvb@btopenworld.com> |
|---|---|
| First post | 2013-08-02 13:54 +0100 |
| Last post | 2013-08-15 11:07 +0100 |
| Articles | 9 — 5 participants |
Back to article view | Back to comp.lang.forth
Scanning versus Parsing Ian van Breda <igvb@btopenworld.com> - 2013-08-02 13:54 +0100
Re: Scanning versus Parsing Andrew Haley <andrew29@littlepinkcloud.invalid> - 2013-08-02 09:14 -0500
Re: Scanning versus Parsing Ian van Breda <igvb@btopenworld.com> - 2013-08-06 09:24 +0100
Re: Scanning versus Parsing anton@mips.complang.tuwien.ac.at (Anton Ertl) - 2013-08-02 15:25 +0000
Re: Scanning versus Parsing albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-08-06 10:28 +0000
Re: Scanning versus Parsing Ian van Breda <igvb@btopenworld.com> - 2013-08-06 14:31 +0100
Re: Scanning versus Parsing albert@spenarnc.xs4all.nl (Albert van der Horst) - 2013-08-06 17:47 +0000
Re: Scanning versus Parsing Hans Bezemer <the.beez.speaks@gmail.com> - 2013-08-11 19:20 +0200
Re: Scanning versus Parsing Ian van Breda <igvb@btopenworld.com> - 2013-08-15 11:07 +0100
| From | Ian van Breda <igvb@btopenworld.com> |
|---|---|
| Date | 2013-08-02 13:54 +0100 |
| Subject | Scanning versus Parsing |
| Message-ID | <CE216AB0.36B5%igvb@btopenworld.com> |
The terms 'parse' and 'parsing' are widely use in the documentation on both in the original ANS Standard of 1994, [1], and in the draft Proposal, [2]. However, the term 'parsing', in both the treatment of natural language and the theory of computer languages, implies analysis of the structure of a sentence or language construct. In computing it refers to checking that the tokens extracted from the source code of a program follow the grammatical rules of the language [3] and [4]. There are many examples in the literature. For example, parsing applies to the 'productions' that form the grammar of a language, such as in a while statement in Pascal <while_statement> -> while <boolean_expression> do <statement> ; which follows the rule that a while statement must begin with a 'while' token, followed by a 'boolean expression'. This in turn is followed by a 'do' token, followed by a 'statement', which itself may be a compound statement. The universally accepted term for extracting language tokens from the input source code, as elsewhere in computing, is 'scanning'. In Forth this is done by extracting words delimited by spaces, tabs and line endings. This suggests that the vast majority of the descriptive text in the proposed Standard needs to be changed from variants of 'parse' to their 'scan' equivalents We cannot change PARSE itself, as it is now cast in stone (the penalty for inaccuracy in setting up the ANS Standard). However, it is possible to adopt SCAN-NAME in place of PARSE-NAME. It would also be useful to have a multi-line SCAN which bypasses comments. The argument that it has always been done this way is surely not valid in the face of the fundamental usage of the term, parsing, elsewhere: Forth cannot ignore the outside world and, at least in this case, it is irrefutably common practice to use scanning for this type of process. Forth seems to be unique in using the term parsing instead of scanning. This problem is highlighted when we try to use Forth to generate parsers for other languages, for which Forth works extremely well. [1] ANS Forth (ANS X3.215-1994 Information Systems Programming Languages FORTH), 1994. [2] Forth Standards Committee. Forth 200x Draft 11.1. 29th February 2012. [3] C. N. Fischer, R. K. Cytron and R. J. LeBlanc. Crafting a Compiler. Pearson Education, Inc, 2010. [4] I.G. van Breda. Building an LR parser for Pascal using Forth. EuroForth 2012. Ian van Breda
[toc] | [next] | [standalone]
| From | Andrew Haley <andrew29@littlepinkcloud.invalid> |
|---|---|
| Date | 2013-08-02 09:14 -0500 |
| Message-ID | <8IWdnZJHcowtI2bMnZ2dnUVZ_h2dnZ2d@supernews.com> |
| In reply to | #24933 |
Ian van Breda <igvb@btopenworld.com> wrote: > The terms 'parse' and 'parsing' are widely use in the documentation > on both in the original ANS Standard of 1994, [1], and in the draft > Proposal, [2]. > > However, the term 'parsing', in both the treatment of natural > language and the theory of computer languages, implies analysis of > the structure of a sentence or language construct. In computing it > refers to checking that the tokens extracted from the source code of > a program follow the grammatical rules of the language [3] and > [4]. There are many examples in the literature. > > For example, parsing applies to the 'productions' that form the > grammar of a language, such as in a while statement in Pascal > > <while_statement> -> while <boolean_expression> do <statement> ; > > which follows the rule that a while statement must begin with a > 'while' token, followed by a 'boolean expression'. This in turn is > followed by a 'do' token, followed by a 'statement', which itself > may be a compound statement. > > The universally accepted term for extracting language tokens from > the input source code, as elsewhere in computing, is 'scanning'. In > Forth this is done by extracting words delimited by spaces, tabs and > line endings. > > This suggests that the vast majority of the descriptive text in the > proposed Standard needs to be changed from variants of 'parse' to > their 'scan' equivalents No, it doesn't. The term is defined in Section 2.1, Definitions of terms, and is consistent with historic Forth usage. There are many terms used in Forth which are not consistent with the rest of computer science: "word" is one such. > The argument that it has always been done this way is surely not > valid in the face of the fundamental usage of the term, parsing, > elsewhere: Forth cannot ignore the outside world and, at least in > this case, it is irrefutably common practice to use scanning for > this type of process. I take your point, but it's hardly the job of the committee to determine what language Forthers use. We could change to match commonplace CS use, at the cost of being incosistent with all the Forth literature. That's not a good trade-off. > Forth seems to be unique in using the term parsing instead of > scanning. That's true, but Forth uses its own terminology for many things, and I don't think we should break with forty years of usage. Vive la difference! Andrew.
[toc] | [prev] | [next] | [standalone]
| From | Ian van Breda <igvb@btopenworld.com> |
|---|---|
| Date | 2013-08-06 09:24 +0100 |
| Message-ID | <CE267166.3731%igvb@btopenworld.com> |
| In reply to | #24936 |
Andrew Haley wrote on 02/08/2013 15:14: > Ian van Breda <igvb@btopenworld.com> wrote: > >> The universally accepted term for extracting language tokens from >> the input source code, as elsewhere in computing, is 'scanning'. In >> Forth this is done by extracting words delimited by spaces, tabs and >> line endings. >> >> This suggests that the vast majority of the descriptive text in the >> proposed Standard needs to be changed from variants of 'parse' to >> their 'scan' equivalents > > No, it doesn't. The term is defined in Section 2.1, Definitions of > terms, and is consistent with historic Forth usage. There are many > terms used in Forth which are not consistent with the rest of computer > science: "word" is one such. Irrespective of what the standard may say, my dictionary describes parsing as: 'Describe (word) grammatically, stating inflexion, relation to sentence, etc.; resolve (sentence) into component parts of speech and describe them grammatically'. 'Word' does indeed have many meanings in common use, so is not relevant in this context. >> The argument that it has always been done this way is surely not >> valid in the face of the fundamental usage of the term, parsing, >> elsewhere: Forth cannot ignore the outside world and, at least in >> this case, it is irrefutably common practice to use scanning for >> this type of process. > > I take your point, but it's hardly the job of the committee to > determine what language Forthers use. We could change to match > commonplace CS use, at the cost of being incosistent with all the > Forth literature. That's not a good trade-off. > >> Forth seems to be unique in using the term parsing instead of >> scanning. > > That's true, but Forth uses its own terminology for many things, and I > don't think we should break with forty years of usage. Vive la > difference! > Unfortunately, the term 'parsing', as used in Forth (no grammatical context), conflicts with the term as used in other languages and also in the theory of natural languages. I have a number of books on compiling, all starting with scanning followed by parsing of the grammar as a separate activity: Gries, Fischer et al, Wait et al, Appel, Jensen at al., typically use a BNF form of grammar, or something similar. This is the only conflict of this type in either the ANS Standard or in the proposed standard that I can see. It was clear to me, in the context of large telescope systems and in the laboratory, that it was necessary to be able accommodate a variety of languages in the Forth environment, particularly, in respect of imaging systems, where there is a variety of libraries and numerical algorithms available. One of the main complaints made by astronomers was that Forth cannot in be easily integrated with other languages. However, there is a problem with *other* languages in that they generally us a single stack for both data and return addresses. Forth has a considerable advantage in using separate data and return stacks, particularly allowing compound definitions to be implemented. However, this is not a problem *if* the other language uses a separate data stack, in which case, we can integrate the two approaches. This is a failing in the *implementation* of other languages. Another advantage of Forth over other operating systems, is in it's use of multi-tasking by its very nature: the original Forth came with multi-tasking as part and parcel of the system, both terminal and background tasks. In this context, it was easy to implement time-slicing. This allows the system to respond to interrupts very efficiently for both low level and high level events. By using a task-specific user-table means that the response to interrupts is as good as it gets. The problem here was to build a compiler for other languages which could be integrated into Forth. The cornerstones in generating such a compiler are scanning and parsing. For the first part, scanning, is particularly easy in Forth, where the source text is done (mostly) by splitting it into 'words', but in languages such as C or Pascal, a rather more complicated process is needed, e.g. x=y+z; will need to be separated into six tokens, but all juxtaposed in this case. The second part is to treat the grammar as a BNF file that can be simply INCLUDEd. Each 'production' is similar to a series of named colon-style definitions, one for each type of 'nonterminal' (left-hand side). It is a bit more complicated than a series of colon definitions in that there may have more that one 'production' for a given definition and two or more productions may begin with the the same phrase. This can be used for to generate parse tables in Forth for an given language that uses a BNF-style of grammar. The resulting tables can both be used in Forth itself or on any other platform and can be used to generates compilers for different languages. The result is both LL and LR compatible - usually a simplified version of LR compatibility is used but Forth generates fully LR compatible tables. The problem comes with the use of parsing instead of scanning where the Forth standard meets the other languages head-on. Of course, it sounds better if you use 'parsing' instead of the rather more mundane 'scanning'.
[toc] | [prev] | [next] | [standalone]
| From | anton@mips.complang.tuwien.ac.at (Anton Ertl) |
|---|---|
| Date | 2013-08-02 15:25 +0000 |
| Message-ID | <2013Aug2.172531@mips.complang.tuwien.ac.at> |
| In reply to | #24933 |
Ian van Breda <igvb@btopenworld.com> writes:
>The argument that it has always been done this way is surely not valid
>in the face of the fundamental usage of the term, parsing, elsewhere:
>Forth cannot ignore the outside world and, at least in this case, it is
>irrefutably common practice to use scanning for this type of process.
>Forth seems to be unique in using the term parsing instead of scanning.
>
>This problem is highlighted when we try to use Forth to generate parsers
>for other languages, for which Forth works extremely well.
I have written (and documented) a parser generator in Forth
<http://www.complang.tuwien.ac.at/forth/gray.zip>, and did not
encounter any problems. Also, the feedback I got back from users did
not show any confusion between compiler terminology and Forth
terminology.
More generally, I don't remember any confusion between Forth
terminology with other terminology on scanning/parsing in other
contexts (and parsing words have been discussed repeatedly).
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2013: http://www.euroforth.org/ef13/
[toc] | [prev] | [next] | [standalone]
| From | albert@spenarnc.xs4all.nl (Albert van der Horst) |
|---|---|
| Date | 2013-08-06 10:28 +0000 |
| Message-ID | <5200cfb4$0$612$e4fe514c@dreader34.news.xs4all.nl> |
| In reply to | #24933 |
In article <CE216AB0.36B5%igvb@btopenworld.com>, Ian van Breda <igvb@btopenworld.com> wrote: >The terms 'parse' and 'parsing' are widely use in the documentation on >both in the original ANS Standard of 1994, [1], and in the draft >Proposal, [2]. > >However, the term 'parsing', in both the treatment of natural language >and the theory of computer languages, implies analysis of the structure >of a sentence or language construct. In computing it refers to checking >that the tokens extracted from the source code of a program follow the >grammatical rules of the language [3] and [4]. There are many examples >in the literature. According to this definition there is no parsing in Forth. Blame us for using the term parsing for the nearest thing. >Ian van Breda Groetjes Albert > > -- Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- being exponential -- ultimately falters. albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
[toc] | [prev] | [next] | [standalone]
| From | Ian van Breda <igvb@btopenworld.com> |
|---|---|
| Date | 2013-08-06 14:31 +0100 |
| Message-ID | <CE26B94B.3874%igvb@btopenworld.com> |
| In reply to | #25017 |
in article Albert van der Horst at wrote on 06/08/2013 11:28: > In article <CE216AB0.36B5%igvb@btopenworld.com>, > Ian van Breda <igvb@btopenworld.com> wrote: >> The terms 'parse' and 'parsing' are widely use in the documentation on >> both in the original ANS Standard of 1994, [1], and in the draft >> Proposal, [2]. >> >> However, the term 'parsing', in both the treatment of natural language >> and the theory of computer languages, implies analysis of the structure >> of a sentence or language construct. In computing it refers to checking >> that the tokens extracted from the source code of a program follow the >> grammatical rules of the language [3] and [4]. There are many examples >> in the literature. > > According to this definition there is no parsing in Forth. > Blame us for using the term parsing for the nearest thing. >> >> It would have been less confusing if the standards committee had used scanning instead of parsing.
[toc] | [prev] | [next] | [standalone]
| From | albert@spenarnc.xs4all.nl (Albert van der Horst) |
|---|---|
| Date | 2013-08-06 17:47 +0000 |
| Message-ID | <520136b5$0$3184$e4fe514c@dreader36.news.xs4all.nl> |
| In reply to | #25030 |
In article <CE26B94B.3874%igvb@btopenworld.com>, Ian van Breda <igvb@btopenworld.com> wrote: >in article Albert van der Horst at wrote on 06/08/2013 11:28: > >> In article <CE216AB0.36B5%igvb@btopenworld.com>, >> Ian van Breda <igvb@btopenworld.com> wrote: >>> The terms 'parse' and 'parsing' are widely use in the documentation on >>> both in the original ANS Standard of 1994, [1], and in the draft >>> Proposal, [2]. >>> >>> However, the term 'parsing', in both the treatment of natural language >>> and the theory of computer languages, implies analysis of the structure >>> of a sentence or language construct. In computing it refers to checking >>> that the tokens extracted from the source code of a program follow the >>> grammatical rules of the language [3] and [4]. There are many examples >>> in the literature. >> >> According to this definition there is no parsing in Forth. >> Blame us for using the term parsing for the nearest thing. >>> >>> >It would have been less confusing if the standards committee had used >scanning instead of parsing. I doubt it. The standards committee used decennia-old Forth terminology. I've redefined a non-quite->IN to PP (parse-pointer) in yourforth. It points to the next character in the input stream. Should that be named SP (scan-pointer)? Groetjes Albert -- Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- being exponential -- ultimately falters. albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
[toc] | [prev] | [next] | [standalone]
| From | Hans Bezemer <the.beez.speaks@gmail.com> |
|---|---|
| Date | 2013-08-11 19:20 +0200 |
| Message-ID | <5207c6fb$0$15869$e4fe514c@news2.news.xs4all.nl> |
| In reply to | #24933 |
Ian van Breda wrote: > The universally accepted term for extracting language tokens from the > input source code, as elsewhere in computing, is 'scanning'. In Forth > this is done by extracting words delimited by spaces, tabs and line > endings. I don't agree: I know the process as "tokenizing" (see: http://en.wikipedia.org/wiki/Tokenization). I know that very well, because tokenizing is a separate process in my 4tH compiler (it not only eliminates "double" delimiters and the various whitespaces, it also takes care of strings and such) which also classifies tokens to type while doing it. After that, if parses the whole bunch AND compiles it at the same time by simply executing its compile time behavior. In Forth similar stuff is done: a whitespace is scanned and then words are executed - which IMHO is "analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar". In as much Forth HAS a formal grammer. E.g. IF is parsed by scanning a whitespace, executing its compile time behavior and compiling a "0-BRANCH" instruction. Many things are happening at the same time, which causes all this confusion. Calling it "scanning" is as much right or wrong as calling it "parsing" (in its classic meaning), "execution" or "compilation". Note we also say we "compile" a Forth source while true (wordwise) compilation is just a small part of the process. So, in short: don't try to be a purist when the whole vocabulary concerning the process is blurred already. You won't fix anything, at best you're confusing things even more. Hans Bezemer
[toc] | [prev] | [next] | [standalone]
| From | Ian van Breda <igvb@btopenworld.com> |
|---|---|
| Date | 2013-08-15 11:07 +0100 |
| Message-ID | <CE3266E0.396C%igvb@btopenworld.com> |
| In reply to | #25103 |
Hans Bezemer wrote on 11/08/2013 18:20:
> Ian van Breda wrote:
>> The universally accepted term for extracting language tokens from the
>> input source code, as elsewhere in computing, is 'scanning'. In Forth
>> this is done by extracting words delimited by spaces, tabs and line
>> endings.
> I don't agree: I know the process as "tokenizing" (see:
> http://en.wikipedia.org/wiki/Tokenization). I know that very well, because
> tokenizing is a separate process in my 4tH compiler (it not only
> eliminates "double" delimiters and the various whitespaces, it also takes
> care of strings and such) which also classifies tokens to type while doing
> it.
>
> After that, if parses the whole bunch AND compiles it at the same time by
> simply executing its compile time behavior.
>
> In Forth similar stuff is done: a whitespace is scanned and then words are
> executed - which IMHO is "analysing a string of symbols, either in natural
> language or in computer languages, according to the rules of a formal
> grammar". In as much Forth HAS a formal grammer.
>
> E.g. IF is parsed by scanning a whitespace, executing its compile time
> behavior and compiling a "0-BRANCH" instruction. Many things are happening
> at the same time, which causes all this confusion.
>
> Calling it "scanning" is as much right or wrong as calling it "parsing" (in
> its classic meaning), "execution" or "compilation".
>
> Note we also say we "compile" a Forth source while true (wordwise)
> compilation is just a small part of the process.
>
> So, in short: don't try to be a purist when the whole vocabulary concerning
> the process is blurred already. You won't fix anything, at best you're
> confusing things even more.
>
> Hans Bezemer
>
Wikipedia indeed refers to tokenisation but it is more commonly known as
scanning, see books by Gries, Fischer et al, Waite et al. Appel refers to
Lexical Analysis which amounts to the same thing.
All of the above refer to parsing of the formal grammar. See, for example,
Fischer et al, 'Crafting a Compiler', Chapter 3 'Theory and Practice of
Scanning' and Chapter 4 'Formal Grammars and Parsing', which separate the
scanning/tokenisation from the parsing process.
Forth does not have a formal grammar (e.g. BNF) in the sense that C and
Pascal do. However, we still need to use, e.g., IF ... ELSE ... THEN, which
must be complete within a single definition. Many Forth's include some form
of syntax checking, to ensure that the IF/ELSE/THEN is complete and also to
check that the return stack is balanced. For example, we can use
IF ... >R ELSE ... >R THEN ... R> and still balance the return stack.
But this follows both the scanning/tokenisation and parsing processes.
IF and '0-Brach' are something of a red herring in this context, in that
they come later the compiling process. In particular, it comes under the
general heading of code generation.
It is this lack of a formal grammar (in the BNF sense) that gives Forth its
exceptional power. For example, a Pascal LL/LR parser can be built on the
BNF formal grammar by simply INCLUDEing a file containing the 'productions'
of the grammar. e.g.
<program_parameter_list> -> '(' <file_variable_list> ')' ;
where *all* of the words are simply Forth definitions held in the
dictionary, a unique feature of Forth. This can be used to build a series
if parse tables for a variety of languages that are independent of the
platform on which the compiler is run.
In using the terms 'scanning' and, particularly, 'parsing', I am using the
terms as used elsewhere, both in compiler technology and in the theory of
natural languages. The definition is also as described in The Oxford
Illustrated Dictionary and certainly suffers no 'blurring' or confusion in
its meaning.
This is nothing to do with being a 'purist'. I am not 'trying to fix'
anything either. Instead, I am seeking a unification and clarification
between Forth and other languages and for which the meaning is very clear
and unequivocal. Of course, if you think that this is a waste of time, then
we shall just have to disagree.
Ian van Breda
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.forth
csiph-web