Groups > comp.lang.c > #120416 > unrolled thread

Lexical Elements

Started by	David Kleinecke <dkleinecke@gmail.com>
First post	2017-09-27 19:03 -0700
Last post	2017-09-29 01:19 +0100
Articles	14 on this page of 34 — 9 participants

  Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 19:03 -0700
    Re: Lexical Elements "Pascal J. Bourguignon" <pjb@informatimago.com> - 2017-09-28 05:33 +0200
      Re: Lexical Elements James Kuyper <jameskuyper@verizon.net> - 2017-09-28 00:19 -0400
        Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 22:09 -0700
          Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 08:31 -0700
            Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 11:53 -0700
              Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 12:16 -0700
                Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 15:51 -0700
                  Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 16:42 -0700
              Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 12:37 -0700
                Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:16 -0700
                  Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 18:39 -0700
                    Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 19:47 -0700
                      Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 20:29 -0700
                        Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 22:36 -0700
                      Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-29 08:47 -0700
                        Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-29 11:23 -0700
                      Re: Lexical Elements Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-09-29 18:27 +0100
          Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 09:13 -0700
      Re: Lexical Elements Richard Damon <Richard@Damon-Family.org> - 2017-09-28 08:15 -0400
    Re: Lexical Elements jameskuyper@verizon.net - 2017-09-27 21:03 -0700
      Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 22:16 -0700
        Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 09:45 -0700
          Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 11:58 -0700
            Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 12:29 -0700
              Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 15:52 -0700
                Re: Lexical Elements Joe Pfeiffer <pfeiffer@cs.nmsu.edu> - 2017-09-28 17:40 -0600
                Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 16:54 -0700
            Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 12:40 -0700
              Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:12 -0700
            Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-28 21:04 +0100
              Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-28 22:12 +0100
              Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:15 -0700
                Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-29 01:19 +0100

#120419

From	jameskuyper@verizon.net
Date	2017-09-27 21:03 -0700
Message-ID	<4e3d4467-fd5b-42f1-9e6a-335cf1ce88bb@googlegroups.com>
In reply to	#120416

On Wednesday, September 27, 2017 at 10:03:34 PM UTC-4, David Kleinecke wrote:
> I am having trouble reading the standard on this one small
> point. I will quote the C89 standard but I have checked the
> C11 standard and it seems to have exactly the same problem.
> 
> The question is - exactly what does the standard mean by
> "lexical form" and "lexical elements". There is no
> definition within the standard and (apart from a forward
> reference in 5.1.1.2) they only appear in 6.4 (6.1
> in C89).
> 
> There are expressions like "Each preprocessing token that is
> converted to a token shall have the lexical form of a
> keyword, an identifier, ...". This says each pp-token has
> something called a "lexical form" which seems to have five
> values (six in C89). The pp-tokens in my preprocessor are
> unsigned ints (and are unchanged as tokens).

Each preprocessing token corresponds to and is defined in terms of a sequence of
source code characters; the same is true of a token. Section 6.4 describes which
sequences of source code characters qualify as each type of preprocessing token
or token. I could understand using unsigned ints to represent the punctuators
and character constants, but I don't see how that's a reasonable way to
represent any of the other kinds of preprocessing tokens. What unsigned integer
do you use to represent the identifier department_sales_per_month? There's a lot
of different possible identifiers, more than can easily be represented as distinct
unsigned integer values on most platforms. The same is true of string literals.
I'd expect most of the types of preprocessing tokens to be represented
internally by strings of source code characters, at least until translation
phase 7, when they can finally be parsed and converted into other forms.

> I observe that 5.1.1.2 says "decomposed". Perhaps - it
> seems - the lexical form is the character string that
> originated the token (minus the quotes on strings and
> characters). Is this reading correct?

I don't believe so. The lexical form of a string literal includes the quote
characters, as well as the prefix, if any (see 6.4.5p1). Similar comments apply
to character constants. You might be able to get away with considering those
things to be implied by the fact that they have been parsed as string literals
or character constants, so you don't actually have to store the single or double
quotes. However, when it says "lexical form", it's referring to the entire
sequence of source code characters that make up a given token, even the ones
that you choose not to store explicitly.
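
A minimal sketch of what "the whole spelling" means for a string literal, assuming the C11 encoding prefixes (u8, u, U, L). The function name has_string_literal_form is hypothetical, and escape handling is simplified to "a backslash protects the next character":

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if s has the overall shape of a string
   literal, i.e. optional encoding prefix, opening quote, body, closing
   quote. The prefix and both quotes are part of the lexical form. */
int has_string_literal_form(const char *s)
{
    static const char *const prefixes[] = { "u8", "u", "U", "L" };

    /* Accept a prefix only when it is immediately followed by a quote. */
    for (size_t i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
        size_t n = strlen(prefixes[i]);
        if (strncmp(s, prefixes[i], n) == 0 && s[n] == '"') {
            s += n;
            break;
        }
    }
    if (*s++ != '"')
        return 0;

    /* Scan the body; a backslash protects the character after it. */
    while (*s != '\0' && *s != '"') {
        if (*s == '\\' && s[1] != '\0')
            s++;
        s++;
    }
    return s[0] == '"' && s[1] == '\0';
}
```

A lexer built this way could still store only the body and a prefix tag, as long as it can reproduce the full spelling when required, for example for the # operator.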

> If so the "shall" that every pp-token that becomes a token
> must have one of the approved shapes leaves us with possibly
> many pp-tokens that don't become tokens. I assume that
> what happens to them is more undefined behavior.

A "shall" indicates undefined behavior only if it does not occur in a
constraints section (4p2). That "shall" occurs in 6.4p2, which is a constraints
section. Therefore, preprocessing tokens which violate this rule are constraint
violations. At least one diagnostic is required for any program that contains
such a violation. As with all constraint violations, if you wish, your compiler
can reject such code. If it chooses to continue translation, and if the user
chooses to execute the resulting program despite having received the diagnostic,
the behavior is indeed undefined - but first there must be a diagnostic.
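
Note that the constraint only covers preprocessing tokens that are actually converted to tokens. A minimal sketch of one that never is: 1.2.3 matches the pp-number grammar but is not a valid constant, and here the # operator consumes it in phase 4. STR and the other names are hypothetical, chosen for illustration:

```c
#include <string.h>

/* STR is a hypothetical helper macro for this illustration. */
#define STR(x) #x

/* 1.2.3 is a single pp-number. The # operator turns its spelling into a
   string literal during phase 4, so it is never converted to a token in
   phase 7 and the 6.4p2 constraint does not apply to it. */
static const char odd_number[] = STR(1.2.3);

/* Returns 1 when the stringized spelling is exactly "1.2.3". */
int odd_number_survives(void)
{
    return strcmp(odd_number, "1.2.3") == 0;
}
```

Writing 1.2.3 directly in an expression, by contrast, would be converted in phase 7 and would require a diagnostic.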

#120424

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-27 22:16 -0700
Message-ID	<027ced74-afdc-4db1-aa14-7b3fb7a22295@googlegroups.com>
In reply to	#120419

On Wednesday, September 27, 2017 at 9:03:42 PM UTC-7, james...@verizon.net wrote:
> On Wednesday, September 27, 2017 at 10:03:34 PM UTC-4, David Kleinecke wrote:
> > I am having trouble reading the standard on this one small
> > point. I will quote the C89 standard but I have checked the
> > C11 standard and it seems to have exactly the same problem.
> > 
> > The question is - exactly what does the standard mean by
> > "lexical form" and "lexical elements". There is no
> > definition within the standard and (apart from a forward
> > reference in 5.1.1.2) they only appear in 6.4 (6.1
> > in C89).
> > 
> > There are expressions like "Each preprocessing token that is
> > converted to a token shall have the lexical form of a
> > keyword, an identifier, ...". This says each pp-token has
> > something called a "lexical form" which seems to have five
> > values (six in C89). The pp-tokens in my preprocessor are
> > unsigned ints (and are unchanged as tokens).
> 
> Each preprocessing token corresponds to and is defined in terms of a sequence of
> source code characters; the same is true of a token. Section 6.4 describes which
> sequences of source code characters qualify as each type of preprocessing token
> or token. I could understand using unsigned ints to represent the punctuators
> and character constants, but I don't see how that's a reasonable way to
> represent any of the other kinds of preprocessing tokens. What unsigned integer
> do you use to represent the identifier department_sales_per_month? There's a lot
> of different possible identifiers, more than can easily be represented as distinct
> unsigned integer values on most platforms. The same is true of string literals.

There are around one hundred pre-defined tokens which I
number from 1 to whatever. Then each new identifier, number,
string or character gets the next token number. The tokens
are indexes into an array of token data records (currently
a union of several different structs).

I admit to being surprised anyone would use any different
approach although instead of indexes I might have used
data pointers.
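
A minimal sketch of the numbering scheme described above, assuming a fixed-size table and a linear search. All names (tok_kind, tok_record, tok_intern, tok_spelling) are hypothetical, and a plain struct stands in for the union of structs to keep the example short:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; not taken from any real compiler. */
enum tok_kind { TK_PREDEFINED, TK_IDENTIFIER, TK_NUMBER, TK_STRING, TK_CHARACTER };

struct tok_record {
    enum tok_kind kind;
    char *spelling;              /* source characters that produced the token */
};

#define MAX_TOKENS 4096u

static struct tok_record tok_table[MAX_TOKENS];
static unsigned tok_count = 1;   /* index 0 is reserved to mean "no token" */

/* Return the index for this kind and spelling, adding a record if needed.
   Returns 0 if the table is full or memory runs out. */
unsigned tok_intern(enum tok_kind kind, const char *spelling)
{
    for (unsigned i = 1; i < tok_count; i++) {
        if (tok_table[i].kind == kind &&
            strcmp(tok_table[i].spelling, spelling) == 0)
            return i;
    }
    if (tok_count >= MAX_TOKENS)
        return 0;

    size_t n = strlen(spelling) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return 0;
    memcpy(copy, spelling, n);

    tok_table[tok_count].kind = kind;
    tok_table[tok_count].spelling = copy;
    return tok_count++;
}

/* Look up the stored spelling; NULL for an invalid index. */
const char *tok_spelling(unsigned index)
{
    return (index > 0 && index < tok_count) ? tok_table[index].spelling : NULL;
}
```

In this sketch the unsigned int is only a handle; the characters of the identifier still live in the record it points at.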

> I'd expect most of the types of preprocessing tokens to be represented
> internally by strings of source code characters, at least until translation
> phase 7, when they can finally be parsed and converted into other forms.
> 
> > I observe that 5.1.1.2 says "decomposed". Perhaps - it
> > seems - the lexical form is the character string that
> > originated the token (minus the quotes on strings and
> > characters). Is this reading correct?
> 
> I don't believe so. The lexical form of a string literal includes the quote
> characters, as well as the prefix, if any (see 6.4.5p1). Similar comments apply
> to character constants. You might be able to get away with considering those
> things to be implied by the fact that they have been parsed as string literals
> or character constants, so you don't actually have to store the single or double
> quotes. However, when it says "lexical form", it's referring to the entire
> sequence of source code characters that make up a given token, even the ones
> that you choose not to store explicitly.
> 
> > If so the "shall" that every pp-token that becomes a token
> > must have one of the approved shapes leaves us with possibly
> > many pp-tokens that don't become tokens. I assume that
> > what happens to them is more undefined behavior.
> 
> A "shall" indicates undefined behavior only if it does not occur in a
> constraints section (4p2). That "shall" occurs in 6.4p2, which is a constraints
> section. Therefore, preprocessing tokens which violate this rule are constraint
> violations. At least one diagnostic is required for any program that contains
> such a violation. As with all constraint violations, if you wish, your compiler
> can reject such code. If it chooses to continue translation, and if the user
> chooses to execute the resulting program despite having received the diagnostic,
> the behavior is indeed undefined - but first there must be a diagnostic.

#120457

From	jameskuyper@verizon.net
Date	2017-09-28 09:45 -0700
Message-ID	<5e681899-3b70-44c5-b919-5dccadf62672@googlegroups.com>
In reply to	#120424

On Thursday, September 28, 2017 at 1:17:07 AM UTC-4, David Kleinecke wrote:
> On Wednesday, September 27, 2017 at 9:03:42 PM UTC-7, james...@verizon.net wrote:
> > On Wednesday, September 27, 2017 at 10:03:34 PM UTC-4, David Kleinecke wrote:
> > > I am having trouble reading the standard on this one small
> > > point. I will quote the C89 standard but I have checked the
> > > C11 standard and it seems to have exactly the same problem.
> > > 
> > > The question is - exactly what does the standard mean by
> > > "lexical form" and "lexical elements". There is no
> > > definition within the standard and (apart from a forward
> > > reference in 5.1.1.2) they only appear in 6.4 (6.1
> > > in C89).
> > > 
> > > There are expressions like "Each preprocessing token that is
> > > converted to a token shall have the lexical form of a
> > > keyword, an identifier, ...". This says each pp-token has
> > > something called a "lexical form" which seems to have five
> > > values (six in C89). The pp-tokens in my preprocessor are
> > > unsigned ints (and are unchanged as tokens).
> > 
> > Each preprocessing token corresponds to and is defined in terms of a sequence of
> > source code characters; the same is true of a token. Section 6.4 describes which
> > sequences of source code characters qualify as each type of preprocessing token
> > or token. I could understand using unsigned ints to represent the punctuators
> > and character constants, but I don't see how that's a reasonable way to
> > represent any of the other kinds of preprocessing tokens. What unsigned integer
> > do you use to represent the identifier department_sales_per_month? There's a lot
> > of different possible identifiers, more than can easily be represented as distinct
> > unsigned integer values on most platforms. The same is true of string literals.
> 
> There are around one hundred pre-defined tokens which I
> number from 1 to whatever. ...

Those presumably correspond to keywords and punctuators.

> ... Then each new identifier, number,
> string or character gets the next token number. The tokens
> are indexes into an array of token data records (currently
> a union of several different structs).
> 
> I admit to being surprised anyone would use any different
> approach although instead of indexes I might have used
> data pointers.

I would probably use a similar data structure, but I would describe it
differently: your token data records are (presumably) where you store the actual
lexical form of the token, your indices just identify which token data record
the token is stored in.

For each identifier, you need to retain the full sequence of characters that
make up the identifier, so your compiler can recognize the fact that a later
occurrence of the same identifier is in fact an occurrence of the same
identifier. It needn't be the exact same sequence that appeared in the source
code: you need to recognize members of the extended character set as matches to
corresponding UCNs, and you should match 8-digit UCNs which start with 0000 to
the corresponding 4-digit UCNs, which implies that what you actually store
should be a normalized version of the identifier.
You can only dispose of that information once you've applied all of C's rules
about scopes, linkage, and name spaces of identifiers, which don't apply until
translation phase 7. Even then, you must retain such information for
identifiers with external linkage until linkage has been resolved, in
translation phase 8.
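
A minimal sketch of one such normalization, assuming identifiers are stored as plain char strings with UCNs still spelled out. It rewrites \U0000XXXX as \uXXXX and uppercases the hex digits of 4-digit UCNs; mapping extended source characters to UCNs is not attempted here. The names hex_run and normalize_ucns are hypothetical:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if p points at n hexadecimal digits. Stops at the first
   non-digit, so it never reads past the terminating null. */
static int hex_run(const char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!isxdigit((unsigned char)p[i]))
            return 0;
    return 1;
}

/* Hypothetical helper: writes a normalized spelling of src into dst.
   dst must hold at least strlen(src) + 1 bytes. */
void normalize_ucns(char *dst, const char *src)
{
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == 'U' && hex_run(src + 2, 8) &&
            strncmp(src + 2, "0000", 4) == 0) {
            /* 8-digit UCN with a 0000 prefix: emit the 4-digit form. */
            *dst++ = '\\';
            *dst++ = 'u';
            src += 6;                       /* skip \U0000 */
            for (int i = 0; i < 4; i++)
                *dst++ = (char)toupper((unsigned char)*src++);
        } else if (src[0] == '\\' && src[1] == 'u' && hex_run(src + 2, 4)) {
            /* 4-digit UCN: keep it, but make the digit case uniform. */
            *dst++ = *src++;
            *dst++ = *src++;
            for (int i = 0; i < 4; i++)
                *dst++ = (char)toupper((unsigned char)*src++);
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

After normalization, two spellings of the same identifier compare equal with strcmp, which is what a symbol table lookup needs.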

Similarly, you have to retain the complete list of source code characters
corresponding to each string literal or character constant until translation
phase 5, where they get converted into corresponding characters in the execution
character set - which must then be stored somewhere in the translated program,
unless the string turns out to be unused and therefore discardable. For most
string literals, you can never completely discard the string of characters.

#120477

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 11:58 -0700
Message-ID	<3c9e515f-7fa2-43e0-8b0d-749586b74554@googlegroups.com>
In reply to	#120457

On Thursday, September 28, 2017 at 9:45:17 AM UTC-7, james...@verizon.net wrote:
> On Thursday, September 28, 2017 at 1:17:07 AM UTC-4, David Kleinecke wrote:
> > On Wednesday, September 27, 2017 at 9:03:42 PM UTC-7, james...@verizon.net wrote:
> > > On Wednesday, September 27, 2017 at 10:03:34 PM UTC-4, David Kleinecke wrote:
> > > > I am having trouble reading the standard on this one small
> > > > point. I will quote the C89 standard but I have checked the
> > > > C11 standard and it seems to have exactly the same problem.
> > > > 
> > > > The question is - exactly what does the standard mean by
> > > > "lexical form" and "lexical elements". There is no
> > > > definition within the standard and (apart from a forward
> > > > reference in 5.1.1.2) they only appear in 6.4 (6.1
> > > > in C89).
> > > > 
> > > > There are expressions like "Each preprocessing token that is
> > > > converted to a token shall have the lexical form of a
> > > > keyword, an identifier, ...". This says each pp-token has
> > > > something called a "lexical form" which seems to have five
> > > > values (six in C89). The pp-tokens in my preprocessor are
> > > > unsigned ints (and are unchanged as tokens).
> > > 
> > > Each preprocessing token corresponds to and is defined in terms of a sequence of
> > > source code characters; the same is true of a token. Section 6.4 describes which
> > > sequences of source code characters qualify as each type of preprocessing token
> > > or token. I could understand using unsigned ints to represent the punctuators
> > > and character constants, but I don't see how that's a reasonable way to
> > > represent any of the other kinds of preprocessing tokens. What unsigned integer
> > > do you use to represent the identifier department_sales_per_month? There's a lot
> > > of different possible identifiers, more than can easily be represented as distinct
> > > unsigned integer values on most platforms. The same is true of string literals.
> > 
> > There are around one hundred pre-defined tokens which I
> > number from 1 to whatever. ...
> 
> Those presumably correspond to keywords and punctuators.
> 
> > ... Then each new identifier, number,
> > string or character gets the next token number. The tokens
> > are indexes into an array of token data records (currently
> > a union of several different structs).
> > 
> > I admit to being surprised anyone would use any different
> > approach although instead of indexes I might have used
> > data pointers.
> 
> I would probably use a similar data structure, but I would describe it
> differently: your token data records are (presumably) where you store the actual
> lexical form of the token, your indices just identify which token data record
> the token is stored in.
> 
> For each identifier, you need to retain the full sequence of characters that
> make up the identifier, so your compiler can recognize the fact that a later
> occurrence of the same identifier is in fact an occurrence of the same
> identifier. It needn't be the exact same sequence that appeared in the source
> code: you need to recognize members of the extended character set as matches to
> corresponding UCNs, and you should match 8-digit UCNs which start with 0000 to
> the corresponding 4-digit UCNs, which implies that what you actually store
> should be a normalized version of the identifier.
> You can only dispose of that information once you've applied all of C's rules
> about scopes, linkage, and name spaces of identifiers, which don't apply until
> translation phase 7. Even then, you must retain such information for
> identifiers with external linkage until linkage has been resolved, in
> translation phase 8.
> 
> Similarly, you have to retain the complete list of source code characters
> corresponding to each string literal or character constant until translation
> phase 5, where they get converted into corresponding characters in the execution
> character set - which must then be stored somewhere in the translated program,
> unless the string turns out to be unused and therefore discardable. For most
> string literals, you can never completely discard the string of characters.

Everything you say sounds right. But my question is ontological.
What exactly is THE token?

#120481

From	jameskuyper@verizon.net
Date	2017-09-28 12:29 -0700
Message-ID	<8c07cb3a-242f-4f5c-9430-522ee20b888c@googlegroups.com>
In reply to	#120477

On Thursday, September 28, 2017 at 2:58:45 PM UTC-4, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 9:45:17 AM UTC-7, james...@verizon.net wrote:
> > On Thursday, September 28, 2017 at 1:17:07 AM UTC-4, David Kleinecke wrote:
> > > On Wednesday, September 27, 2017 at 9:03:42 PM UTC-7, james...@verizon.net wrote:
...
> > > > Each preprocessing token corresponds to and is defined in terms of a sequence of
> > > > source code characters; the same is true of a token. Section 6.4 describes which
> > > > sequences of source code characters qualify as each type of preprocessing token
> > > > or token. I could understand using unsigned ints to represent the punctuators
> > > > and character constants, but I don't see how that's a reasonable way to
> > > > represent any of the other kinds of preprocessing tokens. What unsigned integer
> > > > do you use to represent the identifier department_sales_per_month? There's a lot
> > > > of different possible identifiers, more than can easily be represented as distinct
> > > > unsigned integer values on most platforms. The same is true of string literals.
> > > 
> > > There are around one hundred pre-defined tokens which I
> > > number from 1 to whatever. ...
> > 
> > Those presumably correspond to keywords and punctuators.
> > 
> > > ... Then each new identifier, number,
> > > string or character gets the next token number. The tokens
> > > are indexes into an array of token data records (currently
> > > a union of several different structs).
> > > 
> > > I admit to being surprised anyone would use any different
> > > approach although instead of indexes I might have used
> > > data pointers.
> > 
> > I would probably use a similar data structure, but I would describe it
> > differently: your token data records are (presumably) where you store the actual
> > lexical form of the token, your indices just identify which token data record
> > the token is stored in.
> > 
> > For each identifier, you need to retain the full sequence of characters that
> > make up the identifier, so your compiler can recognize the fact that a later
> > occurrence of the same identifier is in fact an occurrence of the same
> > identifier. It needn't be the exact same sequence that appeared in the source
> > code: you need to recognize members of the extended character set as matches to
> > corresponding UCNs, and you should match 8-digit UCNs which start with 0000 to
> > the corresponding 4-digit UCNs, which implies that what you actually store
> > should be a normalized version of the identifier.
> > You can only dispose of that information once you've applied all of C's rules
> > about scopes, linkage, and name spaces of identifiers, which don't apply until
> > translation phase 7. Even then, you must retain such information for
> > identifiers with external linkage until linkage has been resolved, in
> > translation phase 8.
> > 
> > Similarly, you have to retain the complete list of source code characters
> > corresponding to each string literal or character constant until translation
> > phase 5, where they get converted into corresponding characters in the execution
> > character set - which must then be stored somewhere in the translated program,
> > unless the string turns out to be unused and therefore discardable. For most
> > string literals, you can never completely discard the string of characters.
> 
> Everything you say sounds right. But my question is ontological.
> What exactly is THE token?

If what I said doesn't constitute a sufficient answer to your question, then I
don't understand your question. Maybe if you concentrate on practical
consequences of your ontological question, I could help. Give me a practical
question about how your compiler should do something, whose answer depends upon
the answer to that ontological question. It's entirely possible that I can
answer the practical question, without understanding the connection you see
between that question and the ontological one.
If the ontological question's answer doesn't have any practical consequences,
you really don't need that answer, however much you might like to have it.

#120505

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 15:52 -0700
Message-ID	<022ed8a4-f2b6-48f6-b3e3-23760998ef32@googlegroups.com>
In reply to	#120481

On Thursday, September 28, 2017 at 12:29:41 PM UTC-7, james...@verizon.net wrote:
> On Thursday, September 28, 2017 at 2:58:45 PM UTC-4, David Kleinecke wrote:
> > On Thursday, September 28, 2017 at 9:45:17 AM UTC-7, james...@verizon.net wrote:
> > > On Thursday, September 28, 2017 at 1:17:07 AM UTC-4, David Kleinecke wrote:
> > > > On Wednesday, September 27, 2017 at 9:03:42 PM UTC-7, james...@verizon.net wrote:
> ...
> > > > > Each preprocessing token corresponds to and is defined in terms of a sequence of
> > > > > source code characters; the same is true of a token. Section 6.4 describes which
> > > > > sequences of source code characters qualify as each type of preprocessing token
> > > > > or token. I could understand using unsigned ints to represent the punctuators
> > > > > and character constants, but I don't see how that's a reasonable way to
> > > > > represent any of the other kinds of preprocessing tokens. What unsigned integer
> > > > > do you use to represent the identifier department_sales_per_month? There's a lot
> > > > > of different possible identifiers, more than can easily be represented as distinct
> > > > > unsigned integer values on most platforms. The same is true of string literals.
> > > > 
> > > > There are around one hundred pre-defined tokens which I
> > > > number from 1 to whatever. ...
> > > 
> > > Those presumably correspond to keywords and punctuators.
> > > 
> > > > ... Then each new identifier, number,
> > > > string or character gets the next token number. The tokens
> > > > are indexes into an array of token data records (currently
> > > > a union of several different structs).
> > > > 
> > > > I admit to being surprised anyone would use any different
> > > > approach although instead of indexes I might have used
> > > > data pointers.
> > > 
> > > I would probably use a similar data structure, but I would describe it
> > > differently: your token data records are (presumably) where you store the actual
> > > lexical form of the token, your indices just identify which token data record
> > > the token is stored in.
> > > 
> > > For each identifier, you need to retain the full sequence of characters that
> > > make up the identifier, so your compiler can recognize the fact that a later
> > > occurrence of the same identifier is in fact an occurrence of the same
> > > identifier. It needn't be the exact same sequence that appeared in the source
> > > code: you need to recognize members of the extended character set as matches to
> > > corresponding UCNs, and you should match 8-digit UCNs which start with 0000 to
> > > the corresponding 4-digit UCNs, which implies that what you actually store
> > > should be a normalized version of the identifier.
> > > You can only dispose of that information once you've applied all of C's rules
> > > about scopes, linkage, and name spaces of identifiers, which don't apply until
> > > translation phase 7. Even then, you must retain such information for
> > > identifiers with external linkage until linkage has been resolved, in
> > > translation phase 8.
> > > 
> > > Similarly, you have to retain the complete list of source code characters
> > > corresponding to each string literal or character constant until translation
> > > phase 5, where they get converted into corresponding characters in the execution
> > > character set - which must then be stored somewhere in the translated program,
> > > unless the string turns out to be unused and therefore discardable. For most
> > > string literals, you can never completely discard the string of characters.
> > 
> > Everything you say sounds right. But my question is ontological.
> > What exactly is THE token?
> 
> If what I said doesn't constitute a sufficient answer to your question, then I
> don't understand your question. Maybe if you concentrate on practical
> consequences of your ontological question, I could help. Give me a practical
> question about how your compiler should do something, whose answer depends upon
> the answer to that ontological question. It's entirely possible that I can
> answer the practical question, without understanding the connection you see
> between that question and the ontological one.
> If the ontological question's answer doesn't have any practical consequences,
> you really don't need that answer, however much you might like to have it.

I don't see how I can conform to what the standard expects without
knowing what a token is. What is your definition of a token?

#120511

From	Joe Pfeiffer <pfeiffer@cs.nmsu.edu>
Date	2017-09-28 17:40 -0600
Message-ID	<1bo9pugznx.fsf@pfeifferfamily.net>
In reply to	#120505

David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 12:29:41 PM UTC-7, james...@verizon.net wrote:
>> 
>> If what I said doesn't constitute a sufficient answer to your question, then I
>> don't understand your question. Maybe if you concentrate on practical
>> consequences of your ontological question, I could help. Give me a practical
>> question about how your compiler should do something, whose answer depends upon
>> the answer to that ontological question. It's entirely possible that I can
>> answer the practical question, without understanding the connection you see
>> between that question and the ontological one.
>> If the ontological question's answer doesn't have any practical consequences,
>> you really don't need that answer, however much you might like to have it.
>
> I don't see how I can conform to what the standard expects without
> knowing what a token is. What is your definition of a token?

Operationally, it's what the lexical analyzer hands off to the parser.

#120513

From	jameskuyper@verizon.net
Date	2017-09-28 16:54 -0700
Message-ID	<2b789c99-64e5-4fca-8491-daa67bb49314@googlegroups.com>
In reply to	#120505

On Thursday, September 28, 2017 at 6:53:11 PM UTC-4, David Kleinecke wrote:
...
> I don't see how I can conform to what the standard expects without
> knowing what a token is. What is your definition of a token?

A token is any preprocessing-token whose corresponding string of source code
characters matches, during translation phase 7 (5.1.1.2p7), the grammar
rule specified for a 'token' in section 6.4p1. A preprocessing token, in turn,
is any sequence of source code characters identified during translation phase 3
(5.1.1.2p3) as matching the grammar rule for "preprocessing-token" in section
6.4p1.

By specifying that these rules apply only during specific translation phases,
I'm emphasizing that the processing specified for the other translation phases
affects what qualifies as a preprocessing-token or as a token.

If there's anything you find unclear about that, it would be best if you try to
convert your uncertainty into a practical question about what is permitted or
prohibited with regard to how you implement the standard's rules. Your usual
questions are too abstract to be answered in a way that you'll understand, when
only you know what the words you're using actually mean to you.

#120483

From	Keith Thompson <kst-u@mib.org>
Date	2017-09-28 12:40 -0700
Message-ID	<lnfub6wr0l.fsf@kst-u.example.com>
In reply to	#120477

David Kleinecke <dkleinecke@gmail.com> writes:
[...]
> Everything you say sounds right. But my question is ontological.
> What exactly is THE token?

As far as the language is concerned, a token is a syntactic element of a
C translation unit, a sequence of characters in a source file.  Given:

    int x = 42;

you have 5 tokens, consisting of 3, 1, 1, 2, and 1 characters
respectively.  Tokens are distinguished from non-token character
sequences (like "42;" in the above) syntactically.
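
The count above can be reproduced with a very small scanner. This is only a sketch under narrow assumptions: the function name token_lengths is invented here, and it recognizes just identifiers, runs of decimal digits, and single-character punctuators, which is far less than the real grammar in 6.4.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: stores the length of each token found in `src`
   into `lens` (at most `max` entries) and returns the token count.
   Handles only identifiers/keywords, decimal digit runs, and
   single-character punctuators; white space separates tokens. */
size_t token_lengths(const char *src, size_t *lens, size_t max)
{
    size_t n = 0;

    while (*src != '\0' && n < max) {
        if (isspace((unsigned char)*src)) {
            src++;                      /* white space is not a token */
            continue;
        }
        const char *start = src;
        if (isalpha((unsigned char)*src) || *src == '_') {
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;                  /* identifier or keyword */
        } else if (isdigit((unsigned char)*src)) {
            while (isdigit((unsigned char)*src))
                src++;                  /* decimal digit run */
        } else {
            src++;                      /* one-character punctuator */
        }
        lens[n++] = (size_t)(src - start);
    }
    return n;
}
```

Called on "int x = 42;" it reports five tokens with lengths 3, 1, 1, 2 and 1, and it never produces "42;" as a unit because the digit run ends at the semicolon.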

A compiler will typically have some internal representation of a token.
The details of that representation are unspecified, and generally are
whatever is convenient for use in the compiler.

Does that answer your question?

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

[toc] | [prev] | [next] | [standalone]

#120507

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 16:12 -0700
Message-ID	<7f17072d-c202-4590-8f56-0d003a2e8c39@googlegroups.com>
In reply to	#120483

On Thursday, September 28, 2017 at 12:40:50 PM UTC-7, Keith Thompson wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> [...]
> > Everything you say sounds right. But my question is ontological.
> > What exactly is THE token?
> 
> As far as the language is concerned, a token is a syntactic element of a
> C translation unit, a sequence of characters in a source file.  Given:
> 
>     int x = 42;
> 
> you have 5 tokens, consisting of 3, 1, 1, 2, and 1 characters
> respectively.  Tokens are distinguished from non-token character
> sequences (like "42;" in the above) syntactically.
> 
> A compiler will typically have some internal representation of a token.
> The details of that representation are unspecified, and generally are
> whatever is convenient for use in the compiler.
> 
> Does that answer your question?

I think that is the dominant idea but I see no reason for it.
I would rather use the word "token" to mean the internal
representation which has the overt character string as one
of its attributes. 

Consider the situation where an identifier is re-used within
a block. The new (inner) meaning is different than the old
(outer) meaning but the character strings are the same. I
implement the situation by creating a new token (which is
destroyed when the code leaves the block).
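
A minimal sketch of that idea, not the actual implementation described in this post: each record carries the spelling as one attribute, a redeclaration in an inner block pushes a new record, and leaving the block discards it so the outer one becomes visible again. All names here (struct token, declare, lookup, leave_block) and the fixed-size table are assumptions for illustration. Many compilers would call these records symbol-table entries rather than tokens.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_TOKENS = 64 };

/* Hypothetical record: the character string is just one attribute. */
struct token {
    const char *spelling;   /* overt character string */
    int depth;              /* block nesting level of the declaration */
    int meaning;            /* stand-in for type, storage, and so on */
};

static struct token table[MAX_TOKENS];
static size_t count = 0;

/* Create a new record for `spelling` at the given block depth. */
int declare(const char *spelling, int depth, int meaning)
{
    if (count == MAX_TOKENS)
        return -1;
    table[count].spelling = spelling;
    table[count].depth = depth;
    table[count].meaning = meaning;
    count++;
    return 0;
}

/* Find the innermost live record with this spelling, or NULL. */
const struct token *lookup(const char *spelling)
{
    for (size_t i = count; i > 0; i--)
        if (strcmp(table[i - 1].spelling, spelling) == 0)
            return &table[i - 1];
    return NULL;
}

/* Destroy every record created at `depth` or deeper. */
void leave_block(int depth)
{
    while (count > 0 && table[count - 1].depth >= depth)
        count--;
}
```

Declaring x at depth 0 and again at depth 1 makes lookup return the inner record; after leave_block(1) the same spelling resolves to the outer record again.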

[toc] | [prev] | [next] | [standalone]

#120486

From	bartc <bc@freeuk.com>
Date	2017-09-28 21:04 +0100
Message-ID	<6BczB.1462043$nU3.1408556@fx43.am4>
In reply to	#120477

On 28/09/2017 19:58, David Kleinecke wrote:

> Everything you say sounds right. But my question is ontological.
> What exactly is THE token?

Below is the list of tokens I use in my C project. Each is represented 
by some index.

Up to EOF, each token represents some special symbol or punctuation. 
After that, they are mainly numbers, keywords and identifiers.

Those marked * have an associated value - number, string constant or 
identifier.

Those marked + are split into two or more sub-tokens indicated by a 
further value (different preprocessor directives for example).

These mostly serve both as pre-processor tokens, and tokens used for 
parsing. (C is like that, nothing can be straightforward.)

Don't know if this helps.

----------------------------------------

ERROR
DOT
IDOT   "->"
LEXHASH
HASH
LITHASH
HASHHASH
COMMA
SEMI
COLON
ASSIGN
LBRACK
RBRACK
LSQ
RSQ
LCURLY   "{"
RCURLY   "}"
QUESTION
CURL
ELLIPSIS
ADD
SUB
MUL
DIV
REM
IOR  "|"
IAND
IXOR
ORL  "||"
ANDL
SHL
SHR
INOT
NOTL
INCR
DECR
EQ
NE
LT
LE
GE
GT
ADDTO  "+="
SUBTO
MULTO
DIVTO
REMTO
IORTO
IANDTO
IXORTO
SHLTO
SHRTO
EOL
EOF

RAWNUMBER *
INTCONST *
REALCONST *
CHARCONST *
WCHARCONST *
STRINGCONST *
WSTRINGCONST *
WHITESPACE
PLACEHOLDER

NAME *
SOURCEDIR +
PREDEFMACRO +

TYPESPEC +
IF
ELSE
CASE
DEFAULT
FOR
WHILE
DO
RETURN
BREAK
CONTINUE
GOTO
SWITCH
STRUCT
UNION
LINKAGE +
TYPEQUAL +
FNSPEC
ALIGNAS
ENUM
CALLCONV
SIZEOF
DEFINED
GENERIC
ALIGNOF

----------------------------------------

(The above aren't the actual tokens used in the source, which are more 
like kbreaksym rather than BREAK. As some would be bad choices for names.)

-- 
bartc

[toc] | [prev] | [next] | [standalone]

#120498

From	bartc <bc@freeuk.com>
Date	2017-09-28 22:12 +0100
Message-ID	<hBdzB.1604944$tu4.400471@fx35.am4>
In reply to	#120486

On 28/09/2017 21:04, bartc wrote:
> On 28/09/2017 19:58, David Kleinecke wrote:
> 
>> Everything you say sounds right. But my question is ontological.
>> What exactly is THE token?
> 
> Below is the list of tokens I use in my C project. Each is represented 
> by some index.
...
> GENERIC
> ALIGNOF
> 
> ----------------------------------------
> 
> (The above aren't the actual tokens used in the source, which are more 
> like kbreaksym rather than BREAK. As some would be bad choices for names.)

The tokens as declared in the actual project (not C despite it saying C):

   https://pastebin.com/P5P3gcF1


As for what a token is, each one is more completely represented in my 
implementation by this struct (in C this time):

struct _tokenrec {    // 32 bytes total
     union {
         int64   value;
         double  xvalue;
         uint64  uvalue;
         byte *  svalue;
         strec * symptr;
     };
     struct _tokenrec* nexttoken;
     union {
         struct {
             byte    subcode;
             byte    flags;
         };
         uint16  subcodex;
     };
     byte    symbol;
     byte    fileno;
     int32   lineno;
     int32   length;
     union {
         int32   numberoffset;
         int32   paramno;
         int32   pasteno;
     };
};

The token index is stored in .symbol. How the other members are used 
depends on the token. For simple ones ('left bracket') nothing else is 
needed.
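
For anyone who wants to compile something like the struct above on its own, here is a sketch with the missing pieces filled in by guesswork. The typedefs, the opaque strec, the SYM_INTCONST value, and the helper make_intconst are all assumptions, not taken from the project. The tag is spelled tokenrec here because file-scope identifiers beginning with an underscore are reserved. Anonymous structs and unions need C11 or a compiler extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed definitions for the type names used in the post. */
typedef int64_t       int64;
typedef uint64_t      uint64;
typedef int32_t       int32;
typedef uint16_t      uint16;
typedef unsigned char byte;
typedef struct strec  strec;        /* opaque symbol-table entry */

enum { SYM_INTCONST = 1 };          /* hypothetical token index */

struct tokenrec {
    union {                         /* value attached to the token */
        int64   value;
        double  xvalue;
        uint64  uvalue;
        byte *  svalue;
        strec * symptr;
    };
    struct tokenrec *nexttoken;
    union {
        struct {
            byte subcode;
            byte flags;
        };
        uint16 subcodex;
    };
    byte  symbol;                   /* token index */
    byte  fileno;
    int32 lineno;
    int32 length;
    union {
        int32 numberoffset;
        int32 paramno;
        int32 pasteno;
    };
};

/* Hypothetical helper: build an integer-constant token. */
struct tokenrec make_intconst(int64 value, int32 lineno)
{
    struct tokenrec t = {0};        /* remaining members start as zero */
    t.value  = value;
    t.symbol = SYM_INTCONST;
    t.lineno = lineno;
    return t;
}
```

With 8-byte pointers and the usual alignment rules the members pack into 8 + 8 + 2 + 1 + 1 + 4 + 4 + 4 = 32 bytes with no padding, which matches the comment in the original; other targets may differ.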

-- 
bartc

[toc] | [prev] | [next] | [standalone]

#120508

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 16:15 -0700
Message-ID	<63836496-889f-470c-9257-ee8ad959b93e@googlegroups.com>
In reply to	#120486

On Thursday, September 28, 2017 at 1:04:26 PM UTC-7, Bart wrote:
> On 28/09/2017 19:58, David Kleinecke wrote:
> 
> > Everything you say sounds right. But my question is ontological.
> > What exactly is THE token?
> 
> Below is the list of tokens I use in my C project. Each is represented 
> by some index.
> 
> Up to EOF, each token represents some special symbol or punctuation. 
> After that, they are mainly numbers, keywords and identifiers.
> 
> Those marked * have an associated value - number, string constant or 
> identifier.
> 
> Those marked + are split into two or more sub-tokens indicated by a 
> further value (different preprocessor directives for example).
> 
> These mostly serve both as pre-processor tokens, and tokens used for 
> parsing. (C is like that, nothing can be straightforward.)
> 
> Don't know if this helps.
> 
> ----------------------------------------
> 
> ERROR
> DOT
> IDOT   "->"
> LEXHASH
> HASH
> LITHASH
> HASHHASH
> COMMA
> SEMI
> COLON
> ASSIGN
> LBRACK
> RBRACK
> LSQ
> RSQ
> LCURLY   "{"
> RCURLY   "}"
> QUESTION
> CURL
> ELLIPSIS
> ADD
> SUB
> MUL
> DIV
> REM
> IOR  "|"
> IAND
> IXOR
> ORL  "||"
> ANDL
> SHL
> SHR
> INOT
> NOTL
> INCR
> DECR
> EQ
> NE
> LT
> LE
> GE
> GT
> ADDTO  "+="
> SUBTO
> MULTO
> DIVTO
> REMTO
> IORTO
> IANDTO
> IXORTO
> SHLTO
> SHRTO
> EOL
> EOF
> 
> RAWNUMBER *
> INTCONST *
> REALCONST *
> CHARCONST *
> WCHARCONST *
> STRINGCONST *
> WSTRINGCONST *
> WHITESPACE
> PLACEHOLDER
> 
> NAME *
> SOURCEDIR +
> PREDEFMACRO +
> 
> TYPESPEC +
> IF
> ELSE
> CASE
> DEFAULT
> FOR
> WHILE
> DO
> RETURN
> BREAK
> CONTINUE
> GOTO
> SWITCH
> STRUCT
> UNION
> LINKAGE +
> TYPEQUAL +
> FNSPEC
> ALIGNAS
> ENUM
> CALLCONV
> SIZEOF
> DEFINED
> GENERIC
> ALIGNOF
> 
> ----------------------------------------
> 
> (The above aren't the actual tokens used in the source, which are more 
> like kbreaksym rather than BREAK. As some would be bad choices for names.)
 
There are a lot fewer pre-defined tokens in my world because I am
sticking to C90. But I have a similar list.

[toc] | [prev] | [next] | [standalone]

#120514

From	bartc <bc@freeuk.com>
Date	2017-09-29 01:19 +0100
Message-ID	<HkgzB.803892$uh.558686@fx28.am4>
In reply to	#120508

On 29/09/2017 00:15, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 1:04:26 PM UTC-7, Bart wrote:

>> DOT
...

>> ALIGNOF

> There are a lot fewer pre-defined tokens in my world because I am
> sticking to C90. But I have a similar list.

I think there are 92 tokens above; earlier in the thread you mentioned 
100 tokens. So not so different.

But they may diverge from there because you said every constant and 
identifier will have its own token number. (Every unique constant and 
identifier, or will ABC followed by ABC be represented by two 
consecutive token numbers?)

(And also, some C99 and C11 keywords are recognised if they occur in 
source code, but are then ignored. So a wider range of inputs can be 
processed.)

-- 
bartc

[toc] | [prev] | [standalone]
