| Started by | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| First post | 2017-09-27 19:03 -0700 |
| Last post | 2017-09-29 01:19 +0100 |
| Articles | 14 on this page of 34 — 9 participants |
Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 19:03 -0700
Re: Lexical Elements "Pascal J. Bourguignon" <pjb@informatimago.com> - 2017-09-28 05:33 +0200
Re: Lexical Elements James Kuyper <jameskuyper@verizon.net> - 2017-09-28 00:19 -0400
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 22:09 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 08:31 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 11:53 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 12:16 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 15:51 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 16:42 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 12:37 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:16 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 18:39 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 19:47 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 20:29 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 22:36 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-29 08:47 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-29 11:23 -0700
Re: Lexical Elements Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-09-29 18:27 +0100
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 09:13 -0700
Re: Lexical Elements Richard Damon <Richard@Damon-Family.org> - 2017-09-28 08:15 -0400
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-27 21:03 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 22:16 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 09:45 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 11:58 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 12:29 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 15:52 -0700
Re: Lexical Elements Joe Pfeiffer <pfeiffer@cs.nmsu.edu> - 2017-09-28 17:40 -0600
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 16:54 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 12:40 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:12 -0700
Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-28 21:04 +0100
Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-28 22:12 +0100
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:15 -0700
Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-29 01:19 +0100
| From | jameskuyper@verizon.net |
|---|---|
| Date | 2017-09-27 21:03 -0700 |
| Message-ID | <4e3d4467-fd5b-42f1-9e6a-335cf1ce88bb@googlegroups.com> |
| In reply to | #120416 |
On Wednesday, September 27, 2017 at 10:03:34 PM UTC-4, David Kleinecke wrote:
> I am having trouble reading the standard on this one small
> point. I will quote the C89 standard but I have checked the
> C11 standard and it seems to have exactly the same problem.
>
> The question is - exactly what does the standard mean by
> "lexical form" and "lexical elements". There is no
> definition within the standard and (apart from a forward
> reference in 5.1.1.2) they only appear in 6.4 (6.1
> in C89).
>
> There are expressions like "Each preprocessing token that is
> converted to a token shall have the lexical form of a
> keyword, an identifier, ...". This says each pp-token has
> something called a "lexical form" which seems to have five
> values (six in C89). The pp-tokens in my preprocessor are
> unsigned ints (and are unchanged as tokens).

Each preprocessing token corresponds to and is defined in terms of a sequence of source code characters; the same is true of a token. Section 6.4 describes which sequences of source code characters qualify as each type of preprocessing token or token. I could understand using unsigned ints to represent the punctuators and character constants, but I don't see how that's a reasonable way to represent any of the other kinds of preprocessing tokens. What unsigned integer do you use to represent the identifier department_sales_per_month? There are a lot of different possible identifiers, more than can easily be represented as distinct unsigned integer values on most platforms. The same is true of string literals.

I'd expect most of the types of preprocessing tokens to be represented internally by strings of source code characters, at least until translation phase 7, when they can finally be parsed and converted into other forms.

> I observe that 5.1.1.2 says "decomposed". Perhaps - it
> seems - the lexical form is the character string that
> originated the token (minus the quotes on strings and
> characters). Is this reading correct?

I don't believe so. The lexical form of a string literal includes the quote characters, as well as the prefix, if any (see 6.4.5p1). Similar comments apply to character constants. You might be able to get away with considering those things to be implied by the fact that they have been parsed as string literals or character constants, so you don't actually have to store the single or double quotes. However, when it says "lexical form", it's referring to the entire sequence of source code characters that make up a given token, even the ones that you choose not to store explicitly.

> If so the "shall" that every pp-token that becomes a token
> must have one of the approved shapes leaves us with possibly
> many pp-tokens that don't become tokens. I assume that
> what happens to them is more undefined behavior.

A "shall" indicates undefined behavior only if it does not occur in a constraints section (4p2). That "shall" occurs in 6.4p2, which is a constraints section. Therefore, preprocessing tokens which violate this rule are constraint violations. At least one diagnostic is required for any program that contains such a violation. As with all constraint violations, if you wish, your compiler can reject such code. If it chooses to continue translation, and if the user chooses to execute the resulting program despite having received the diagnostic, the behavior is indeed undefined - but first there must be a diagnostic.
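As an editorial illustration of the "pp-tokens that don't become tokens" point: the constraint in 6.4p2 only covers preprocessing tokens that are actually converted to tokens. The sketch below is not from the thread; `STR`, `pp_only` and `pp_only_is` are hypothetical names. It shows two pp-numbers that match no constant grammar but are consumed by the `#` operator during preprocessing, so they never reach the conversion step.

```c
#include <assert.h>
#include <string.h>

/* STR stringizes its argument during preprocessing (translation phase 4),
   so the argument itself is never converted to a token in phase 7. */
#define STR(x) #x

/* Each argument is a single pp-number (C11 6.4.8) that matches no constant
   grammar. Written as a bare expression, either one would need a diagnostic. */
static const char *const pp_only[] = { STR(1.2.3), STR(12abc) };

/* Returns 1 if the i-th stringized spelling equals expected, 0 otherwise. */
int pp_only_is(size_t i, const char *expected)
{
    assert(i < sizeof pp_only / sizeof pp_only[0]);
    return strcmp(pp_only[i], expected) == 0;
}
```

Only the resulting string literals take part in phase 7; the original pp-numbers survive solely as the characters inside them.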
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-27 22:16 -0700 |
| Message-ID | <027ced74-afdc-4db1-aa14-7b3fb7a22295@googlegroups.com> |
| In reply to | #120419 |
On Wednesday, September 27, 2017 at 9:03:42 PM UTC-7, james...@verizon.net wrote:
> On Wednesday, September 27, 2017 at 10:03:34 PM UTC-4, David Kleinecke wrote:
[...]
> > The pp-tokens in my preprocessor are
> > unsigned ints (and are unchanged as tokens.
>
> Each preprocessing token corresponds to and is defined in terms of a sequence of
> source code characters; the same is true of a token. Section 6.4 describes which
> sequences of source code characters qualify as each type of preprocessing token
> or token. I could understand using unsigned ints to represent the punctuators
> and character constants, but I don't see how that's a reasonable way to
> represent any of the other kinds of preprocessing tokens. What unsigned integer
> do you use to represent the identifier department_sales_per_month? There's a lot
> of different possible identifiers, more than easily be represented as distinct
> unsigned integer values on most platforms. The same is true of string literals.

There are around one hundred pre-defined tokens which I number from 1 to whatever. Then each new identifier, number, string or character gets the next token number. The tokens are indexes into an array of token data records (currently a union of several different structs).

I admit to being surprised anyone would use any different approach, although instead of indexes I might have used data pointers.

[...]
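The numbering scheme described in this post can be sketched roughly as follows. This is an editorial sketch, not David's code: the names `intern_token`, `token_spelling`, `FIRST_DYNAMIC` and `MAX_TOKENS` are hypothetical, the table sizes are arbitrary, and a plain struct stands in for the union of structs he mentions. Numbers below `FIRST_DYNAMIC` are assumed to be reserved for the pre-defined keywords and punctuators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { FIRST_DYNAMIC = 128, MAX_TOKENS = 1024 };  /* hypothetical limits */

enum token_kind { TK_IDENTIFIER, TK_NUMBER, TK_STRING, TK_CHARACTER };

/* One token data record; the token number is its index in the table. */
struct token_record {
    enum token_kind kind;
    char *spelling;            /* the source characters of the token */
};

static struct token_record table[MAX_TOKENS];
static unsigned next_token = FIRST_DYNAMIC;

/* Returns the existing number for (kind, spelling), or assigns the next one. */
unsigned intern_token(enum token_kind kind, const char *spelling)
{
    size_t n;
    char *copy;

    for (unsigned i = FIRST_DYNAMIC; i < next_token; i++)
        if (table[i].kind == kind && strcmp(table[i].spelling, spelling) == 0)
            return i;

    assert(next_token < MAX_TOKENS);
    n = strlen(spelling) + 1;
    copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, spelling, n);

    table[next_token].kind = kind;
    table[next_token].spelling = copy;
    return next_token++;
}

/* Looks up the stored spelling for a dynamically assigned token number. */
const char *token_spelling(unsigned tok)
{
    assert(tok >= FIRST_DYNAMIC && tok < next_token);
    return table[tok].spelling;
}
```

In this sketch the unsigned value is only a handle; the character sequence still lives in the record, which is the distinction drawn in the reply that follows. The linear search is for brevity; a real table would more likely use hashing.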
| From | jameskuyper@verizon.net |
|---|---|
| Date | 2017-09-28 09:45 -0700 |
| Message-ID | <5e681899-3b70-44c5-b919-5dccadf62672@googlegroups.com> |
| In reply to | #120424 |
On Thursday, September 28, 2017 at 1:17:07 AM UTC-4, David Kleinecke wrote:
[...]
> There are around one hundred pre-defined tokens which I
> number from 1 to whatever. ...

Those presumably correspond to keywords and punctuators.

> ... Then each new identifier, number,
> string or character gets the next token number. The tokens
> are indexes into an array of token data records (currently
> a union of several different structs).
>
> I admit to being surprised anyone would use any different
> approach although instead of indexes I might have used
> data pointers.

I would probably use a similar data structure, but I would describe it differently: your token data records are (presumably) where you store the actual lexical form of the token; your indices just identify which token data record the token is stored in.

For each identifier, you need to retain the full sequence of characters that make up the identifier, so your compiler can recognize that a later occurrence of the same identifier is in fact an occurrence of the same identifier. It needn't be the exact same sequence that appeared in the source code: you need to recognize members of the extended character set as matches to corresponding UCNs, and you should match 8-digit UCNs which start with 0000 to the corresponding 4-digit UCNs, which implies that what you actually store should be a normalized version of the identifier.

You can only dispose of that information once you've applied all of C's rules about scopes, linkage, and name spaces of identifiers, which don't apply until translation phase 7. Even then, you must retain such information for identifiers with external linkage until linkage has been resolved, in translation phase 8.

Similarly, you have to retain the complete list of source code characters corresponding to each string literal or character constant until translation phase 5, where they get converted into corresponding characters in the execution character set - which must then be stored somewhere in the translated program, unless the string turns out to be unused and therefore discardable. For most string literals, you can never completely discard the string of characters.
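One part of the normalization mentioned above, matching 8-digit UCNs that start with 0000 to their 4-digit form, could look something like this. It is an editorial sketch under stated assumptions: `normalize_ucns` and `hex4` are hypothetical names, hex digits are upper-cased so that spellings differing only in case compare equal, and the mapping between extended source characters and UCNs is not handled at all.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if p starts with four hexadecimal digits, 0 otherwise. */
static int hex4(const char *p)
{
    for (int i = 0; i < 4; i++)
        if (!isxdigit((unsigned char)p[i]))
            return 0;
    return 1;
}

/* Copies src to dst, rewriting \U0000XXXX and \uXXXX as \uXXXX with
   upper-case hex digits. dst needs room for strlen(src) + 1 bytes. */
void normalize_ucns(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        const char *digits = NULL;

        if (src[0] == '\\' && src[1] == 'u' && hex4(src + 2))
            digits = src + 2;
        else if (src[0] == '\\' && src[1] == 'U' &&
                 strncmp(src + 2, "0000", 4) == 0 && hex4(src + 6))
            digits = src + 6;

        if (digits != NULL) {
            *dst++ = '\\';
            *dst++ = 'u';
            for (int i = 0; i < 4; i++)
                *dst++ = (char)toupper((unsigned char)digits[i]);
            src = digits + 4;
        } else {
            *dst++ = *src++;   /* ordinary character, or a UCN left as written */
        }
    }
    *dst = '\0';
}
```

With the normalized spelling stored in the token record, two occurrences of the same identifier compare equal with a plain `strcmp`, however each was written in the source.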
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 11:58 -0700 |
| Message-ID | <3c9e515f-7fa2-43e0-8b0d-749586b74554@googlegroups.com> |
| In reply to | #120457 |
On Thursday, September 28, 2017 at 9:45:17 AM UTC-7, james...@verizon.net wrote:
[...]
> Similarly, you have to retain the complete list of source code characters
> corresponding to each string literal or character constant until translation
> phase 5, where they get converted into corresponding characters in the execution
> character set - which must then be stored somewhere in the translated program,
> unless the string turns out to be unused and therefore discardable. For most
> string literals, you can never completely discard the string of characters.

Everything you say sounds right. But my question is ontological. What exactly is THE token?
| From | jameskuyper@verizon.net |
|---|---|
| Date | 2017-09-28 12:29 -0700 |
| Message-ID | <8c07cb3a-242f-4f5c-9430-522ee20b888c@googlegroups.com> |
| In reply to | #120477 |
On Thursday, September 28, 2017 at 2:58:45 PM UTC-4, David Kleinecke wrote:
[...]
> Everything you say sounds right. But my question is ontological.
> What exactly is THE token?

If what I said doesn't constitute a sufficient answer to your question, then I don't understand your question. Maybe if you concentrate on practical consequences of your ontological question, I could help. Give me a practical question about how your compiler should do something, whose answer depends upon the answer to that ontological question. It's entirely possible that I can answer the practical question, without understanding the connection you see between that question and the ontological one.

If the ontological question's answer doesn't have any practical consequences, you really don't need that answer, however much you might like to have it.
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 15:52 -0700 |
| Message-ID | <022ed8a4-f2b6-48f6-b3e3-23760998ef32@googlegroups.com> |
| In reply to | #120481 |
On Thursday, September 28, 2017 at 12:29:41 PM UTC-7, james...@verizon.net wrote:
[...]
> If what I said doesn't constitute a sufficient answer to your question, then I
> don't understand your question. Maybe if you concentrate on practical
> consequences of your ontological question, I could help. Give me a practical
> question about how your compiler should do something, whose answer depends upon
> the answer to that ontological question. It's entirely possible that I can
> answer the practical question, without understanding the connection you see
> between that question and the ontological one.
> If the ontological question's answer doesn't have any practical consequences,
> you really don't need that answer, however much you might like to have it.

I don't see how I can conform to what the standard expects without knowing what a token is. What is your definition of a token?
| From | Joe Pfeiffer <pfeiffer@cs.nmsu.edu> |
|---|---|
| Date | 2017-09-28 17:40 -0600 |
| Message-ID | <1bo9pugznx.fsf@pfeifferfamily.net> |
| In reply to | #120505 |
David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 12:29:41 PM UTC-7, james...@verizon.net wrote:
>>
>> If what I said doesn't constitute a sufficient answer to your question, then I
>> don't understand your question. Maybe if you concentrate on practical
>> consequences of your ontological question, I could help. Give me a practical
>> question about how your compiler should do something, whose answer depends upon
>> the answer to that ontological question. It's entirely possible that I can
>> answer the practical question, without understanding the connection you see
>> between that question and the ontological one.
>> If the ontological question's answer doesn't have any practical consequences,
>> you really don't need that answer, however much you might like to have it.
>
> I don't see how I can conform to what the standard expects without
> knowing what a token is. What is your definition of a token?

Operationally, it's what the lexical analyzer hands off to the parser.
| From | jameskuyper@verizon.net |
|---|---|
| Date | 2017-09-28 16:54 -0700 |
| Message-ID | <2b789c99-64e5-4fca-8491-daa67bb49314@googlegroups.com> |
| In reply to | #120505 |
On Thursday, September 28, 2017 at 6:53:11 PM UTC-4, David Kleinecke wrote:
...
> I don't see how I can conform to what the standard expects without
> knowing what a token is. What is your definition of a token?

A token is any preprocessing-token whose corresponding string of source code characters matches, during translation phase 7 (5.1.1.2p7), the grammar rule specified for a 'token' in section 6.4p1. A preprocessing token, in turn, is any sequence of source code characters identified during translation phase 3 (5.1.1.2p3) as matching the grammar rule for "preprocessing-token" in section 6.4p1.

By specifying that these rules apply only during specific translation phases, I'm emphasizing that the processing specified for the other translation phases affects what qualifies as a preprocessing-token or as a token.

If there's anything you find unclear about that, it would be best if you try to convert your uncertainty into a practical question about what is permitted or prohibited with regard to how you implement the standard's rules. Your usual questions are too abstract to be answered in a way that you'll understand, when only you know what the words you're using actually mean to you.
| From | Keith Thompson <kst-u@mib.org> |
|---|---|
| Date | 2017-09-28 12:40 -0700 |
| Message-ID | <lnfub6wr0l.fsf@kst-u.example.com> |
| In reply to | #120477 |
David Kleinecke <dkleinecke@gmail.com> writes:
[...]
> Everything you say sounds right. But my question is ontological.
> What exactly is THE token?
As far as the language is concerned, a token is a syntactic element of a
C translation unit, a sequence of characters in a source file. Given:
int x = 42;
you have 5 tokens, consisting of 3, 1, 1, 2, and 1 characters
respectively. Tokens are distinguished from non-token character
sequences (like "42;" in the above) syntactically.
A compiler will typically have some internal representation of a token.
The details of that representation are unspecified, and generally are
whatever is convenient for use in the compiler.
Does that answer your question?
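The 3, 1, 1, 2, 1 count can be reproduced with a toy scanner. This is only a sketch under stated assumptions: the function name `toy_token_lengths` is made up, and it recognizes nothing beyond identifiers or keywords, runs of decimal digits, and one-character punctuators, so it is far from a conforming lexer.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical toy scanner. Stores the length of each token found in
   src (up to max entries) and returns the total number of tokens. */
size_t toy_token_lengths(const char *src, size_t *lengths, size_t max)
{
    size_t count = 0;
    const char *p = src;

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {   /* white space separates tokens */
            p++;
            continue;
        }
        const char *start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {
            /* identifier or keyword */
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
        } else if (isdigit((unsigned char)*p)) {
            /* run of decimal digits */
            while (isdigit((unsigned char)*p))
                p++;
        } else {
            p++;                            /* one-character punctuator */
        }
        if (count < max)
            lengths[count] = (size_t)(p - start);
        count++;
    }
    return count;
}
```

Called on `"int x = 42;"` it reports five tokens; `42;` is never reported as a unit because the digit run stops at the semicolon.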
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 16:12 -0700 |
| Message-ID | <7f17072d-c202-4590-8f56-0d003a2e8c39@googlegroups.com> |
| In reply to | #120483 |
On Thursday, September 28, 2017 at 12:40:50 PM UTC-7, Keith Thompson wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> [...]
> > Everything you say sounds right. But my question is ontological.
> > What exactly is THE token?
>
> As far as the language is concerned, a token is a syntactic element of a
> C translation unit, a sequence of characters in a source file. Given:
>
> int x = 42;
>
> you have 5 tokens, consisting of 3, 1, 1, 2, and 1 characters
> respectively. Tokens are distinguished from non-token character
> sequences (like "42;" in the above) syntactically.
>
> A compiler will typically have some internal representation of a token.
> The details of that representation are unspecified, and generally are
> whatever is convenient for use in the compiler.
>
> Does that answer your question?
I think that is the dominant idea but I see no reason for it. I would
rather use the word "token" to mean the internal representation which
has the overt character string as one of its attributes.
Consider the situation where an identifier is re-used within a block.
The new (inner) meaning is different than the old (outer) meaning but
the character strings are the same. I implement the situation by
creating a new token (which is destroyed when the code leaves the
block).
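One possible reading of that scheme, sketched in C99 rather than C90 for brevity. Everything here is an assumption made for illustration (the `entry` record, the fixed-size table, and the function names); it is not the implementation described in the post. Each record carries its spelling as one attribute, an inner declaration creates a new record, and leaving the block destroys it:

```c
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 64

/* Hypothetical record: the spelling is just one attribute. */
struct entry {
    const char *spelling;
    int depth;   /* block nesting level at which it was created */
    int id;      /* distinguishes records that share a spelling */
};

static struct entry table[MAX_ENTRIES];
static size_t used = 0;
static int depth = 0;
static int next_id = 1;

void enter_block(void) { depth++; }

/* Leaving a block destroys every record created inside it. */
void leave_block(void)
{
    if (depth == 0)
        return;
    while (used > 0 && table[used - 1].depth == depth)
        used--;
    depth--;
}

/* Creates a record at the current depth; returns its id, or 0 if full. */
int declare(const char *spelling)
{
    if (used == MAX_ENTRIES)
        return 0;
    table[used].spelling = spelling;
    table[used].depth = depth;
    table[used].id = next_id;
    used++;
    return next_id++;
}

/* The innermost record wins; returns 0 if the spelling is unknown. */
int lookup(const char *spelling)
{
    for (size_t i = used; i > 0; i--)
        if (strcmp(table[i - 1].spelling, spelling) == 0)
            return table[i - 1].id;
    return 0;
}
```

Declaring `x`, entering a block, and declaring `x` again yields two records with the same spelling; after `leave_block()` a lookup finds the outer one again.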
| From | bartc <bc@freeuk.com> |
|---|---|
| Date | 2017-09-28 21:04 +0100 |
| Message-ID | <6BczB.1462043$nU3.1408556@fx43.am4> |
| In reply to | #120477 |
On 28/09/2017 19:58, David Kleinecke wrote:
> Everything you say sounds right. But my question is ontological.
> What exactly is THE token?
Below is the list of tokens I use in my C project. Each is represented
by some index.
Up to EOF, each token represents some special symbol or punctuation.
After that, they are mainly numbers, keywords and identifiers.
Those marked * have an associated value - number, string constant or
identifier.
Those marked + are split into two or more sub-tokens indicated by a
further value (different preprocessor directives for example).
These mostly serve both as pre-processor tokens, and tokens used for
parsing. (C is like that, nothing can be straightforward.)
Don't know if this helps.
----------------------------------------
ERROR
DOT
IDOT "->"
LEXHASH
HASH
LITHASH
HASHHASH
COMMA
SEMI
COLON
ASSIGN
LBRACK
RBRACK
LSQ
RSQ
LCURLY "{"
RCURLY "}"
QUESTION
CURL
ELLIPSIS
ADD
SUB
MUL
DIV
REM
IOR "|"
IAND
IXOR
ORL "||"
ANDL
SHL
SHR
INOT
NOTL
INCR
DECR
EQ
NE
LT
LE
GE
GT
ADDTO "+="
SUBTO
MULTO
DIVTO
REMTO
IORTO
IANDTO
IXORTO
SHLTO
SHRTO
EOL
EOF
RAWNUMBER *
INTCONST *
REALCONST *
CHARCONST *
WCHARCONST *
STRINGCONST *
WSTRINGCONST *
WHITESPACE
PLACEHOLDER
NAME *
SOURCEDIR +
PREDEFMACRO +
TYPESPEC +
IF
ELSE
CASE
DEFAULT
FOR
WHILE
DO
RETURN
BREAK
CONTINUE
GOTO
SWITCH
STRUCT
UNION
LINKAGE +
TYPEQUAL +
FNSPEC
ALIGNAS
ENUM
CALLCONV
SIZEOF
DEFINED
GENERIC
ALIGNOF
----------------------------------------
(The above aren't the actual tokens used in the source, which are more
like kbreaksym rather than BREAK, as some would be bad choices for names.)
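"Represented by some index" can be sketched as an enumeration with a parallel name table. This covers only a handful of the entries listed above, and the `tok_` prefix and the function names are assumptions, not the names used in the actual project:

```c
/* Hypothetical subset of the list above; the real project uses
   different names and many more entries. */
enum token_kind {
    tok_error,
    tok_dot,
    tok_semi,
    tok_assign,
    tok_intconst,   /* marked *: carries a value */
    tok_name,       /* marked *: carries a value */
    tok_count
};

static const char *const token_names[tok_count] = {
    [tok_error]    = "ERROR",
    [tok_dot]      = "DOT",
    [tok_semi]     = "SEMI",
    [tok_assign]   = "ASSIGN",
    [tok_intconst] = "INTCONST",
    [tok_name]     = "NAME",
};

/* Maps an index back to its printable name. */
const char *token_kind_name(enum token_kind kind)
{
    if ((int)kind < 0 || (int)kind >= tok_count)
        return "?";
    return token_names[kind];
}

/* Nonzero for the kinds marked * in the list. */
int token_kind_has_value(enum token_kind kind)
{
    return kind == tok_intconst || kind == tok_name;
}
```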
--
bartc
| From | bartc <bc@freeuk.com> |
|---|---|
| Date | 2017-09-28 22:12 +0100 |
| Message-ID | <hBdzB.1604944$tu4.400471@fx35.am4> |
| In reply to | #120486 |
On 28/09/2017 21:04, bartc wrote:
> On 28/09/2017 19:58, David Kleinecke wrote:
>
>> Everything you say sounds right. But my question is ontological.
>> What exactly is THE token?
>
> Below is the list of tokens I use in my C project. Each is represented
> by some index.
...
> GENERIC
> ALIGNOF
>
> ----------------------------------------
>
> (The above aren't the actual tokens used in the source, which are more
> like kbreaksym rather than BREAK, as some would be bad choices for names.)
The tokens as declared in the actual project (not C despite it saying C):
https://pastebin.com/P5P3gcF1
As for what a token is, each one is more completely represented in my
implementation by this struct (in C this time):
struct _tokenrec { // 32 bytes total
union {
int64 value;
double xvalue;
uint64 uvalue;
byte * svalue;
strec * symptr;
};
struct _tokenrec* nexttoken;
union {
struct {
byte subcode;
byte flags;
};
uint16 subcodex;
};
byte symbol;
byte fileno;
int32 lineno;
int32 length;
union {
int32 numberoffset;
int32 paramno;
int32 pasteno;
};
};
The token index is stored in .symbol. How the other members are used
depends on the token. For simple ones ('left bracket') nothing else is
needed.
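The struct above relies on `byte`, `int64`, `strec` and similar names whose definitions are not shown. The sketch below restates the same member order with `<stdint.h>` types as an assumption, checks the "32 bytes total" comment where pointers are 8 bytes, and shows one way `.symbol` might select the meaningful union member. The `SK_` constants, `symbol_sketch`, and `describe_token` are invented for illustration. Anonymous unions and `_Static_assert` need C11.

```c
#include <stdint.h>
#include <stdio.h>

struct symbol_sketch;   /* stands in for strec, which is not shown */

/* Same member order as the struct above, with assumed fixed-width types. */
struct tokenrec_sketch {
    union {
        int64_t value;
        double xvalue;
        uint64_t uvalue;
        unsigned char *svalue;
        struct symbol_sketch *symptr;
    };
    struct tokenrec_sketch *nexttoken;
    union {
        struct {
            uint8_t subcode;
            uint8_t flags;
        };
        uint16_t subcodex;
    };
    uint8_t symbol;
    uint8_t fileno;
    int32_t lineno;
    int32_t length;
    union {
        int32_t numberoffset;
        int32_t paramno;
        int32_t pasteno;
    };
};

/* Checked only where pointers are 8 bytes; other ABIs may differ. */
_Static_assert(sizeof(void *) != 8 || sizeof(struct tokenrec_sketch) == 32,
               "expected 32 bytes with 64-bit pointers");

/* Hypothetical token indices for this sketch only. */
enum { SK_LBRACK = 1, SK_INTCONST = 2, SK_STRINGCONST = 3 };

/* .symbol decides which union member, if any, is read. */
int describe_token(const struct tokenrec_sketch *t, char *buf, size_t n)
{
    switch (t->symbol) {
    case SK_INTCONST:
        return snprintf(buf, n, "INTCONST %lld", (long long)t->value);
    case SK_STRINGCONST:
        return snprintf(buf, n, "STRINGCONST %s", (const char *)t->svalue);
    default:   /* simple tokens such as SK_LBRACK need nothing else */
        return snprintf(buf, n, "symbol %d", t->symbol);
    }
}
```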
--
bartc
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 16:15 -0700 |
| Message-ID | <63836496-889f-470c-9257-ee8ad959b93e@googlegroups.com> |
| In reply to | #120486 |
On Thursday, September 28, 2017 at 1:04:26 PM UTC-7, Bart wrote:
> On 28/09/2017 19:58, David Kleinecke wrote:
>
> > Everything you say sounds right. But my question is ontological.
> > What exactly is THE token?
>
> Below is the list of tokens I use in my C project. Each is represented
> by some index.
> ...
>
> (The above aren't the actual tokens used in the source, which are more
> like kbreaksym rather than BREAK, as some would be bad choices for names.)
There are a lot fewer pre-defined tokens in my world because I am
sticking to C90. But I have a similar list.
| From | bartc <bc@freeuk.com> |
|---|---|
| Date | 2017-09-29 01:19 +0100 |
| Message-ID | <HkgzB.803892$uh.558686@fx28.am4> |
| In reply to | #120508 |
On 29/09/2017 00:15, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 1:04:26 PM UTC-7, Bart wrote:
>> DOT
...
>> ALIGNOF
> There are a lot fewer pre-defined tokens in my world because I am
> sticking to C90. But I have a similar list.
I think there are 92 tokens above; earlier in the thread you mentioned
100 tokens. So not so different.
But they may diverge from there because you said every constant and
identifier will have its own token number. (Every unique constant and
identifier, or will ABC followed by ABC be represented by two
consecutive token numbers?)
(And also, some C99 and C11 keywords are recognised if they occur in
source code, but are then ignored. So a wider range of inputs can be
processed.)
--
bartc
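The "every unique identifier" reading amounts to interning spellings, so that ABC followed by ABC maps to one number. A minimal sketch of that reading only; the table size and the name `intern` are assumptions, and the thread does not say which scheme is actually used:

```c
#include <stddef.h>
#include <string.h>

#define MAX_SPELLINGS 128

static const char *spellings[MAX_SPELLINGS];
static size_t spelling_count = 0;

/* Returns a stable index for the spelling, or -1 if the table is full.
   The pointer is stored as-is, so the caller must keep the string alive. */
int intern(const char *spelling)
{
    for (size_t i = 0; i < spelling_count; i++)
        if (strcmp(spellings[i], spelling) == 0)
            return (int)i;          /* seen before: reuse the number */
    if (spelling_count == MAX_SPELLINGS)
        return -1;
    spellings[spelling_count] = spelling;
    return (int)spelling_count++;   /* first occurrence: new number */
}
```

Under the other reading, each occurrence would get a fresh number and the lookup loop would simply be dropped.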