| Started by | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| First post | 2017-09-27 19:03 -0700 |
| Last post | 2017-09-29 01:19 +0100 |
| Articles | 20 on this page of 34 — 9 participants |
Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 19:03 -0700
Re: Lexical Elements "Pascal J. Bourguignon" <pjb@informatimago.com> - 2017-09-28 05:33 +0200
Re: Lexical Elements James Kuyper <jameskuyper@verizon.net> - 2017-09-28 00:19 -0400
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 22:09 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 08:31 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 11:53 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 12:16 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 15:51 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 16:42 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 12:37 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:16 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 18:39 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 19:47 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 20:29 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 22:36 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-29 08:47 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-29 11:23 -0700
Re: Lexical Elements Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-09-29 18:27 +0100
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 09:13 -0700
Re: Lexical Elements Richard Damon <Richard@Damon-Family.org> - 2017-09-28 08:15 -0400
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-27 21:03 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 22:16 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 09:45 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 11:58 -0700
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 12:29 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 15:52 -0700
Re: Lexical Elements Joe Pfeiffer <pfeiffer@cs.nmsu.edu> - 2017-09-28 17:40 -0600
Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 16:54 -0700
Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 12:40 -0700
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:12 -0700
Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-28 21:04 +0100
Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-28 22:12 +0100
Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:15 -0700
Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-29 01:19 +0100
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-27 19:03 -0700 |
| Subject | Lexical Elements |
| Message-ID | <cc7eadf6-6c89-4139-9050-3606d3c0ab01@googlegroups.com> |
I am having trouble reading the standard on this one small
point. I will quote the C89 standard but I have checked the
C11 standard and it seems to have exactly the same problem.

The question is - exactly what does the standard mean by
"lexical form" and "lexical elements". There is no
definition within the standard and (apart from a forward
reference in 5.1.1.2) they appear only in 6.4 (6.1
in C89).

There are expressions like "Each preprocessing token that is
converted to a token shall have the lexical form of a
keyword, an identifier, ...". This says each pp-token has
something called a "lexical form" which seems to have five
values (six in C89). The pp-tokens in my preprocessor are
unsigned ints (and are unchanged as tokens).

I observe that 5.1.1.2 says "decomposed". Perhaps - it
seems - the lexical form is the character string that
originated the token (minus the quotes on strings and
characters). Is this reading correct?

If so the "shall" that every pp-token that becomes a token
must have one of the approved shapes leaves us with possibly
many pp-tokens that don't become tokens. I assume that
what happens to them is more undefined behavior.
| From | "Pascal J. Bourguignon" <pjb@informatimago.com> |
|---|---|
| Date | 2017-09-28 05:33 +0200 |
| Message-ID | <m2o9pv33am.fsf@despina.home> |
| In reply to | #120416 |
David Kleinecke <dkleinecke@gmail.com> writes:
> I am having trouble reading the standard on this one small
> point. I will quote the C89 standard but I have checked the
> C11 standard and seems to have exactly the same problem.
>
> The question is - exactly what does the standard mean by
> "lexical form" and "lexical elements". There is no
> definition within the standard and (apart from a forward
> reference in 5.1.1.2) they appear only in 6.4 (6.1
> in C89).
>
> There are expressions like "Each preprocessing token that is
> converted to a token shall have the lexical form of a
> keyword, an identifier, ...". This says each pp-token has
> something called a "lexical form" which seems to have five
> values (six in C89). The pp-tokens in my preprocessor are
> unsigned ints (and are unchanged as tokens).
n1124.pdf mentions seven pp-tokens:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
non-white-space that cannot be one of the above
Assume a pre-processor/compiler that processes UTF-8 sources. So you
can have a string such as "Cet été était chaud." However, été still
cannot be an identifier. (You would have to write it as \u00e9t\u00e9).
So été is a pp-token (non-white-space that cannot be one of the above),
that is not an acceptable lexical form to be converted into a (C) token.
Or perhaps more precisely, since AFAICS, été should be interpreted by
the pre-processor as 3 pp-tokens, é t and é, so perhaps it should be:
é is a pp-token (non-white-space that cannot be one of the above),
that is not an acceptable lexical form to be converted into a (C)
token.
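A minimal sketch in C of the category list above, assuming an implementation that does not accept extended characters in identifiers. The names `pp_kind` and `classify_pp_token` are hypothetical, the classification looks only at how a pp-token starts, header-names (recognised only in `#include` directives), universal character names and C11's `u8`/`u`/`U` prefixes are left out, and an unterminated quote is not detected. The point it illustrates is that a byte of UTF-8 "é" falls through every named category and lands in the catch-all one.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Categories of preprocessing token, minus header-name, which is only
   recognised inside #include directives. */
enum pp_kind {
    PP_IDENTIFIER,
    PP_NUMBER,
    PP_CHAR_CONSTANT,
    PP_STRING_LITERAL,
    PP_PUNCTUATOR,
    PP_OTHER            /* non-white-space that cannot be one of the above */
};

/* Classify the pp-token that starts at s[0], judging only by its first
   characters. s must be non-empty and must not start with white space. */
enum pp_kind classify_pp_token(const char *s)
{
    assert(s != NULL && s[0] != '\0');
    unsigned char c = (unsigned char)s[0];

    /* Wide literals L'x' and L"x" must be tested before identifiers. */
    if (c == 'L' && s[1] == '\'') return PP_CHAR_CONSTANT;
    if (c == 'L' && s[1] == '"')  return PP_STRING_LITERAL;
    if (c == '\'') return PP_CHAR_CONSTANT;
    if (c == '"')  return PP_STRING_LITERAL;

    /* Only basic-source-character letters start an identifier here. */
    if (c < 0x80 && (isalpha(c) || c == '_')) return PP_IDENTIFIER;

    /* A pp-number starts with a digit, or with '.' followed by a digit. */
    if (isdigit(c) || (c == '.' && isdigit((unsigned char)s[1]))) return PP_NUMBER;

    /* Every C punctuator starts with one of these characters. */
    if (strchr("[](){}.-+&*~!/%<>=^|?:;,#", c) != NULL) return PP_PUNCTUATOR;

    return PP_OTHER;
}
```

Under these assumptions `classify_pp_token("\xc3\xa9t\xc3\xa9")` yields `PP_OTHER` for the first byte of "été", while `classify_pp_token("1Ex")` yields `PP_NUMBER`: both are perfectly good pp-tokens, and neither can be converted to a C token.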
> I observe that 5.1.1.2 says "decomposed". Perhaps - it
> seems - the lexical form is the character string that
> originated the token (minus the quotes on strings and
> characters). Is this reading correct?
in n1124.pdf:
3. The source file is decomposed into preprocessing tokens and
sequences of white-space characters (including comments). A source
file shall not end in a partial preprocessing token or in a
partial comment.
This "decomposed" verb only refers to the pre-processor parsing of the
source text.
> If so the "shall" that every pp-token that becomes a token
> must have one of the approved shapes leaves us with possibly
> many pp-tokens that don't become tokens. I assume that
> what happens to them is more undefined behavior.
Not really undefined: it will provoke a lexical error with the C
compiler.
Conceptually, we have two scanners
- the pre-processor scanner,
- the C scanner.
and they don't scan the same tokens!
You can use the pre-processor independently from the C compiler, on
non-C source files. It will transform text to text, and will leave
alone anything that it doesn't handle (i.e. anything it doesn't parse as
pre-processor directives or macros). So the pp-token categories must
cover all the possible text, including things like "é" that are lexical
errors for a C compiler.
However, when used as a C compiler front-end, we can avoid the
generation of textual output from the pre-processor, and re-scanning the
whole text by the C compiler. There's then a direct conversion of the
pre-processor pp-token sequence into a C token sequence. However, this
conversion can be performed only when the pp-token matches the syntax of
a C token. (From a text source, the C compiler would signal a lexical
error; in the case where this conversion occurs, the converter can
signal the lexical error).
Another example, which is explicitly given in n1124.pdf, is the
pp-number 1Ex. This pp-token doesn't have the lexical form of a C
number token. So it cannot be converted.
So the "lexical form" is the _pattern_ that corresponds to a C token.
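The 1Ex case can be sketched in C as two separate pattern checks: one for the deliberately loose pp-number grammar, and one for the stricter lexical form of a C99 integer or floating constant. The function names `pp_number_length` and `has_constant_form` are hypothetical, and universal character names inside pp-numbers are ignored; treat this as a sketch under those assumptions rather than a complete lexer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix of s matching the pp-number grammar,
   or 0 if s does not start with a pp-number. */
size_t pp_number_length(const char *s)
{
    assert(s != NULL);
    size_t i;
    if (isdigit((unsigned char)s[0]))
        i = 1;
    else if (s[0] == '.' && isdigit((unsigned char)s[1]))
        i = 2;
    else
        return 0;
    for (;;) {
        unsigned char c = (unsigned char)s[i];
        if (c != '\0' && strchr("eEpP", c) && (s[i + 1] == '+' || s[i + 1] == '-'))
            i += 2;                     /* exponent letter plus sign */
        else if (isalnum(c) || c == '_' || c == '.')
            i += 1;                     /* digit, identifier-nondigit or '.' */
        else
            break;
    }
    return i;
}

/* Advance *p over decimal (or hexadecimal) digits; return how many were seen. */
static size_t skip_digits(const char **p, bool hex)
{
    size_t n = 0;
    while (hex ? isxdigit((unsigned char)**p) : isdigit((unsigned char)**p)) {
        ++*p;
        ++n;
    }
    return n;
}

/* Valid integer suffixes, case-insensitive except that "lL"/"Ll" is not "ll". */
static bool integer_suffix_ok(const char *s)
{
    static const char *const ok[] = { "", "u", "l", "ul", "lu", "ll", "ull", "llu" };
    char low[4];
    size_t n = strlen(s);
    if (n > 3 || strstr(s, "lL") || strstr(s, "Ll"))
        return false;
    for (size_t i = 0; i <= n; ++i)
        low[i] = (char)tolower((unsigned char)s[i]);
    for (size_t i = 0; i < sizeof ok / sizeof ok[0]; ++i)
        if (strcmp(low, ok[i]) == 0)
            return true;
    return false;
}

static bool floating_suffix_ok(const char *s)
{
    return s[0] == '\0' || (strchr("fFlL", s[0]) != NULL && s[1] == '\0');
}

/* True if the whole string has the lexical form of a C99 integer or
   floating constant. */
bool has_constant_form(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    bool hex = p[0] == '0' && (p[1] == 'x' || p[1] == 'X');
    if (hex)
        p += 2;
    size_t whole = skip_digits(&p, hex);
    bool dot = *p == '.';
    if (dot)
        ++p;
    size_t frac = dot ? skip_digits(&p, hex) : 0;
    if (whole + frac == 0)
        return false;

    bool expo = false;
    if (hex ? (*p == 'p' || *p == 'P') : (*p == 'e' || *p == 'E')) {
        const char *q = p + 1;
        if (*q == '+' || *q == '-')
            ++q;
        if (skip_digits(&q, false) == 0)
            return false;               /* exponent with no digits, e.g. 1Ex */
        p = q;
        expo = true;
    }
    if (!dot && !expo) {                /* integer constant */
        if (!hex && s[0] == '0')
            for (const char *c = s; c < p; ++c)
                if (*c > '7')
                    return false;       /* 8 or 9 in an octal constant */
        return integer_suffix_ok(p);
    }
    if (hex && !expo)
        return false;                   /* hex floats need a p exponent */
    return floating_suffix_ok(p);
}
```

With these definitions, `pp_number_length("1Ex")` is 3 while `has_constant_form("1Ex")` is false. The same holds for `0x1e+2`, which is scanned as a single six-character pp-number and then fails to match any constant.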
--
__Pascal J. Bourguignon
http://www.informatimago.com
| From | James Kuyper <jameskuyper@verizon.net> |
|---|---|
| Date | 2017-09-28 00:19 -0400 |
| Message-ID | <oqht8d$ke3$1@dont-email.me> |
| In reply to | #120418 |
On 09/27/2017 11:33 PM, Pascal J. Bourguignon wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
...
> Assume a pre-processor/compiler that processes UTF-8 sources. So you
> can have a string such as "Cet été était chaud." However, été still
> cannot be an identifier. (You would have to write it as \u00e9t\u00e9).

A string literal can contain 'any member of the source character set
except the double-quote ", backslash \, or new-line character'
(6.4.5p1). If you can put those characters in a source code file, they
are allowed in a string literal.

>> I observe that 5.1.1.2 says "decomposed". Perhaps - it
>> If so the "shall" that every pp-token that becomes a token
>> must have one of the approved shapes leaves us with possibly
>> many pp-tokens that don't become tokens. I assume that
>> what happens to them is more undefined behavior.
>
> Not really undefined: it will provoke a lexical error with the C
> compiler.

A constraint violation, to be precise (6.4p2).
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-27 22:09 -0700 |
| Message-ID | <e1753783-684f-45a1-8600-e0230fa9ba81@googlegroups.com> |
| In reply to | #120420 |
On Wednesday, September 27, 2017 at 9:19:34 PM UTC-7, James Kuyper wrote:
> On 09/27/2017 11:33 PM, Pascal J. Bourguignon wrote:
> > David Kleinecke <dkleinecke@gmail.com> writes:
> ...
> > Assume a pre-processor/compiler that processes UTF-8 sources. So you
> > can have a string such as "Cet été était chaud." However, été still
> > cannot be an identifier. (You would have to write it as \u00e9t\u00e9).
>
> A string literal can contain 'any member of the source character set
> except the double-quote ", backslash \, or new-line character'
> (6.4.5p1). If you can put those characters in a source code file, they
> are allowed in a string literal.
>
> >> I observe that 5.1.1.2 says "decomposed". Perhaps - it
> >> If so the "shall" that every pp-token that becomes a token
> >> must have one of the approved shapes leaves us with possibly
> >> many pp-tokens that don't become tokens. I assume that
> >> what happens to them is more undefined behavior.
> >
> > Not really undefined: it will provoke a lexical error with the C
> > compiler.
>
> A constraint violation, to be precise (6.4p2).

I think C89 has no concept of "constraint violation" or
even of "violation". I can be faulted for using C89 of
course but the C89 section 3 does not make any
distinction between kinds of "shall". This seems to
be nit-picking to me.
| From | Keith Thompson <kst-u@mib.org> |
|---|---|
| Date | 2017-09-28 08:31 -0700 |
| Message-ID | <lnmv5eyh4k.fsf@kst-u.example.com> |
| In reply to | #120422 |
David Kleinecke <dkleinecke@gmail.com> writes:
[...]
> I think C89 has no concept of "constraint violation" or
> even of "violation". I can be faulted for using C89 of
> course but the C89 section 3 does not make any
> distinction between kinds of "shall". This seems to
> be nit-picking to me.
C89 certainly does have the concept of "constraint violation".
See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
A conforming implementation shall produce at least one diagnostic
message (identified in an implementation-defined manner) for
every translation unit that contains a violation of any syntax
rule or constraint. Diagnostic messages need not be produced
in other circumstances.
Search your copy of the standard, or draft, or whatever you have, for
the word "constraint". There are far too many to list here.
The distinction between a "shall" or "shall not" within or outside a
constraint is discussed in the definition of "undefined behavior", C90
3.16. Later editions moved that discussion to section 4, Conformance.
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 11:53 -0700 |
| Message-ID | <d7b80520-7c0a-461f-99db-1b58dac1ee15@googlegroups.com> |
| In reply to | #120446 |
On Thursday, September 28, 2017 at 8:31:32 AM UTC-7, Keith Thompson wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> [...]
> > I think C89 has no concept of "constraint violation" or
> > even of "violation". I can be faulted for using C89 of
> > course but the C89 section 3 does not make any
> > distinction between kinds of "shall". This seems to
> > be nit-picking to me.
>
> C89 certainly does have the concept of "constraint violation".
> See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
>
> A conforming implementation shall produce at least one diagnostic
> message (identified in an implementation-defined manner) for
> every translation unit that contains a violation of any syntax
> rule or constraint. Diagnostic messages need not be produced
> in other circumstances.

This is a nit almost too small to pick. It's 5.1.1.3 in C90. I
suspect you were looking at the pre-boilerplate version (when the
first three sections were added.) If so I indeed made an error.
I should have said C90 not C89.

Note that it does not differentiate between which kind of violation
occurred. I read "violation" here as the ordinary English word (it
is not in the index) and not a C technical term.

I think that this and other passages strongly suggest that I should
consider syntax and constraints together as an integrated system if
I want to get into the spirit of C. There is no a priori reason for
doing so and my old compiler integrated the constraints with code
generation rather than syntax. I am trying to move the constraints
over to the same processing as the syntax and that's where I
stumbled over the concept of "lexical form".
| From | jameskuyper@verizon.net |
|---|---|
| Date | 2017-09-28 12:16 -0700 |
| Message-ID | <eaed35b6-a83e-49e9-b2af-f9eff330ed56@googlegroups.com> |
| In reply to | #120476 |
On Thursday, September 28, 2017 at 2:53:59 PM UTC-4, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 8:31:32 AM UTC-7, Keith Thompson wrote:
> > David Kleinecke <dkleinecke@gmail.com> writes:
> > [...]
> > > I think C89 has no concept of "constraint violation" or
> > > even of "violation". I can be faulted for using C89 of
> > > course but the C89 section 3 does not make any
> > > distinction between kinds of "shall". This seems to
> > > be nit-picking to me.
> >
> > C89 certainly does have the concept of "constraint violation".
> > See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
> >
> > A conforming implementation shall produce at least one diagnostic
> > message (identified in an implementation-defined manner) for
> > every translation unit that contains a violation of any syntax
> > rule or constraint. Diagnostic messages need not be produced
> > in other circumstances.
>
> This is a nit almost too small to pick. It's 5.1.1.3 in C90. I
> suspect you were looking at the pre-boilerplate version (when the
> first three sections were added.) If so I indeed made an error.
> I should have said C90 not C89.

All of the relevant wording is the same in C89 and C90 - only the section
numbers are different. Therefore, the fact that what you said is false
about C89 means that it is also false about C90.

> Note that it does not differentiate between which kind of violation
> occurred. I read "violation" here as the ordinary English word (it
> is not in the index) and not a C technical term.

In particular, the wording that does distinguish between "shall" when it occurs
in a constraint, and "shall" when it appears in other parts of the standard, is
exactly the same:

"If a ``shall'' or ``shall not'' requirement that appears outside of
a constraint is violated, the behavior is undefined. ..."

Please acknowledge the fact that it says "outside of a constraint" - you seem to
be claiming that there's no such wording, despite it having been cited to you
several times.

> I think that this and other passages strongly suggest that I should
> consider syntax and constraints together as an integrated system if
> I want to get into the spirit of C.

Yes, the C standard says exactly the same thing about violations of syntax rules
that it does about violations of constraints: "A conforming implementation shall
produce at least one diagnostic message ...". Many constraint violations could
be converted into syntax errors by making the grammar substantially more
complicated. Many syntax errors could have been described as constraints, at
the cost of making the grammar significantly less meaningful.
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 15:51 -0700 |
| Message-ID | <c31f4a50-807c-46b1-86e7-928013546880@googlegroups.com> |
| In reply to | #120480 |
On Thursday, September 28, 2017 at 12:16:52 PM UTC-7, james...@verizon.net wrote:
> On Thursday, September 28, 2017 at 2:53:59 PM UTC-4, David Kleinecke wrote:
> > On Thursday, September 28, 2017 at 8:31:32 AM UTC-7, Keith Thompson wrote:
> > > David Kleinecke <dkleinecke@gmail.com> writes:
> > > [...]
> > > > I think C89 has no concept of "constraint violation" or
> > > > even of "violation". I can be faulted for using C89 of
> > > > course but the C89 section 3 does not make any
> > > > distinction between kinds of "shall". This seems to
> > > > be nit-picking to me.
> > >
> > > C89 certainly does have the concept of "constraint violation".
> > > See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
> > >
> > > A conforming implementation shall produce at least one diagnostic
> > > message (identified in an implementation-defined manner) for
> > > every translation unit that contains a violation of any syntax
> > > rule or constraint. Diagnostic messages need not be produced
> > > in other circumstances.
> >
> > This is a nit almost too small to pick. It's 5.1.1.3 in C90. I
> > suspect you were looking at the pre-boilerplate version (when the
> > first three sections were added.) If so I indeed made an error.
> > I should have said C90 not C89.
>
> All of the relevant wording is the same in C89 and C90 - only the section
> numbers are different. Therefore, the fact that what you said is false
> about C89 means that it is also false about C90.
>
> > Note that it does not differentiate between which kind of violation
> > occurred. I read "violation" here as the ordinary English word (it
> > is not in the index) and not a C technical term.
>
> In particular, the wording that does distinguish between "shall" when it occurs
> in a constraint, and "shall" when it appears in other parts of the standard, is
> exactly the same:
>
> "If a ``shall'' or ``shall not'' requirement that appears outside of
> a constraint is violated, the behavior is undefined. ..."
>
> Please acknowledge the fact that it says "outside of a constraint" - you seem to
> be claiming that there's no such wording, despite it having been cited to you
> several times.
>
> > I think that this and other passages strongly suggest that I should
> > consider syntax and constraints together as an integrated system if
> > I want to get into the spirit of C.
>
> Yes, the C standard says exactly the same thing about violations of syntax rules
> that it does about violations of constraints: "A conforming implementation shall
> produce at least one diagnostic message ...". Many constraint violations could
> be converted into syntax errors by making the grammar substantially more
> complicated. Many syntax errors could have been described as constraints, at
> the cost of making the grammar significantly less meaningful.

I see the wording you refer to (3.16 in my copy of C90) but
all it seems to me to say is that a constraint violation might
not be undefined behavior. But the next sentence seems to me
to say that is undefined behavior. It's hard to see how it
could fail to be undefined behavior. In my opinion this is all
just sloppy wording created when the first three sections
were added and there is nothing special about constraint
violations.
| From | jameskuyper@verizon.net |
|---|---|
| Date | 2017-09-28 16:42 -0700 |
| Message-ID | <f0abdabf-19a6-4a10-b462-9f67b3628b15@googlegroups.com> |
| In reply to | #120504 |
On Thursday, September 28, 2017 at 6:51:27 PM UTC-4, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 12:16:52 PM UTC-7, james...@verizon.net wrote:
...
> > "If a ``shall'' or ``shall not'' requirement that appears outside of
> > a constraint is violated, the behavior is undefined. ..."
...
> I see the wording you refer to (3.16 in my copy of C90) but
> all it seems to me to say is that a constraint violation might
> not be undefined behavior.

That says nothing one way or the other about the constraint violations, it only
talks about things that are NOT constraint violations.

> ... But the next sentence seems to me
> to say that is undefined behavior.

The sentence I've quoted above is part of the definition of "undefined
behavior", and identifies one of the three ways the C standard marks something
as undefined behavior. The "next sentence" you refer to lists the other two
ways:

"Undefined behavior is otherwise indicated in this Standard by the words
``undefined behavior'' or by the omission of any explicit definition of
behavior.".

Neither of those ways is specific to constraint violations.

The committee intends that every list of options provided by the C standard be
exhaustive - any case where that is not true constitutes a defect in the
standard. It follows that only things identified as having undefined behavior
in one of these three ways, have undefined behavior.

I believe that it's the case that no constraint rule specifies "undefined
behavior" explicitly, so the second case wouldn't apply. However, it also
happens to be the case that the standard never provides an explicit definition
of the behavior when a constraint is violated. I never realized this myself, I
had to have someone else point it out to me, after which I checked to make
sure. It's not an inherent feature of constraint rules, it's simply something
that happens to be true about all current constraint rules, and could be false
for the very next such rule that the committee creates or modifies.

Therefore, after generating the required diagnostic, if an implementation
chooses to continue translating the program, and if the user chooses to execute
the translated program, the behavior is indeed undefined. But it's not ONLY
undefined - a diagnostic is also required.

> It's hard to see how it
> could fail to be undefined behavior. In my opinion this is all
> just sloppy wording created when the first three sections
> were added ...

I don't have a copy of C90, only C89. All of the text I quoted for you is from
C89, prior to the addition of those three sections.

> ... and there is nothing special about constraint
> violations.

There is something very special about constraint violations. Along with syntax
rule violations, they are the only things to which the following requirement
applies:

"A conforming implementation shall produce at least one diagnostic message ...".
(C89 2.1.1.3)
| From | Keith Thompson <kst-u@mib.org> |
|---|---|
| Date | 2017-09-28 12:37 -0700 |
| Message-ID | <lnk20iwr59.fsf@kst-u.example.com> |
| In reply to | #120476 |
David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 8:31:32 AM UTC-7, Keith Thompson wrote:
>> David Kleinecke <dkleinecke@gmail.com> writes:
>> [...]
>> > I think C89 has no concept of "constraint violation" or
>> > even of "violation". I can be faulted for using C89 of
>> > course but the C89 section 3 does not make any
>> > distinction between kinds of "shall". This seems to
>> > be nit-picking to me.
>>
>> C89 certainly does have the concept of "constraint violation".
>> See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
>>
>> A conforming implementation shall produce at least one diagnostic
>> message (identified in an implementation-defined manner) for
>> every translation unit that contains a violation of any syntax
>> rule or constraint. Diagnostic messages need not be produced
>> in other circumstances.
>
> This is a nit almost too small to pick. It's 5.1.1.3 in C90. I
> suspect you were looking at the pre-boilerplate version (when the
> first three sections were added.) If so I indeed made an error.
> I should have said C90 not C89.
I wouldn't say it's *almost* too small to pick. I already quoted the
section numbers for both C89 and C90.
> Note that it does not differentiate between which kind of violation
> occurred. I read "violation" here as the ordinary English word (it
> is not in the index) and not a C technical term.
Correct. The syntax used by a compiler's parser needn't be 100%
consistent with the language grammar defined by the standard, as long as
correct code is parsed and analyzed correctly and required diagnostics
are issued.
> I think that this and other passages strongly suggest that I should
> consider syntax and constraints together as an integrated system if
> I want to get into the spirit of C. There is no a priori reason for
> doing so and my old compiler integrated the constraints with code
> generation rather than syntax. I am trying to move the constraints
> over to the same processing as the syntax and that's where I
> stumbled over the concept of "lexical form".
I don't know how you'd treat them as an "integrated system", though I
suppose it depends on just what you mean by that.
A compiler typically treats syntax and semantics separately. Syntactic
analysis, or parsing, is often semi-automated, with the parser generated
programmatically from a formal grammar -- or it might be written manually.
If parsing fails, that's a syntax error. Other errors, like trying to
apply a shift operator to a pointer value, have to be detected during
semantic analysis. As far as typical compiler internals are concerned,
they're fundamentally different kinds of errors. Tweaking the grammar
to treat some syntax errors as semantic errors can enable better
diagnostics in some cases.
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 16:16 -0700 |
| Message-ID | <d1352da5-c14d-445d-953a-b623686873f2@googlegroups.com> |
| In reply to | #120482 |
On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
>
> A compiler typically treats syntax and semantics separately. Syntactic
> analysis, or parsing, is often semi-automated, with the parser generated
> programmatically from a formal grammar -- or it might be written manually.
> If parsing fails, that's a syntax error. Other errors, like trying to
> apply a shift operator to a pointer value, have to be detected during
> semantic analysis. As far as typical compiler internals are concerned,
> they're fundamentally different kinds of errors. Tweaking the grammar
> to treat some syntax errors as semantic errors can enable better
> diagnostics in some cases.

My compilers also treat syntax and semantics separately. The
question is whether the constraints are part of syntax or part
of semantics. You imply that some are and some aren't. Quite possible
although I have no examples. I have concluded that what
I called the spirit of C expects them with the syntax but
I had them coded with the semantics. I have only begun
moving the constraints and haven't yet found any that
don't fit rather easily into the syntax.

My thinking about syntax is strongly influenced by the
tagmemic linguistics of Kenneth Pike and others.
| From | Keith Thompson <kst-u@mib.org> |
|---|---|
| Date | 2017-09-28 18:39 -0700 |
| Message-ID | <lning2uvtc.fsf@kst-u.example.com> |
| In reply to | #120509 |
David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
>>
>> A compiler typically treats syntax and semantics separately. Syntactic
>> analysis, or parsing, is often semi-automated, with the parser generated
>> programmatically from a formal grammar -- or it might be written manually.
>> If parsing fails, that's a syntax error. Other errors, like trying to
>> apply a shift operator to a pointer value, have to be detected during
>> semantic analysis. As far as typical compiler internals are concerned,
>> they're fundamentally different kinds of errors. Tweaking the grammar
>> to treat some syntax errors as semantic errors can enable better
>> diagnostics in some cases.
>
> My compilers also treat syntax and semantics separately. The
> question is whether the constraints are part of syntax or part
> of semantics.
Semantics, I'd say.
> You imply that some are and some aren't.
Did I?
> Quite possible
> although I have no examples. I have concluded that what
> I called the spirit of C expects them with the syntax but
> I had them coded with the semantics. I have only begun
> moving the constraints and haven't yet found any that
> don't fit rather easily into the syntax.
I suspect I really don't understand what you're saying.
Here's an example. C11 6.8.4.2p1 specifies the following constraint:
The controlling expression of a switch statement shall have integer
type.
Unless you extend the meaning of "syntax" beyond (my) recognition, you
won't be able to enforce that constraint using only syntax information.
Roughly, a syntax error is a failure to parse the source code in
accordance with a grammar that can be defined, for example, in BNF
(Backus-Naur Form). (C's treatment of typedefs punches a small hole in
this model.) Any errors that are not syntax errors are what I think of
as semantic errors. In C, this is more or less expressed as syntax
rules vs. constraints.
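A minimal sketch in C of why the switch constraint needs type information. Both `switch (n)` and `switch (x)` match the same grammar production, `switch ( expression ) statement`, so the check has to look at an attribute the parser does not supply. The names `type_kind`, `expr` and `check_switch` are hypothetical, and the type system is reduced to a handful of tags purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A deliberately tiny set of type tags for already-parsed expressions. */
enum type_kind { TY_CHAR, TY_INT, TY_UNSIGNED_LONG, TY_DOUBLE, TY_POINTER };

/* An expression as the semantic phase sees it: spelling plus computed type. */
struct expr {
    const char    *spelling;
    enum type_kind type;
};

static bool is_integer_type(enum type_kind t)
{
    return t == TY_CHAR || t == TY_INT || t == TY_UNSIGNED_LONG;
}

/* Enforce "the controlling expression of a switch statement shall have
   integer type". Returns NULL when the constraint holds, otherwise the
   text of the diagnostic a conforming implementation must produce. */
const char *check_switch(const struct expr *controlling)
{
    assert(controlling != NULL);
    if (!is_integer_type(controlling->type))
        return "controlling expression of switch does not have integer type";
    return NULL;
}
```

Given `struct expr x = { "x", TY_DOUBLE };`, `check_switch(&x)` returns a diagnostic even though `switch (x) { ... }` parses without error. The grammar never sees the `double`.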
[...]
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 19:47 -0700 |
| Message-ID | <2d14fe25-9390-4711-b8e2-1cdefc3d56bf@googlegroups.com> |
| In reply to | #120515 |
On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> > On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
> >>
> >> A compiler typically treats syntax and semantics separately. Syntactic
> >> analysis, or parsing, is often semi-automated, with the parser generated
> >> programmatically from a formal grammar -- or it might be written manually.
> >> If parsing fails, that's a syntax error. Other errors, like trying to
> >> apply a shift operator to a pointer value, have to be detected during
> >> semantic analysis. As far as typical compiler internals are concerned,
> >> they're fundamentally different kinds of errors. Tweaking the grammar
> >> to treat some syntax errors as semantic errors can enable better
> >> diagnostics in some cases.
> >
> > My compilers also treat syntax and semantics separately. The
> > question is whether the constraints are part of syntax or part
> > of semantics.
>
> Semantics, I'd say.
>
> > You imply that some are and some aren't.
>
> Did I?
>
> > Quite possible
> > although I have no examples. I have concluded that what
> > I called the spirit of C expects them with the syntax but
> > I had them coded with the semantics. I have only begun
> > moving the constraints and haven't yet found any that
> > don't fit rather easily into the syntax.
>
> I suspect I really don't understand what you're saying.
>
> Here's an example. C11 6.8.4.2p1 specifies the following constraint:
>
> The controlling expression of a switch statement shall have integer
> type.
>
> Unless you extend the meaning of "syntax" beyond (my) recognition, you
> won't be able to enforce that constraint using only syntax information.
>
> Roughly, a syntax error is a failure to parse the source code in
> accordance with a grammar that can be defined, for example, in BNF
> (Backus-Naur Form). (C's treatment of typedefs punches a small hole in
> this model.) Any errors that are not syntax errors are what I think of
> as semantic errors. In C, this is more or less expressed as syntax
> rules vs. constraints.

I understand syntax, perhaps, like a linguist would. In all the
linguistic work I know what fills a slot (a slot like the identifier
slot in a switch statement) can be sub-categorized to a subset of
all identifiers. In this case the identifier must have the attribute
"integer". The token (already identified as an identifier) is
further sub-categorized by being declared, for example, an "int".

The source of my concern about what a token actually is comes
from this accumulation of additional attributes - which would
include constant and volatile as well as static/extern.
| From | jameskuyper@verizon.net |
|---|---|
| Date | 2017-09-28 20:29 -0700 |
| Message-ID | <9ab07b89-0566-4dab-8934-fee01c772883@googlegroups.com> |
| In reply to | #120517 |
On Thursday, September 28, 2017 at 10:48:03 PM UTC-4, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
> > David Kleinecke <dkleinecke@gmail.com> writes:
> > > On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
> > >>
> > >> A compiler typically treats syntax and semantics separately. Syntactic
> > >> analysis, or parsing, is often semi-automated, with the parser generated
> > >> programmatically from a formal grammar -- or it might be written manually.
> > >> If parsing fails, that's a syntax error. Other errors, like trying to
> > >> apply a shift operator to a pointer value, have to be detected during
> > >> semantic analysis. As far as typical compiler internals are concerned,
> > >> they're fundamentally different kinds of errors. Tweaking the grammar
> > >> to treat some syntax errors as semantic errors can enable better
> > >> diagnostics in some cases.
> > >
> > > My compilers also treat syntax and semantics separately. The
> > > question is whether the constraints are part of syntax or part
> > > of semantics.
> >
> > Semantics, I'd say.
> >
> > > You imply that some are and some aren't.
> >
> > Did I?
> >
> > > Quite possible
> > > although I have no examples. I have concluded that what
> > > I called the spirit of C expects them with the syntax but
> > > I had them coded with the semantics. I have only begun
> > > moving the constraints and haven't yet found any that
> > > don't fit rather easily into the syntax.
> >
> > I suspect I really don't understand what you're saying.
> >
> > Here's an example. C11 6.8.4.2p1 specifies the following constraint:
> >
> > The controlling expression of a switch statement shall have integer
> > type.
> >
> > Unless you extend the meaning of "syntax" beyond (my) recognition, you
> > won't be able to enforce that constraint using only syntax information.
> >
> > Roughly, a syntax error is a failure to parse the source code in
> > accordance with a grammar that can be defined, for example, in BNF
> > (Backus-Naur Form). (C's treatment of typedefs punches a small hole in
> > this model.) Any errors that are not syntax errors are what I think of
> > as semantic errors. In C, this is more or less expressed as syntax
> > rules vs. constraints.
>
> I understand syntax, perhaps, like a linguist would. In
> all the linguistic work I know what fills a slot (a slot
> like the identifier slot in a switch statement) can be
> sub-categorized to a subset of all identifiers. In this
> case the identifier must have the attribute "integer".
> The token (already identified as an identifier) is further
> sub-categorized by being declared, for example, an "int".
>
> The source of my concern about what a token actually is
> comes from this accumulation of additional attributes -
> which would include constant and volatile as well as
> static/extern.

An identifier token identifies something. The thing that the identifier
identifies can have all kinds of attributes - which should be stored in the data
structure which corresponds to the thing which the identifier identifies. The
same identifier may be used in many different places within the same scope and
name space; each of those occurrences is a different token - so the attributes
you're talking about shouldn't be associated with those tokens.

Consider "i = i*i;" That statement consists of 6 different tokens, three of
which are identically spelled and identify the same object. Information about
that object should be stored in whatever structure you use to represent that
object, not in the three different structures you use to represent those three
tokens. I'm using "structure" here in a generic sense, not a C sense. In your
compiler the structure that represents a token is apparently a single integer
containing a code value. The information about the object identified by that
token (presumably including the spelling of that identifier) is stored in what
you called a "token data record".
That is why I would prefer to call it an object data record. An identifier token
can also identify "a function; a tag or a member of a structure, union, or
enumeration; a typedef name; a label name". The information you need to store is
quite different in each of those cases.
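
The split described above, between one record per occurrence and one record per identified thing, might be sketched roughly as follows. Every name here (object_record, token, intern_object, tokenize_example) is invented for the illustration, scopes and name spaces are ignored, and the six tokens of "i = i*i;" are built by hand rather than by a real lexer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical layout for illustration only; not taken from any
   compiler mentioned in this thread. */
enum { MAX_OBJECTS = 16, MAX_TOKENS = 32 };

/* One record per identified thing: the attributes live here. */
struct object_record {
    const char *spelling;
    int is_const;
    int is_volatile;
};

/* One record per occurrence in the source text. */
struct token {
    char punct;   /* punctuator character, or 0 for an identifier     */
    int object;   /* index into objects[] if punct == 0, otherwise -1 */
};

static struct object_record objects[MAX_OBJECTS];
static int object_count;

/* Return the index of the record for `spelling`, creating it on first use.
   A real compiler would also consult scope and name space. */
int intern_object(const char *spelling)
{
    for (int i = 0; i < object_count; i++)
        if (strcmp(objects[i].spelling, spelling) == 0)
            return i;
    assert(object_count < MAX_OBJECTS);
    objects[object_count].spelling = spelling;
    objects[object_count].is_const = 0;
    objects[object_count].is_volatile = 0;
    return object_count++;
}

/* Fill `out` with the six tokens of "i = i*i;" and return the count. */
int tokenize_example(struct token out[MAX_TOKENS])
{
    int n = 0;
    out[n++] = (struct token){ 0, intern_object("i") };
    out[n++] = (struct token){ '=', -1 };
    out[n++] = (struct token){ 0, intern_object("i") };
    out[n++] = (struct token){ '*', -1 };
    out[n++] = (struct token){ 0, intern_object("i") };
    out[n++] = (struct token){ ';', -1 };
    return n;
}
```

The three identifier tokens are distinct array elements, yet all three carry the same index, so an attribute such as is_const is recorded once rather than three times.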
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-28 22:36 -0700 |
| Message-ID | <018dcabf-62cd-4567-b053-555eb4f6c76b@googlegroups.com> |
| In reply to | #120520 |
On Thursday, September 28, 2017 at 8:29:57 PM UTC-7, james...@verizon.net wrote:
> On Thursday, September 28, 2017 at 10:48:03 PM UTC-4, David Kleinecke wrote:
> > On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
> > > David Kleinecke <dkleinecke@gmail.com> writes:
> > > > On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
> > > >>
> > > >> A compiler typically treats syntax and semantics separately. Syntactic
> > > >> analysis, or parsing, is often semi-automated, with the parser generated
> > > >> programmatically from a formal grammar -- or it might be written manually.
> > > >> If parsing fails, that's a syntax error. Other errors, like trying to
> > > >> apply a shift operator to a pointer value, have to be detected during
> > > >> semantic analysis. As far as typical compiler internals are concerned,
> > > >> they're fundamentally different kinds of errors. Tweaking the grammar
> > > >> to treat some syntax errors as semantic errors can enable better
> > > >> diagnostics in some cases.
> > > >
> > > > My compilers also treat syntax and semantics separately. The
> > > > question is whether the constraints are part of syntax or part
> > > > of semantics.
> > >
> > > Semantics, I'd say.
> > >
> > > > You imply that some are and some aren't.
> > >
> > > Did I?
> > >
> > > > Quite possible
> > > > although I have no examples. I have concluded that what
> > > > I called the spirit of C expects them with the syntax but
> > > > I had them coded with the semantics. I have only begun
> > > > moving the constraints and haven't yet found any that
> > > > don't fit rather easily into the syntax.
> > >
> > > I suspect I really don't understand what you're saying.
> > >
> > > Here's an example. C11 6.8.4.2p1 specifies the following constraint:
> > >
> > > The controlling expression of a switch statement shall have integer
> > > type.
> > >
> > > Unless you extend the meaning of "syntax" beyond (my) recognition, you
> > > won't be able to enforce that constraint using only syntax information.
> > >
> > > Roughly, a syntax error is a failure to parse the source code in
> > > accordance with a grammar that can be defined, for example, in BNF
> > > (Backus-Naur Form). (C's treatment of typedefs punches a small hole in
> > > this model.) Any errors that are not syntax errors are what I think of
> > > as semantic errors. In C, this is more or less expressed as syntax
> > > rules vs. constraints.
> >
> > I understand syntax, perhaps, like a linguist would. In
> > all the linguistic work I know what fills a slot (a slot
> > like the identifier slot in a switch statement) can be
> > sub-categorized to a subset of all identifiers. In this
> > case the identifier must have the attribute "integer".
> > The token (already identified as an identifier) is further
> > sub-categorized by being declared, for example, an "int".
> >
> > The source of my concern about what a token actually is
> > comes from this accumulation of additional attributes -
> > which would include constant and volatile as well as
> > static/extern.
>
> An identifier token identifies something. The thing that the identifier
> identifies can have all kinds of attributes - which should be stored in the data
> structure which corresponds to the thing which the identifier identifies. The
> same identifier may be used in many different places within the same scope and
> name space; each of those occurrences is a different token - so the attributes
> you're talking about shouldn't be associated with those tokens.
>
> Consider "i = i*i;" That statement consists of 6 different tokens, three of
> which are identically spelled and identify the same object. Information about
> that object should be stored in whatever structure you use to represent that
> object, not in the three different structures you use to represent those three
> tokens. I'm using "structure" here in a generic sense, not a C sense. In your
> compiler the structure that represents a token is apparently a single integer
> containing a code value. The information about the object identified by that
> token (presumably including the spelling of that identifier) is stored in what
> you called a "token data record".
> That is why I would prefer to call it an object data record. An identifier token
> can also identify "a function; a tag or a member of a structure, union, or
> enumeration; a typedef name; a label name". The information you need to store is
> quite different in each of those cases.

As nearly as I can tell we are both saying the same thing.
I don't see why you insist on "correcting" me.
| From | Keith Thompson <kst-u@mib.org> |
|---|---|
| Date | 2017-09-29 08:47 -0700 |
| Message-ID | <lna81dv75a.fsf@kst-u.example.com> |
| In reply to | #120517 |
David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
[...]
>> I suspect I really don't understand what you're saying.
>>
>> Here's an example. C11 6.8.4.2p1 specifies the following constraint:
>>
>> The controlling expression of a switch statement shall have integer
>> type.
>>
>> Unless you extend the meaning of "syntax" beyond (my) recognition, you
>> won't be able to enforce that constraint using only syntax information.
>>
>> Roughly, a syntax error is a failure to parse the source code in
>> accordance with a grammar that can be defined, for example, in BNF
>> (Backus-Naur Form). (C's treatment of typedefs punches a small hole in
>> this model.) Any errors that are not syntax errors are what I think of
>> as semantic errors. In C, this is more or less expressed as syntax
>> rules vs. constraints.
>
> I understand syntax, perhaps, like a linguist would. In
> all the linguistic work I know what fills a slot (a slot
> like the identifier slot in a switch statement) can be
> sub-categorized to a subset of all identifiers. In this
> case the identifier must have the attribute "integer".
> The token (already identified as an identifier) is further
> sub-categorized by being declared, for example, an "int".
I'm not a linguist, but I suspect that you're looking at syntax in a way
that's not useful for analyzing C.
There is no "identifier slot" in a switch statement. What follows the
"switch" keyword is an expression. That expression can be arbitrarily
complex, but it must be of some integer type. The attribute "integer",
if you have such a thing, would need to apply to (your internal
representation of) that expression, not to some identifier.
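
As a rough illustration of that point, the check could be written against an expression node rather than against any identifier. This is only a sketch under stated assumptions: the struct expr layout and the names is_integer_type and check_switch_controlling_expr are invented here, the type list is drastically shortened (real C has many more integer types), and the type field is assumed to have been filled in by an earlier semantic pass.

```c
#include <assert.h>
#include <stddef.h>

/* Drastically simplified stand-in for a compiler's notion of type. */
enum type_kind { TY_INT, TY_LONG, TY_DOUBLE, TY_POINTER };

enum expr_kind { EX_IDENT, EX_BINARY };

/* Hypothetical expression node; `type` is assumed to be computed by
   semantic analysis, not by the parser. */
struct expr {
    enum expr_kind kind;
    enum type_kind type;
    const struct expr *lhs, *rhs;   /* operands, NULL when unused */
};

int is_integer_type(enum type_kind t)
{
    return t == TY_INT || t == TY_LONG;
}

/* Returns 1 if the controlling expression satisfies the constraint,
   0 if a diagnostic is required. */
int check_switch_controlling_expr(const struct expr *e)
{
    assert(e != NULL);
    return is_integer_type(e->type);
}
```

The function never asks whether the expression is an identifier; a node for "x + 1" and a node for "x" are treated alike, and only the computed type matters.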
> The source of my concern about what a token actually is
> comes from this accumulation of additional attributes -
> which would include constant and volatile as well as
> static/extern.
In a typical compiler design, input is split into tokens by the "lexer".
Each token is derived from a sequence of characters in the source. One
kind of token is an identifier. At the token level, an identifier has
no type and does not refer to any declaration; those concepts occur
later in processing. All you need to know about it at that point is
that it's an identifier and how it's spelled.
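
A toy lexer along these lines might look like the sketch below. It is an illustration only, with invented names (token_kind, next_token, count_tokens), and it handles nothing but identifiers and single-character punctuators; constants, string literals, multi-character operators and comments are left out, and spellings longer than 31 characters are silently truncated.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum token_kind { TK_IDENTIFIER, TK_PUNCTUATOR, TK_END };

/* At this level a token is only a kind and a spelling:
   no type, no link to any declaration. */
struct token {
    enum token_kind kind;
    char spelling[32];
};

/* Read one token starting at *src and advance *src past it. */
struct token next_token(const char **src)
{
    struct token t = { TK_END, "" };
    const char *p = *src;
    size_t n = 0;

    while (isspace((unsigned char)*p))
        p++;

    if (isalpha((unsigned char)*p) || *p == '_') {
        t.kind = TK_IDENTIFIER;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < sizeof t.spelling)   /* truncate overlong names */
                t.spelling[n++] = *p;
            p++;
        }
    } else if (*p != '\0') {
        t.kind = TK_PUNCTUATOR;
        t.spelling[n++] = *p++;
    }
    t.spelling[n] = '\0';
    *src = p;
    return t;
}

/* Count the tokens in a source string. */
int count_tokens(const char *src)
{
    int n = 0;
    while (next_token(&src).kind != TK_END)
        n++;
    return n;
}
```

Fed "i = i*i;", this produces three identifier tokens spelled "i" and three punctuators; nothing in the output says what "i" refers to.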
The stream of tokens is consumed by the parser, which does all the
syntactic analysis. The parser might build some data structure that
reflects the syntax (declarations, statements, function definitions,
etc.). That structure may or may not use the token structures built by
the lexer, but it will at least need to annotate them with extra
information.
But as for const/volatile/static/extern, those are attributes that
should apply to a declaration, not to an identifier. The parser would
figure out, for example, which declaration a given occurrence of an
identifier refers to.
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
| From | David Kleinecke <dkleinecke@gmail.com> |
|---|---|
| Date | 2017-09-29 11:23 -0700 |
| Message-ID | <12e83aa5-0e76-4085-a898-d2f6858d05f0@googlegroups.com> |
| In reply to | #120546 |
On Friday, September 29, 2017 at 8:47:46 AM UTC-7, Keith Thompson wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> > On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
> [...]
> >> I suspect I really don't understand what you're saying.
> >>
> >> Here's an example. C11 6.8.4.2p1 specifies the following constraint:
> >>
> >> The controlling expression of a switch statement shall have integer
> >> type.
> >>
> >> Unless you extend the meaning of "syntax" beyond (my) recognition, you
> >> won't be able to enforce that constraint using only syntax information.
> >>
> >> Roughly, a syntax error is a failure to parse the source code in
> >> accordance with a grammar that can be defined, for example, in BNF
> >> (Backus-Naur Form). (C's treatment of typedefs punches a small hole in
> >> this model.) Any errors that are not syntax errors are what I think of
> >> as semantic errors. In C, this is more or less expressed as syntax
> >> rules vs. constraints.
> >
> > I understand syntax, perhaps, like a linguist would. In
> > all the linguistic work I know what fills a slot (a slot
> > like the identifier slot in a switch statement) can be
> > sub-categorized to a subset of all identifiers. In this
> > case the identifier must have the attribute "integer".
> > The token (already identified as an identifier) is further
> > sub-categorized by being declared, for example, an "int".
>
> I'm not a linguist, but I suspect that you're looking at syntax in a way
> that's not useful for analyzing C.
>
> There is no "identifier slot" in a switch statement. What follows the
> "switch" keyword is an expression. That expression can be arbitrarily
> complex, but it must be of some integer type. The attribute "integer",
> if you have such a thing, would need to apply to (your internal
> representation of) that expression, not to some identifier.
>
> > The source of my concern about what a token actually is
> > comes from this accumulation of additional attributes -
> > which would include constant and volatile as well as
> > static/extern.
>
> In a typical compiler design, input is split into tokens by the "lexer".
> Each token is derived from a sequence of characters in the source. One
> kind of token is an identifier. At the token level, an identifier has
> no type and does not refer to any declaration; those concepts occur
> later in processing. All you need to know about it at that point is
> that it's an identifier and how it's spelled.
>
> The stream of tokens is consumed by the parser, which does all the
> syntactic analysis. The parser might build some data structure that
> reflects the syntax (declarations, statements, function definitions,
> etc.). That structure may or may not use the token structures built by
> the lexer, but it will at least need to annotate them with extra
> information.
>
> But as for const/volatile/static/extern, those are attributes that
> should apply to a declaration, not to an identifier. The parser would
> figure out, for example, which declaration a given occurrence of an
> identifier refers to.

I am, of course, quite aware of the usual way of doing
things and BISON and so on. But I have decided that that
is not the best way to go - best in the sense of being
able to grasp what is going on (and not efficiency or
speed or ...).

I do not show the parser any idents. All the parser ever
sees is what I call tokens. In the current case these are
integer indexes into an array of data (of diverse kinds).
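
For readers trying to picture that arrangement, one possible shape is sketched below. It is a guess at the general idea, not the actual code: the record kinds, the tagged union, and the function names (add_text_record, add_constant_record, token_kind, token_value) are all assumptions made for the illustration.

```c
#include <assert.h>

/* Hypothetical kinds of data a token index might refer to. */
enum record_kind { REC_KEYWORD, REC_IDENTIFIER, REC_CONSTANT, REC_PUNCTUATOR };

/* A tagged union, so that records "of diverse kinds" share one array. */
struct record {
    enum record_kind kind;
    union {
        const char *text;   /* keyword, identifier or punctuator spelling */
        long value;         /* value of an integer constant               */
    } u;
};

enum { MAX_RECORDS = 64 };
static struct record records[MAX_RECORDS];
static int record_count;

/* In this sketch a token is nothing more than an index into records[]. */
typedef int token;

token add_text_record(enum record_kind kind, const char *text)
{
    assert(kind != REC_CONSTANT);
    assert(record_count < MAX_RECORDS);
    records[record_count].kind = kind;
    records[record_count].u.text = text;
    return record_count++;
}

token add_constant_record(long value)
{
    assert(record_count < MAX_RECORDS);
    records[record_count].kind = REC_CONSTANT;
    records[record_count].u.value = value;
    return record_count++;
}

/* What a parser would look at when it is handed a bare index. */
enum record_kind token_kind(token t)
{
    assert(t >= 0 && t < record_count);
    return records[t].kind;
}

long token_value(token t)
{
    assert(token_kind(t) == REC_CONSTANT);
    return records[t].u.value;
}
```

With this layout the parser only ever handles small integers, and anything it needs to know about a token is found by indexing the array.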
| From | Ben Bacarisse <ben.usenet@bsb.me.uk> |
|---|---|
| Date | 2017-09-29 18:27 +0100 |
| Message-ID | <87efqpfm9d.fsf@bsb.me.uk> |
| In reply to | #120517 |
David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
>> David Kleinecke <dkleinecke@gmail.com> writes:
>> > On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
>> >>
>> >> A compiler typically treats syntax and semantics separately. Syntactic
>> >> analysis, or parsing, is often semi-automated, with the parser generated
>> >> programmatically from a formal grammar -- or it might be written manually.
>> >> If parsing fails, that's a syntax error. Other errors, like trying to
>> >> apply a shift operator to a pointer value, have to be detected during
>> >> semantic analysis. As far as typical compiler internals are concerned,
>> >> they're fundamentally different kinds of errors. Tweaking the grammar
>> >> to treat some syntax errors as semantic errors can enable better
>> >> diagnostics in some cases.
>> >
>> > My compilers also treat syntax and semantics separately. The
>> > question is whether the constraints are part of syntax or part
>> > of semantics.
>>
>> Semantics, I'd say.
>>
>> > You imply that some are and some aren't.
>>
>> Did I?
>>
>> > Quite possible
>> > although I have no examples. I have concluded that what
>> > I called the spirit of C expects them with the syntax but
>> > I had them coded with the semantics. I have only begun
>> > moving the constraints and haven't yet found any that
>> > don't fit rather easily into the syntax.
>>
>> I suspect I really don't understand what you're saying.
>>
>> Here's an example. C11 6.8.4.2p1 specifies the following constraint:
>>
>> The controlling expression of a switch statement shall have integer
>> type.
>>
>> Unless you extend the meaning of "syntax" beyond (my) recognition, you
>> won't be able to enforce that constraint using only syntax information.
>>
>> Roughly, a syntax error is a failure to parse the source code in
>> accordance with a grammar that can be defined, for example, in BNF
>> (Backus-Naur Form). (C's treatment of typedefs punches a small hole in
>> this model.) Any errors that are not syntax errors are what I think of
>> as semantic errors. In C, this is more or less expressed as syntax
>> rules vs. constraints.
>
> I understand syntax, perhaps, like a linguist would. In
> all the linguistic work I know what fills a slot (a slot
> like the identifier slot in a switch statement) can be
> sub-categorized to a subset of all identifiers. In this
> case the identifier must have the attribute "integer".
> The token (already identified as an identifier) is further
> sub-categorized by being declared, for example, an "int".

Perhaps you could show us the grammar you use in which the given
constraint is a syntax error? Such a grammar does exist as you, as a
linguist, will know. It's not the grammar in the C standard, and it may
well be so much more complex that it can't be parsed efficiently, but
knowing what it is before embarking on this part of the project will
help you immensely.

Keith is correct in that most constraints are not syntax errors, taking
the language syntax to be that described by the grammar in the standard.
You are correct in that any set of static properties of a text (and the
constraints in the C standard are all checkable by inspecting the text
of one translation unit alone) can be turned into a grammar such that
violating one or more of them becomes a matter solely of syntax.
However, there are sound reasons why that approach is not taken in most
programming languages.

What seems more likely is that you are planning to parse the syntax as
in the standard C grammar and maintain extra information as you go that
can be used to diagnose constraint violations. That's how every C
compiler I have ever seen does it.

> The source of my concern about what a token actually is
> comes from this accumulation of additional attributes -
> which would include constant and volatile as well as
> static/extern.

This sounds like the usual approach.

--
Ben.
| From | jameskuyper@verizon.net |
|---|---|
| Date | 2017-09-28 09:13 -0700 |
| Message-ID | <8ae5497b-14f0-477e-bd15-ffe7883b6855@googlegroups.com> |
| In reply to | #120422 |
On Thursday, September 28, 2017 at 1:09:24 AM UTC-4, David Kleinecke wrote:
> On Wednesday, September 27, 2017 at 9:19:34 PM UTC-7, James Kuyper wrote:
> > On 09/27/2017 11:33 PM, Pascal J. Bourguignon wrote:
> > > David Kleinecke <dkleinecke@gmail.com> writes:
...
> > >> I observe that 5.1.1.2 says "decomposed". Perhaps - it
> > >> If so the "shall" that every pp-token that becomes a token
> > >> must have one of the approved shapes leaves us with possibly
> > >> many pp-tokens that don't become tokens. I assume that
> > >> what happens to them is more undefined behavior.
> > >
> > > Not really undefined: it will provoke a lexical error with the C
> > > compiler.
> >
> > A constraint violation, to be precise (6.4p2).
>
> I think C89 has no concept of "constraint violation" or
> even of "violation". I can be faulted for using C89 of
> course but the C89 section 3 does not make any
> distinction between kinds of "shall". This seems to
> be nit-picking to me.

My copy of C90 doesn't have paragraph or line numbers, so the following
citations are not as specific as the ones I usually provide:

C90 1.6 starts out with:
"In this Standard, ``shall'' is to be interpreted as a requirement on an
implementation or on a program; conversely, ``shall not'' is to be
interpreted as a prohibition."

Under the definition of "undefined behavior", it says:
"If a ``shall'' or ``shall not'' requirement that appears outside of a
constraint is violated, the behavior is undefined. ..."

The fact that this rule applies only to a shall "that appears outside of
a constraint" seems pretty clear to me.

Farther along in the same section, it goes ahead and explicitly defines
what "Constraints" means:
" * Constraints --- syntactic and semantic restrictions by which the
exposition of language elements is to be interpreted."

Section 2.1.1.3:
"A conforming implementation shall produce at least one diagnostic
message (identified in an implementation-defined manner) for every
translation unit that contains a violation of any syntax rule or
constraint. ..."

There are 52 different sections in the C89 standard labelled
"Constraints". One of them is under section 3.1, and says the same thing
that the current standard says about this particular issue:
"Each preprocessing token that is converted to a token shall have the
lexical form of a keyword, an identifier, a constant, a string literal,
an operator, or a punctuator."

If you were under the impression that C89 had no concept of constraints,
I can only conclude that you've never bothered to carefully read the
document.
| From | Richard Damon <Richard@Damon-Family.org> |
|---|---|
| Date | 2017-09-28 08:15 -0400 |
| Message-ID | <GJ5zB.186245$OC1.129884@fx06.iad> |
| In reply to | #120418 |
On 9/27/17 11:33 PM, Pascal J. Bourguignon wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
>
>> I am having trouble reading the standard on this one small
>> point. I will quote the C89 standard but I have checked the
>> C11 standard and it seems to have exactly the same problem.
>>
>> The question is - exactly what does the standard mean by
>> "lexical form" and "lexical elements". There is no
>> definition within the standard and (apart from a forward
>> reference in 5.1.1.2) they only appear in 6.4 (6.1
>> in C89).
>>
>> There are expressions like "Each preprocessing token that is
>> converted to a token shall have the lexical form of a
>> keyword, an identifier, ...". This says each pp-token has
>> something called a "lexical form" which seems to have five
>> values (six in C89). The pp-tokens in my preprocessor are
>> unsigned ints (and are unchanged as tokens.
>
> n1124.pdf mentions seven pp-tokens:
>
> header-name
> identifier
> pp-number
> character-constant
> string-literal
> punctuator
> non-white-space that cannot be one of the above
>
> Assume a pre-processor/compiler that processes UTF-8 sources. So you
> can have a string such as "Cet été était chaud." However, été still
> cannot be an identifier. (You would have to write it as \u00e9t\u00e9).
>
> So été is a pp-token (non-white-space that cannot be one of the above),
> that is not an acceptable lexical form to be converted into a (C) token.
>

Small correction: 6.4.2.1 lists, for the definition of
identifier-nondigit (the characters an identifier is allowed to begin
with):

    nondigit (the letters and _)
    universal-character-name
    other implementation-defined characters

Because of the last term, an implementation is allowed to define that é
is part of the implementation-defined character set for nondigits, and
thus été is possibly a valid identifier (subject to the above
implementation-defined behavior).

There is perhaps a slight suggestion that an implementation that accepts
wide characters (i.e. UTF-8 source files) allow the UTF-8 codes that are
the equivalent of the values listed in Annex D as identifier characters.
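
The first two alternatives in that list can be checked mechanically, while the third cannot be decided without knowing the implementation. The sketch below (function names invented for the illustration) accepts only the portable spellings: letters, underscore, digits after the first character, and universal character names written as \uXXXX or \UXXXXXXXX. It is deliberately incomplete: it does not consult the Annex D ranges, does not reject keywords, and treats any other byte, including raw UTF-8, as "not portable" rather than "invalid".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of a universal character name at p, or 0 if there is none.
   Only the spelling is checked, not the Annex D ranges. */
static size_t ucn_length(const char *p)
{
    size_t digits, i;

    if (p[0] != '\\')
        return 0;
    if (p[1] == 'u')
        digits = 4;
    else if (p[1] == 'U')
        digits = 8;
    else
        return 0;
    for (i = 0; i < digits; i++)
        if (!isxdigit((unsigned char)p[2 + i]))
            return 0;
    return 2 + digits;
}

/* nondigit: underscore or a letter (in the default "C" locale). */
static int is_nondigit(char c)
{
    return c == '_' || isalpha((unsigned char)c);
}

/* Returns 1 if s is spelled using only nondigits, digits (not first)
   and universal character names; 0 otherwise. */
int is_portable_identifier(const char *s)
{
    int first = 1;
    size_t n;

    if (*s == '\0')
        return 0;
    while (*s != '\0') {
        if ((n = ucn_length(s)) > 0)
            s += n;
        else if (is_nondigit(*s) || (!first && isdigit((unsigned char)*s)))
            s++;
        else
            return 0;
        first = 0;
    }
    return 1;
}
```

Under these assumptions the spelling \u00e9t\u00e9 is accepted, while the raw UTF-8 bytes of été are reported as not portable, which matches the observation that their acceptance is left to the implementation.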