Lexical Elements

Started by	David Kleinecke <dkleinecke@gmail.com>
First post	2017-09-27 19:03 -0700
Last post	2017-09-29 01:19 +0100
Articles	20 on this page of 34 — 9 participants

  Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 19:03 -0700
    Re: Lexical Elements "Pascal J. Bourguignon" <pjb@informatimago.com> - 2017-09-28 05:33 +0200
      Re: Lexical Elements James Kuyper <jameskuyper@verizon.net> - 2017-09-28 00:19 -0400
        Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 22:09 -0700
          Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 08:31 -0700
            Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 11:53 -0700
              Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 12:16 -0700
                Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 15:51 -0700
                  Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 16:42 -0700
              Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 12:37 -0700
                Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:16 -0700
                  Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 18:39 -0700
                    Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 19:47 -0700
                      Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 20:29 -0700
                        Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 22:36 -0700
                      Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-29 08:47 -0700
                        Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-29 11:23 -0700
                      Re: Lexical Elements Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-09-29 18:27 +0100
          Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 09:13 -0700
      Re: Lexical Elements Richard Damon <Richard@Damon-Family.org> - 2017-09-28 08:15 -0400
    Re: Lexical Elements jameskuyper@verizon.net - 2017-09-27 21:03 -0700
      Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-27 22:16 -0700
        Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 09:45 -0700
          Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 11:58 -0700
            Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 12:29 -0700
              Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 15:52 -0700
                Re: Lexical Elements Joe Pfeiffer <pfeiffer@cs.nmsu.edu> - 2017-09-28 17:40 -0600
                Re: Lexical Elements jameskuyper@verizon.net - 2017-09-28 16:54 -0700
            Re: Lexical Elements Keith Thompson <kst-u@mib.org> - 2017-09-28 12:40 -0700
              Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:12 -0700
            Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-28 21:04 +0100
              Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-28 22:12 +0100
              Re: Lexical Elements David Kleinecke <dkleinecke@gmail.com> - 2017-09-28 16:15 -0700
                Re: Lexical Elements bartc <bc@freeuk.com> - 2017-09-29 01:19 +0100

#120416 — Lexical Elements

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-27 19:03 -0700
Subject	Lexical Elements
Message-ID	<cc7eadf6-6c89-4139-9050-3606d3c0ab01@googlegroups.com>

I am having trouble reading the standard on this one small
point. I will quote the C89 standard but I have checked the
C11 standard and it seems to have exactly the same problem.

The question is - exactly what does the standard mean by
"lexical form" and "lexical elements". There is no
definition within the standard and (apart from a forward
reference in 5.1.1.2) they only appear in 6.4 (6.1
in C89). 

There are expressions like "Each preprocessing token that is
converted to a token shall have the lexical form of a
keyword, an identifier, ...". This says each pp-token has
something called a "lexical form" which seems to have five
values (six in C89). The pp-tokens in my preprocessor are
unsigned ints (and are unchanged as tokens).

I observe that 5.1.1.2 says "decomposed". Perhaps - it
seems - the lexical form is the character string that
originated the token (minus the quotes on strings and
characters). Is this reading correct?

If so the "shall" that every pp-token that becomes a token
must have one of the approved shapes leaves us with possibly
many pp-tokens that don't become tokens. I assume that
what happens to them is more undefined behavior.

#120418

From	"Pascal J. Bourguignon" <pjb@informatimago.com>
Date	2017-09-28 05:33 +0200
Message-ID	<m2o9pv33am.fsf@despina.home>
In reply to	#120416

David Kleinecke <dkleinecke@gmail.com> writes:

> I am having trouble reading the standard on this one small
> point. I will quote the C89 standard but I have checked the
> C11 standard and it seems to have exactly the same problem.
>
> The question is - exactly what does the standard mean by
> "lexical form" and "lexical elements". There is no
> definition within the standard and (apart from a forward
> reference in 5.1.1.2) they only appear in 6.4 (6.1
> in C89). 
>
> There are expressions like "Each preprocessing token that is
> converted to a token shall have the lexical form of a
> keyword, an identifier, ...". This says each pp-token has
> something called a "lexical form" which seems to have five
> values (six in C89). The pp-tokens in my preprocessor are
> unsigned ints (and are unchanged as tokens).

n1124.pdf mentions seven pp-tokens:

    header-name
    identifier
    pp-number
    character-constant
    string-literal
    punctuator
    non-white-space that cannot be one of the above

Assume a pre-processor/compiler that processes UTF-8 sources.  So you
can have a string such as "Cet été était chaud."  However, été still
cannot be an identifier. (You would have to write it as \u00e9t\u00e9).

So été is a pp-token  (non-white-space that cannot be one of the above),
that is not an acceptable lexical form to be converted into a (C) token.

Or perhaps more precisely, since AFAICS, été should be interpreted by
the pre-processor as 3 pp-tokens, é t and é, so perhaps it should be:

    é is a pp-token (non-white-space that cannot be one of the above),
    that is not an acceptable lexical form to be converted into a (C)
    token.

> I observe that 5.1.1.2 says "decomposed". Perhaps - it
> seems - the lexical form is the character string that
> originated the token (minus the quotes on strings and
> characters). Is this reading correct?

in n1124.pdf:

   3. The source file is decomposed into preprocessing tokens6) and
      sequences of white-space characters (including comments). A source
      file shall not end in a partial preprocessing token or in a
      partial comment.

This "decomposed" verb only refers to the pre-processor parsing of the
source text.

> If so the "shall" that every pp-token that becomes a token
> must have one of the approved shapes leaves us with possibly
> many pp-tokens that don't become tokens. I assume that
> what happens to them is more undefined behavior. 

Not really undefined: it will provoke a lexical error with the C
compiler.

Conceptually, we have two scanners
- the pre-processor scanner,
- the C scanner.
and they don't scan the same tokens! 

You can use the pre-processor independently from the C compiler, on
non-C source files.  It will transform text to text, and will leave
alone anything that it doesn't handle (ie. anything it doesn't parse as
pre-processor directives or macros).  So the pp-token categories must
cover all the possible text, including things like "é" that are lexical
errors for a C compiler.

However, when used as a C compiler front-end, we can avoid the
generation of textual output from the pre-processor, and re-scanning the
whole text by the C compiler.  There's then a direct conversion of the
pre-processor pp-token sequence into a C token sequence.  However, this
conversion can be performed only when the pp-token matches the syntax of
a C token. (From a text source, the C compiler would signal a lexical
error; in the case where this conversion occurs, the converter can
signal the lexical error).

Another example, which is explicitly given in n1124.pdf, is the
pp-number 1Ex.  This pp-token doesn't have the lexical form of a C
number token.  So it cannot be converted.

So the "lexical form" is the _pattern_ that corresponds to a C token.
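
To make the 1Ex example concrete, here is a minimal sketch (not part of
the original post; the macro and function names are invented for the
illustration). It relies on the fact that a pp-token only needs the
lexical form of a token if it survives to translation phase 7. Both 1Ex
and 1E are valid pp-numbers that are not valid constants, yet the code
below is fine because the # and ## operators consume them in phase 4,
before any conversion to tokens is attempted.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper macros, named only for this sketch. */
#define STR(x) #x        /* stringize: the argument's spelling becomes a string literal */
#define CAT(a, b) a##b   /* paste: two pp-tokens are joined into one */

/* 1Ex is a single pp-number with no valid form as a constant.  It is
   stringized during macro expansion, so only the string literal "1Ex"
   has to be converted to a token. */
const char *stringized_pp_number(void)
{
    return STR(1Ex);
}

/* 1E is also a pp-number that is not a valid constant, but pasting it
   with 5 yields the pp-number 1E5, which has the lexical form of a
   floating constant and converts normally. */
double pasted_constant(void)
{
    return CAT(1E, 5);
}

/* Small usage check for the two functions above. */
void example_usage(void)
{
    assert(strcmp(stringized_pp_number(), "1Ex") == 0);
    assert(pasted_constant() == 100000.0);
}
```

Writing 1Ex directly in an expression, without the macros, would leave a
pp-number that cannot be converted, which is the constraint violation
discussed in this thread.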

-- 
__Pascal J. Bourguignon
http://www.informatimago.com

#120420

From	James Kuyper <jameskuyper@verizon.net>
Date	2017-09-28 00:19 -0400
Message-ID	<oqht8d$ke3$1@dont-email.me>
In reply to	#120418

On 09/27/2017 11:33 PM, Pascal J. Bourguignon wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
...
> Assume a pre-processor/compiler that processes UTF-8 sources.  So you
> can have a string such as "Cet été était chaud."  However, été still
> cannot be an identifier. (You would have to write it as \u00e9t\u00e9).

A string literal can contain 'any member of the source character set
except the double-quote ", backslash \, or new-line character'
(6.4.5p1). If you can put those characters in a source code file, they
are allowed in a string literal.

>> I observe that 5.1.1.2 says "decomposed". Perhaps - it
>> If so the "shall" that every pp-token that becomes a token
>> must have one of the approved shapes leaves us with possibly
>> many pp-tokens that don't become tokens. I assume that
>> what happens to them is more undefined behavior. 
> 
> Not really undefined: it will provoke a lexical error with the C
> compiler.

A constraint violation, to be precise (6.4p2).

#120422

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-27 22:09 -0700
Message-ID	<e1753783-684f-45a1-8600-e0230fa9ba81@googlegroups.com>
In reply to	#120420

On Wednesday, September 27, 2017 at 9:19:34 PM UTC-7, James Kuyper wrote:
> On 09/27/2017 11:33 PM, Pascal J. Bourguignon wrote:
> > David Kleinecke <dkleinecke@gmail.com> writes:
> ...
> > Assume a pre-processor/compiler that processes UTF-8 sources.  So you
> > can have a string such as "Cet été était chaud."  However, été still
> > cannot be an identifier. (You would have to write it as \u00e9t\u00e9).
> 
> A string literal can contain 'any member of the source character set
> except the double-quote ", backslash \, or new-line character'
> (6.4.5p1). If you can put those characters in a source code file, they
> are allowed in a string literal.
> 
> >> I observe that 5.1.1.2 says "decomposed". Perhaps - it
> >> If so the "shall" that every pp-token that becomes a token
> >> must have one of the approved shapes leaves us with possibly
> >> many pp-tokens that don't become tokens. I assume that
> >> what happens to them is more undefined behavior. 
> > 
> > Not really undefined: it will provoke a lexical error with the C
> > compiler.
> 
> A constraint violation, to be precise (6.4p2).

I think C89 has no concept of "constraint violation" or 
even of "violation". I can be faulted for using C89 of
course but the C89 section 3 does not make any
distinction between kinds of "shall". This seems to
be nit-picking to me.

#120446

From	Keith Thompson <kst-u@mib.org>
Date	2017-09-28 08:31 -0700
Message-ID	<lnmv5eyh4k.fsf@kst-u.example.com>
In reply to	#120422

David Kleinecke <dkleinecke@gmail.com> writes:
[...]
> I think C89 has no concept of "constraint violation" or 
> even of "violation". I can be faulted for using C89 of
> course but the C89 section 3 does not make any
> distinction between kinds of "shall". This seems to
> be nit-picking to me.

C89 certainly does have the concept of "constraint violation".
See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):

    A conforming implementation shall produce at least one diagnostic
    message (identified in an implementation-defined manner) for
    every translation unit that contains a violation of any syntax
    rule or constraint.  Diagnostic messages need not be produced
    in other circumstances.

Search your copy of the standard, or draft, or whatever you have, for
the word "constraint".  There are far too many to list here.

The distinction between a "shall" or "shall not" within or outside a
constraint is discussed in the definition of "undefined behavior", C90
3.16.  Later editions moved that discussion to section 4, Conformance.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

#120476

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 11:53 -0700
Message-ID	<d7b80520-7c0a-461f-99db-1b58dac1ee15@googlegroups.com>
In reply to	#120446

On Thursday, September 28, 2017 at 8:31:32 AM UTC-7, Keith Thompson wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> [...]
> > I think C89 has no concept of "constraint violation" or 
> > even of "violation". I can be faulted for using C89 of
> > course but the C89 section 3 does not make any
> > distinction between kinds of "shall". This seems to
> > be nit-picking to me.
> 
> C89 certainly does have the concept of "constraint violation".
> See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
> 
>     A conforming implementation shall produce at least one diagnostic
>     message (identified in an implementation-defined manner) for
>     every translation unit that contains a violation of any syntax
>     rule or constraint.  Diagnostic messages need not be produced
>     in other circumstances.

This is a nit almost too small to pick. It's 5.1.1.3 in C90. I
suspect you were looking at the pre-boilerplate version (when the
first three sections were added.) If so I indeed made an error.
I should have said C90 not C89.

Note that it does not differentiate between which kind of violation
occurred. I read "violation" here as the ordinary English word (it
is not in the index) and not a C technical term.

I think that this and other passages strongly suggest that I should
consider syntax and constraints together as an integrated system if
I want to get into the spirit of C. There is no a priori reason for
doing so and my old compiler integrated the constraints with code
generation rather than syntax. I am trying to move the constraints
over to the same processing as the syntax and that's where I 
stumbled over the concept of "lexical form".

#120480

From	jameskuyper@verizon.net
Date	2017-09-28 12:16 -0700
Message-ID	<eaed35b6-a83e-49e9-b2af-f9eff330ed56@googlegroups.com>
In reply to	#120476

On Thursday, September 28, 2017 at 2:53:59 PM UTC-4, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 8:31:32 AM UTC-7, Keith Thompson wrote:
> > David Kleinecke <dkleinecke@gmail.com> writes:
> > [...]
> > > I think C89 has no concept of "constraint violation" or 
> > > even of "violation". I can be faulted for using C89 of
> > > course but the C89 section 3 does not make any
> > > distinction between kinds of "shall". This seems to
> > > be nit-picking to me.
> > 
> > C89 certainly does have the concept of "constraint violation".
> > See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
> > 
> >     A conforming implementation shall produce at least one diagnostic
> >     message (identified in an implementation-defined manner) for
> >     every translation unit that contains a violation of any syntax
> >     rule or constraint.  Diagnostic messages need not be produced
> >     in other circumstances.
>  
> This is a nit almost too small to pick. It's 5.1.1.3 in C90. I
> suspect you were looking at the pre-boilerplate version (when the
> first three sections were added.) If so I indeed made an error.
> I should have said C90 not C89.

All of the relevant wording is the same in C89 and C90 - only the section
numbers are different. Therefore, the fact that what you said is false about
C89 means that it is also false about C90.

> Note that it does not differentiate between which kind of violation
> occurred. I read "violation" here as the ordinary English word (it
> is not in the index) and not a C technical term.

In particular, the wording that does distinguish between "shall" when it occurs
in a constraint, and "shall" when it appears in other parts of the standard, is
exactly the same:

"If a ``shall'' or ``shall not'' requirement that appears outside of
   a constraint is violated, the behavior is undefined. ..."

Please acknowledge the fact that it says "outside of a constraint" - you seem to
be claiming that there's no such wording, despite it having been cited to you
several times.

> I think that this and other passages strongly suggest that I should
> consider syntax and constraints together as an integrated system if
> I want to get into the spirit of C.

Yes, the C standard says exactly the same thing about violations of syntax rules
that it does about violations of constraints: "A conforming implementation shall
produce at least one diagnostic message ...". Many constraint violations could
be converted into syntax errors by making the grammar substantially more
complicated. Many syntax errors could have  been described as constraints, at
the cost of making the grammar significantly less meaningful.

#120504

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 15:51 -0700
Message-ID	<c31f4a50-807c-46b1-86e7-928013546880@googlegroups.com>
In reply to	#120480

On Thursday, September 28, 2017 at 12:16:52 PM UTC-7, james...@verizon.net wrote:
> On Thursday, September 28, 2017 at 2:53:59 PM UTC-4, David Kleinecke wrote:
> > On Thursday, September 28, 2017 at 8:31:32 AM UTC-7, Keith Thompson wrote:
> > > David Kleinecke <dkleinecke@gmail.com> writes:
> > > [...]
> > > > I think C89 has no concept of "constraint violation" or 
> > > > even of "violation". I can be faulted for using C89 of
> > > > course but the C89 section 3 does not make any
> > > > distinction between kinds of "shall". This seems to
> > > > be nit-picking to me.
> > > 
> > > C89 certainly does have the concept of "constraint violation".
> > > See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
> > > 
> > >     A conforming implementation shall produce at least one diagnostic
> > >     message (identified in an implementation-defined manner) for
> > >     every translation unit that contains a violation of any syntax
> > >     rule or constraint.  Diagnostic messages need not be produced
> > >     in other circumstances.
> >  
> > This is a nit almost too small to pick. It's 5.1.1.3 in C90. I
> > suspect you were looking at the pre-boilerplate version (when the
> > first three sections were added.) If so I indeed made an error.
> > I should have said C90 not C89.
> 
> All of the relevant wording is the same in C89 and C90 - only the section
> numbers are different. Therefore, the fact that what you said is false about
> C89 means that it is also false about C90.
> 
> > Note that it does not differentiate between which kind of violation
> > occurred. I read "violation" here as the ordinary English word (it
> > is not in the index) and not a C technical term.
> 
> In particular, the wording that does distinguish between "shall" when it occurs
> in a constraint, and "shall" when it appears in other parts of the standard, is
> exactly the same:
> 
> "If a ``shall'' or ``shall not'' requirement that appears outside of
>    a constraint is violated, the behavior is undefined. ..."
> 
> Please acknowledge the fact that it says "outside of a constraint" - you seem to
> be claiming that there's no such wording, despite it having been cited to you
> several times.
> 
> > I think that this and other passages strongly suggest that I should
> > consider syntax and constraints together as an integrated system if
> > I want to get into the spirit of C.
> 
> Yes, the C standard says exactly the same thing about violations of syntax rules
> that it does about violations of constraints: "A conforming implementation shall
> produce at least one diagnostic message ...". Many constraint violations could
> be converted into syntax errors by making the grammar substantially more
> complicated. Many syntax errors could have  been described as constraints, at
> the cost of making the grammar significantly less meaningful.

I see the wording you refer to (3.16 in my copy of C90) but
all it seems to me to say is that a constraint violation might
not be undefined behavior. But the next sentence seems to me
to say that is undefined behavior. It's hard to see how it 
could fail to be undefined behavior. In my opinion this is all
just sloppy wording created when the first three sections
were added and there is nothing special about constraint
violations.

#120512

From	jameskuyper@verizon.net
Date	2017-09-28 16:42 -0700
Message-ID	<f0abdabf-19a6-4a10-b462-9f67b3628b15@googlegroups.com>
In reply to	#120504

On Thursday, September 28, 2017 at 6:51:27 PM UTC-4, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 12:16:52 PM UTC-7, james...@verizon.net wrote:
...
> > "If a ``shall'' or ``shall not'' requirement that appears outside of
> >    a constraint is violated, the behavior is undefined. ..."
...
> I see the wording you refer to (3.16 in my copy of C90) but
> all it seems to me to say is that a constraint violation might
> not be undefined behavior.

That says nothing one way or the other about the constraint violations, it only talks about things that are NOT constraint violations.

> ... But the next sentence seems to me
> to say that is undefined behavior.

The sentence I've quoted above is part of the definition of "undefined behavior", and identifies one of the three ways the C standard marks something as undefined behavior. The "next sentence" you refer to lists the other two ways: "Undefined behavior is otherwise indicated in this Standard by the words ``undefined behavior'' or by the omission of any explicit definition of behavior.". Neither of those ways is specific to constraint violations. The committee intends that every list of options provided by the C standard be exhaustive - any case where that is not true constitutes a defect in the standard. It follows that only things identified as having undefined behavior in one of these three ways, have undefined behavior.

I believe that it's the case that no constraint rule specifies "undefined behavior" explicitly, so the second case wouldn't apply. However, it also happens to be the case that the standard never provides an explicit definition of the behavior when a constraint is violated. I never realized this myself, I had to have someone else point it out to me, after which I checked to make sure. It's not an inherent feature of constraint rules, it's simply something that happens to be true about all current constraint rules, and could be false for the very next such rule that the committee creates or modifies.

Therefore, after generating the required diagnostic, if an implementation chooses to continue translating the program, and if the user chooses to execute the translated program, the behavior is indeed undefined. But it's not ONLY undefined - a diagnostic is also required.

>  It's hard to see how it 
> could fail to be undefined behavior. In my opinion this is all
> just sloppy wording created when the first three sections
> were added ... 

I don't have a copy of C90, only C89. All of the text I quoted for you is from C89, prior to the addition of those three sections.

> ... and there is nothing special about constraint
> violations.

There is something very special about constraint violations. Along with syntax rule violations, they are the only things to which the following requirement applies: "A conforming implementation shall produce at least one diagnostic message ...". (C89 2.1.1.3; 5.1.1.3 in C90)
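
For anyone structuring a compiler around these rules, the distinction
can be summarized in a few lines. This is only a sketch: the enum and
function names are invented for illustration and appear nowhere in the
standard. It encodes the one point made above, that syntax-rule and
constraint violations require at least one diagnostic, while a violated
"shall" outside a constraint does not.

```c
#include <assert.h>

/* Kinds of rule a translation unit can break (illustrative names). */
enum violation_kind {
    VIOLATES_SYNTAX_RULE,              /* a grammar production */
    VIOLATES_CONSTRAINT,               /* a "Constraints" paragraph */
    VIOLATES_SHALL_OUTSIDE_CONSTRAINT  /* "shall"/"shall not" elsewhere */
};

/* Nonzero when a conforming implementation must issue at least one
   diagnostic (C89 2.1.1.3, numbered 5.1.1.3 in C90). */
int diagnostic_required(enum violation_kind kind)
{
    return kind == VIOLATES_SYNTAX_RULE || kind == VIOLATES_CONSTRAINT;
}

/* Small usage check for the function above. */
void example_usage(void)
{
    assert(diagnostic_required(VIOLATES_SYNTAX_RULE));
    assert(diagnostic_required(VIOLATES_CONSTRAINT));
    /* Undefined behavior, but no diagnostic is required. */
    assert(!diagnostic_required(VIOLATES_SHALL_OUTSIDE_CONSTRAINT));
}
```

The sketch says nothing about what happens after the diagnostic; as
argued above, the behavior of a program translated despite a constraint
violation is still undefined.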

#120482

From	Keith Thompson <kst-u@mib.org>
Date	2017-09-28 12:37 -0700
Message-ID	<lnk20iwr59.fsf@kst-u.example.com>
In reply to	#120476

David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 8:31:32 AM UTC-7, Keith Thompson wrote:
>> David Kleinecke <dkleinecke@gmail.com> writes:
>> [...]
>> > I think C89 has no concept of "constraint violation" or 
>> > even of "violation". I can be faulted for using C89 of
>> > course but the C89 section 3 does not make any
>> > distinction between kinds of "shall". This seems to
>> > be nit-picking to me.
>> 
>> C89 certainly does have the concept of "constraint violation".
>> See for example C89 2.1.1.3 Diagnostics (5.1.1.3 in C90):
>> 
>>     A conforming implementation shall produce at least one diagnostic
>>     message (identified in an implementation-defined manner) for
>>     every translation unit that contains a violation of any syntax
>>     rule or constraint.  Diagnostic messages need not be produced
>>     in other circumstances.
>  
> This is a nit almost too small to pick. It's 5.1.1.3 in C90. I
> suspect you were looking at the pre-boilerplate version (when the
> first three sections were added.) If so I indeed made an error.
> I should have said C90 not C89.

I wouldn't say it's *almost* too small to pick.  I already quoted the
section numbers for both C89 and C90.

> Note that it does not differentiate between which kind of violation
> occurred. I read "violation" here as the ordinary English word (it
> is not in the index) and not a C technical term.

Correct.  The syntax used by a compiler's parser needn't be 100%
consistent with the language grammar defined by the standard, as long as
correct code is parsed and analyzed correctly and required diagnostics
are issued.

> I think that this and other passages strongly suggest that I should
> consider syntax and constraints together as an integrated system if
> I want to get into the spirit of C. There is no a priori reason for
> doing so and my old compiler integrated the constraints with code
> generation rather than syntax. I am trying to move the constraints
> over to the same processing as the syntax and that's where I 
> stumbled over the concept of "lexical form".

I don't know how you'd treat them as an "integrated system", though I
suppose it depends on just what you mean by that.

A compiler typically treats syntax and semantics separately.  Syntactic
analysis, or parsing, is often semi-automated, with the parser generated
programmatically from a formal grammar -- or it might be written manually.
If parsing fails, that's a syntax error.  Other errors, like trying to
apply a shift operator to a pointer value, have to be detected during
semantic analysis.  As far as typical compiler internals are concerned,
they're fundamentally different kinds of errors.  Tweaking the grammar
to treat some syntax errors as semantic errors can enable better
diagnostics in some cases.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

#120509

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 16:16 -0700
Message-ID	<d1352da5-c14d-445d-953a-b623686873f2@googlegroups.com>
In reply to	#120482

On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
> 
> A compiler typically treats syntax and semantics separately.  Syntactic
> analysis, or parsing, is often semi-automated, with the parser generated
> programmatically from a formal grammar -- or it might be written manually.
> If parsing fails, that's a syntax error.  Other errors, like trying to
> apply a shift operator to a pointer value, have to be detected during
> semantic analysis.  As far as typical compiler internals are concerned,
> they're fundamentally different kinds of errors.  Tweaking the grammar
> to treat some syntax errors as semantic errors can enable better
> diagnostics in some cases.

My compilers also treat syntax and semantics separately. The
question is whether the constraints are part of syntax or part
of semantics. 

You imply that some are and some aren't. Quite possible
although I have no examples. I have concluded that what
I called the spirit of C expects them with the syntax but
I had them coded with the semantics. I have only begun
moving the constraints and haven't yet found any that
don't fit rather easily into the syntax. 

My thinking about syntax is strongly influenced by the
tagmemic linguistics of Kenneth Pike and others.

#120515

From	Keith Thompson <kst-u@mib.org>
Date	2017-09-28 18:39 -0700
Message-ID	<lning2uvtc.fsf@kst-u.example.com>
In reply to	#120509

David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
>> 
>> A compiler typically treats syntax and semantics separately.  Syntactic
>> analysis, or parsing, is often semi-automated, with the parser generated
>> programmatically from a formal grammar -- or it might be written manually.
>> If parsing fails, that's a syntax error.  Other errors, like trying to
>> apply a shift operator to a pointer value, have to be detected during
>> semantic analysis.  As far as typical compiler internals are concerned,
>> they're fundamentally different kinds of errors.  Tweaking the grammar
>> to treat some syntax errors as semantic errors can enable better
>> diagnostics in some cases.
>  
> My compilers also treat syntax and semantics separately. The
> question is whether the constraints are part of syntax or part
> of semantics. 

Semantics, I'd say.

> You imply that some are and some aren't.

Did I?

>                                          Quite possible
> although I have no examples. I have concluded that what
> I called the spirit of C expects them with the syntax but
> I had them coded with the semantics. I have only begun
> moving the constraints and haven't yet found any that
> don't fit rather easily into the syntax. 

I suspect I really don't understand what you're saying.

Here's an example.  C11 6.8.4.2p1 specifies the following constraint:

    The controlling expression of a switch statement shall have integer
    type.

Unless you extend the meaning of "syntax" beyond (my) recognition, you
won't be able to enforce that constraint using only syntax information.
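
As a rough sketch of why this check belongs to semantic analysis (not
part of the original post; the type model, the node layout and all names
are hypothetical): the grammar accepts any expression between the
parentheses of a switch, so something like switch (1.5) parses cleanly,
and the constraint can only be tested once the expression's type is
known.

```c
#include <assert.h>
#include <stddef.h>

/* A deliberately tiny type model; a real compiler tracks far more. */
enum type_kind { TY_CHAR, TY_INT, TY_UNSIGNED_LONG, TY_DOUBLE, TY_POINTER };

/* Hypothetical AST node for an expression that has already been parsed.
   The parser builds the node; the type is filled in by a later pass. */
struct expr {
    enum type_kind type;
};

static int is_integer_type(enum type_kind kind)
{
    return kind == TY_CHAR || kind == TY_INT || kind == TY_UNSIGNED_LONG;
}

/* Returns NULL if the constraint is satisfied, otherwise the text of a
   diagnostic for the violation. */
const char *check_switch_controlling_expr(const struct expr *e)
{
    if (!is_integer_type(e->type))
        return "controlling expression of switch does not have integer type";
    return NULL;
}

/* Small usage check for the function above. */
void example_usage(void)
{
    struct expr ok = { TY_INT };
    struct expr bad = { TY_DOUBLE };

    assert(check_switch_controlling_expr(&ok) == NULL);
    assert(check_switch_controlling_expr(&bad) != NULL);
}
```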

Roughly, a syntax error is a failure to parse the source code in
accordance with a grammar that can be defined, for example, in BNF
(Backus-Naur Form).  (C's treatment of typedefs punches a small hole in
this model.)  Any errors that are not syntax errors are what I think of
as semantic errors.  In C, this is more or less expressed as syntax
rules vs. constraints.
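
The typedef remark can be illustrated with a small sketch (again not
from the original post; the table and function names are invented). The
statement "T * x;" declares a pointer if T names a type and is a
multiplication expression otherwise, so a parser has to consult the
declarations seen so far before it can choose a grammar production.

```c
#include <assert.h>
#include <string.h>

enum ident_class { ORDINARY_IDENTIFIER, TYPEDEF_NAME };

/* Stand-in for the symbol table a real compiler would maintain: the
   names introduced by typedef declarations seen so far. */
static const char *const typedef_names[] = { "size_t", "T" };

/* Classify an identifier spelling.  The grammar alone cannot do this;
   the answer depends on earlier declarations. */
enum ident_class classify_identifier(const char *spelling)
{
    size_t i;

    for (i = 0; i < sizeof typedef_names / sizeof typedef_names[0]; i++) {
        if (strcmp(spelling, typedef_names[i]) == 0)
            return TYPEDEF_NAME;
    }
    return ORDINARY_IDENTIFIER;
}

/* Small usage check for the function above. */
void example_usage(void)
{
    /* With T in the table, "T * x;" has to be read as a declaration. */
    assert(classify_identifier("T") == TYPEDEF_NAME);
    /* x was never declared as a typedef name, so it stays ordinary. */
    assert(classify_identifier("x") == ORDINARY_IDENTIFIER);
}
```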

[...]

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

#120517

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 19:47 -0700
Message-ID	<2d14fe25-9390-4711-b8e2-1cdefc3d56bf@googlegroups.com>
In reply to	#120515

On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> > On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
> >> 
> >> A compiler typically treats syntax and semantics separately.  Syntactic
> >> analysis, or parsing, is often semi-automated, with the parser generated
> >> programmatically from a formal grammar -- or it might be written manually.
> >> If parsing fails, that's a syntax error.  Other errors, like trying to
> >> apply a shift operator to a pointer value, have to be detected during
> >> semantic analysis.  As far as typical compiler internals are concerned,
> >> they're fundamentally different kinds of errors.  Tweaking the grammar
> >> to treat some syntax errors as semantic errors can enable better
> >> diagnostics in some cases.
> >  
> > My compilers also treat syntax and semantics separately. The
> > question is whether the constraints are part of syntax or part
> > of semantics. 
> 
> Semantics, I'd say.
> 
> > You imply that some are and some aren't.
> 
> Did I?
> 
> >                                          Quite possible
> > although I have no examples. I have concluded that what
> > I called the spirit of C expects them with the syntax but
> > I had them coded with the semantics. I have only begun
> > moving the constraints and haven't yet found any that
> > don't fit rather easily into the syntax. 
> 
> I suspect I really don't understand what you're saying.
> 
> Here's an example.  C11 6.8.4.2p1 specifies the following constraint:
> 
>     The controlling expression of a switch statement shall have integer
>     type.
> 
> Unless you extend the meaning of "syntax" beyond (my) recognition, you
> won't be able to enforce that constraint using only syntax information.
> 
> Roughly, a syntax error is a failure to parse the source code in
> accordance with a grammar that can be defined, for example, in BNF
> (Backus-Naur Form).  (C's treatment of typedefs punches a small hole in
> this model.)  Any errors that are not syntax errors are what I think of
> as semantic errors.  In C, this is more or less expressed as syntax
> rules vs. constraints.

I understand syntax, perhaps, like a linguist would. In
all the linguistic work I know what fills a slot (a slot
like the identifier slot in a switch statement) can be
sub-categorized to a subset of all identifiers. In this
case the identifier must have the attribute "integer".
The token (already identified as an identifier) is further
sub-categorized by being declared, for example, an "int".

The source of my concern about what a token actually is
comes from this accumulation of additional attributes -
which would include constant and volatile as well as
static/extern.

#120520

From	jameskuyper@verizon.net
Date	2017-09-28 20:29 -0700
Message-ID	<9ab07b89-0566-4dab-8934-fee01c772883@googlegroups.com>
In reply to	#120517

On Thursday, September 28, 2017 at 10:48:03 PM UTC-4, David Kleinecke wrote:
> On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
> > David Kleinecke <dkleinecke@gmail.com> writes:
> > > On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
> > >> 
> > >> A compiler typically treats syntax and semantics separately.  Syntactic
> > >> analysis, or parsing, is often semi-automated, with the parser generated
> > >> programmatically from a formal grammar -- or it might be written manually.
> > >> If parsing fails, that's a syntax error.  Other errors, like trying to
> > >> apply a shift operator to a pointer value, have to be detected during
> > >> semantic analysis.  As far as typical compiler internals are concerned,
> > >> they're fundamentally different kinds of errors.  Tweaking the grammar
> > >> to treat some syntax errors as semantic errors can enable better
> > >> diagnostics in some cases.
> > >  
> > > My compilers also treat syntax and semantics separately. The
> > > question is whether the constraints are part of syntax or part
> > > of semantics. 
> > 
> > Semantics, I'd say.
> > 
> > > You imply that some are and some aren't.
> > 
> > Did I?
> > 
> > >                                          Quite possible
> > > although I have no examples. I have concluded that what
> > > I called the spirit of C expects them with the syntax but
> > > I had them coded with the semantics. I have only begun
> > > moving the constraints and haven't yet found any that
> > > don't fit rather easily into the syntax. 
> > 
> > > I suspect I really don't understand what you're saying.
> > 
> > Here's an example.  C11 6.8.4.2p1 specifies the following constraint:
> > 
> >     The controlling expression of a switch statement shall have integer
> >     type.
> > 
> > Unless you extend the meaning of "syntax" beyond (my) recognition, you
> > won't be able to enforce that constraint using only syntax information.
> > 
> > Roughly, a syntax error is a failure to parse the source code in
> > accordance with a grammar that can be defined, for example, in BNF
> > (Backus-Naur Form).  (C's treatment of typedefs punches a small hole in
> > this model.)  Any errors that are not syntax errors are what I think of
> > as semantic errors.  In C, this is more or less expressed as syntax
> > rules vs. constraints.
> 
> I understand syntax, perhaps, like a linguist would. In
> all the linguistic work I know what fills a slot (a slot
> like the identifier slot in a switch statement) can be
> sub-categorized to a subset of all identifiers. In this
> case the identifier must have the attribute "integer".
> The token (already identified as an identifier) is further
> sub-categorized by being declared, for example, an "int".
> 
> The source of my concern about what a token actually is
> comes from this accumulation of additional attributes -
> which would include constant and volatile as well as
> static/extern.

An identifier token identifies something. The thing that the identifier
identifies can have all kinds of attributes - which should be stored in the data
structure which corresponds to the thing which the identifier identifies. The
same identifier may be used in many different places within the same scope and
name space; each of those occurrences is a different token - so the attributes
you're talking about shouldn't be associated with those tokens.

Consider "i = i*i;" That statement consists of 6 different tokens, three of
which are identically spelled and identify the same object. Information about
that object should be stored in whatever structure you use to represent that
object, not in the three different structures you use to represent those three
tokens. I'm using "structure" here in a generic sense, not a C sense. In your
compiler the structure that represents a token is apparently a single integer
containing a code value. The information about the object identified by that
token (presumably including the spelling of that identifier) is stored in what
you called a "token data record".
That is why I would prefer to call it an object data record. An identifier token
can also identify "a function; a tag or a member of a structure, union, or
enumeration; a typedef name; a label name". The information you need to store is
quite different in each of those cases.
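
A rough sketch of that separation, under loud assumptions: a single scope
and a single name space (so spelling alone picks out the object), identifiers
and one-character punctuators only, and hypothetical names throughout
(object_record, token, intern, tokenize).  It is not a description of
anyone's actual compiler.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum { MAX_RECORDS = 16, MAX_NAME = 32 };

/* One record per identified thing; attributes live here, not in tokens. */
struct object_record {
    char name[MAX_NAME];
    int is_const;            /* example attribute, set by a later pass */
};

enum token_kind { TK_IDENT, TK_PUNCT };

/* One of these per occurrence in the source text. */
struct token {
    enum token_kind kind;
    int record;              /* index into records[] for TK_IDENT, else -1 */
    char punct;              /* the character for TK_PUNCT, else 0 */
};

static struct object_record records[MAX_RECORDS];
static int record_count;

/* Find the record for a spelling, creating it on first sight. */
static int intern(const char *s, size_t len)
{
    assert(len > 0 && len < MAX_NAME);
    for (int i = 0; i < record_count; i++)
        if (strlen(records[i].name) == len &&
            memcmp(records[i].name, s, len) == 0)
            return i;
    assert(record_count < MAX_RECORDS);
    memcpy(records[record_count].name, s, len);
    records[record_count].name[len] = '\0';
    records[record_count].is_const = 0;
    return record_count++;
}

/* Split src into tokens; returns the number of tokens written to out. */
static int tokenize(const char *src, struct token *out, int max)
{
    int n = 0;
    while (*src != '\0') {
        if (isspace((unsigned char)*src)) { src++; continue; }
        assert(n < max);
        if (isalpha((unsigned char)*src) || *src == '_') {
            const char *start = src;
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
            out[n].kind = TK_IDENT;
            out[n].record = intern(start, (size_t)(src - start));
            out[n].punct = 0;
        } else {
            out[n].kind = TK_PUNCT;
            out[n].record = -1;
            out[n].punct = *src++;
        }
        n++;
    }
    return n;
}
```

Calling tokenize("i = i*i;", buf, 8) on a fresh table yields 6 tokens and
exactly one object record; the three identifier tokens all carry the same
record index.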

#120524

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-28 22:36 -0700
Message-ID	<018dcabf-62cd-4567-b053-555eb4f6c76b@googlegroups.com>
In reply to	#120520

On Thursday, September 28, 2017 at 8:29:57 PM UTC-7, james...@verizon.net wrote:
> On Thursday, September 28, 2017 at 10:48:03 PM UTC-4, David Kleinecke wrote:
> > On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
> > > David Kleinecke <dkleinecke@gmail.com> writes:
> > > > On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
> > > >> 
> > > >> A compiler typically treats syntax and semantics separately.  Syntactic
> > > >> analysis, or parsing, is often semi-automated, with the parser generated
> > > >> programmatically from a formal grammar -- or it might be written manually.
> > > >> If parsing fails, that's a syntax error.  Other errors, like trying to
> > > >> apply a shift operator to a pointer value, have to be detected during
> > > >> semantic analysis.  As far as typical compiler internals are concerned,
> > > >> they're fundamentally different kinds of errors.  Tweaking the grammar
> > > >> to treat some syntax errors as semantic errors can enable better
> > > >> diagnostics in some cases.
> > > >  
> > > > My compilers also treat syntax and semantics separately. The
> > > > question is whether the constraints are part of syntax or part
> > > > of semantics. 
> > > 
> > > Semantics, I'd say.
> > > 
> > > > You imply that some are and some aren't.
> > > 
> > > Did I?
> > > 
> > > >                                          Quite possible
> > > > although I have no examples. I have concluded that what
> > > > I called the spirit of C expects them with the syntax but
> > > > I had them coded with the semantics. I have only begun
> > > > moving the constraints and haven't yet found any that
> > > > don't fit rather easily into the syntax. 
> > > 
> > > > I suspect I really don't understand what you're saying.
> > > 
> > > Here's an example.  C11 6.8.4.2p1 specifies the following constraint:
> > > 
> > >     The controlling expression of a switch statement shall have integer
> > >     type.
> > > 
> > > Unless you extend the meaning of "syntax" beyond (my) recognition, you
> > > won't be able to enforce that constraint using only syntax information.
> > > 
> > > Roughly, a syntax error is a failure to parse the source code in
> > > accordance with a grammar that can be defined, for example, in BNF
> > > (Backus-Naur Form).  (C's treatment of typedefs punches a small hole in
> > > this model.)  Any errors that are not syntax errors are what I think of
> > > as semantic errors.  In C, this is more or less expressed as syntax
> > > rules vs. constraints.
> > 
> > I understand syntax, perhaps, like a linguist would. In
> > all the linguistic work I know what fills a slot (a slot
> > like the identifier slot in a switch statement) can be
> > sub-categorized to a subset of all identifiers. In this
> > case the identifier must have the attribute "integer".
> > The token (already identified as an identifier) is further
> > sub-categorized by being declared, for example, an "int".
> > 
> > The source of my concern about what a token actually is
> > comes from this accumulation of additional attributes -
> > which would include constant and volatile as well as
> > static/extern.
> 
> An identifier token identifies something. The thing that the identifier
> identifies can have all kinds of attributes - which should be stored in the data
> structure which corresponds to the thing which the identifier identifies. The
> same identifier may be used in many different places within the same scope and
> name space; each of those occurrences is a different token - so the attributes
> you're talking about shouldn't be associated with those tokens.
> 
> Consider "i = i*i;" That statement consists of 6 different tokens, three of
> which are identically spelled and identify the same object. Information about
> that object should be stored in whatever structure you use to represent that
> object, not in the three different structures you use to represent those three
> tokens. I'm using "structure" here in a generic sense, not a C sense. In your
> compiler the structure that represents a token is apparently a single integer
> containing a code value. The information about the object identified by that
> token (presumably including the spelling of that identifier) is stored in what
> you called a "token data record".
> That is why I would prefer to call it an object data record. An identifier token
> can also identify "a function; a tag or a member of a structure, union, or
> enumeration; a typedef name; a label name". The information you need to store is
> quite different in each of those cases.

As nearly as I can tell we are both saying the same thing. I
don't see why you insist on "correcting" me.

#120546

From	Keith Thompson <kst-u@mib.org>
Date	2017-09-29 08:47 -0700
Message-ID	<lna81dv75a.fsf@kst-u.example.com>
In reply to	#120517

David Kleinecke <dkleinecke@gmail.com> writes:
> On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
[...]
>> I suspect I really don't understand what you're saying.
>> 
>> Here's an example.  C11 6.8.4.2p1 specifies the following constraint:
>> 
>>     The controlling expression of a switch statement shall have integer
>>     type.
>> 
>> Unless you extend the meaning of "syntax" beyond (my) recognition, you
>> won't be able to enforce that constraint using only syntax information.
>> 
>> Roughly, a syntax error is a failure to parse the source code in
>> accordance with a grammar that can be defined, for example, in BNF
>> (Backus-Naur Form).  (C's treatment of typedefs punches a small hole in
>> this model.)  Any errors that are not syntax errors are what I think of
>> as semantic errors.  In C, this is more or less expressed as syntax
>> rules vs. constraints.
>
> I understand syntax, perhaps, like a linguist would. In
> all the linguistic work I know what fills a slot (a slot
> like the identifier slot in a switch statement) can be
> sub-categorized to a subset of all identifiers. In this
> case the identifier must have the attribute "integer".
> The token (already identified as an identifier) is further
> sub-categorized by being declared, for example, an "int".

I'm not a linguist, but I suspect that you're looking at syntax in a way
that's not useful for analyzing C.

There is no "identifier slot" in a switch statement.  What follows the
"switch" keyword is an expression.  That expression can be arbitrarily
complex, but it must be of some integer type.  The attribute "integer",
if you have such a thing, would need to apply to (your internal
representation of) that expression, not to some identifier.

> The source of my concern about what a token actually is
> comes from this accumulation of additional attributes -
> which would include constant and volatile as well as
> static/extern. 

In a typical compiler design, input is split into tokens by the "lexer".
Each token is derived from a sequence of characters in the source.  One
kind of token is an identifier.  At the token level, an identifier has
no type and does not refer to any declaration; those concepts occur
later in processing.  All you need to know about it at that point is
that it's an identifier and how it's spelled.

The stream of tokens is consumed by the parser, which does all the
syntactic analysis.  The parser might build some data structure that
reflects the syntax (declarations, statements, function definitions,
etc.).  That structure may or may not use the token structures built by
the lexer, but it will at least need to annotate them with extra
information.

But as for const/volatile/static/extern, those are attributes that
should apply to a declaration, not to an identifier.  The parser would
figure out, for example, which declaration a given occurrence of an
identifier refers to.
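
As an illustration only, the lookup from an occurrence of an identifier to
its declaration might be sketched like this.  The names (struct declaration,
declare, leave_scope, resolve) are hypothetical, scopes are modelled as a
simple stack, and name spaces are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum storage { ST_NONE, ST_STATIC, ST_EXTERN };

/* Attributes belong to the declaration, not to any identifier token. */
struct declaration {
    const char *name;
    int scope_depth;         /* 0 = file scope, 1 and up = nested blocks */
    int is_const;
    int is_volatile;
    enum storage storage;
};

enum { MAX_DECLS = 32 };
static struct declaration decls[MAX_DECLS];   /* used as a stack */
static int decl_count;

/* Record a declaration made in the scope at the given depth. */
static void declare(const char *name, int depth, int is_const,
                    int is_volatile, enum storage st)
{
    assert(decl_count < MAX_DECLS);
    struct declaration *d = &decls[decl_count++];
    d->name = name;
    d->scope_depth = depth;
    d->is_const = is_const;
    d->is_volatile = is_volatile;
    d->storage = st;
}

/* Drop every declaration made at this depth or deeper. */
static void leave_scope(int depth)
{
    while (decl_count > 0 && decls[decl_count - 1].scope_depth >= depth)
        decl_count--;
}

/* Map one occurrence of an identifier (known only by its spelling) to the
   innermost declaration still visible, or NULL if there is none. */
static const struct declaration *resolve(const char *spelling)
{
    for (int i = decl_count - 1; i >= 0; i--)
        if (strcmp(decls[i].name, spelling) == 0)
            return &decls[i];
    return NULL;
}
```

With a file-scope "static const" x and a block-scope "volatile" x, resolve("x")
finds the inner declaration while the block is open and the outer one after
leave_scope(1).  The token itself never changes; only what it resolves to does.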

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

#120565

From	David Kleinecke <dkleinecke@gmail.com>
Date	2017-09-29 11:23 -0700
Message-ID	<12e83aa5-0e76-4085-a898-d2f6858d05f0@googlegroups.com>
In reply to	#120546

On Friday, September 29, 2017 at 8:47:46 AM UTC-7, Keith Thompson wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> > On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
> [...]
> >> I suspect I really don't understand what you're saying.
> >> 
> >> Here's an example.  C11 6.8.4.2p1 specifies the following constraint:
> >> 
> >>     The controlling expression of a switch statement shall have integer
> >>     type.
> >> 
> >> Unless you extend the meaning of "syntax" beyond (my) recognition, you
> >> won't be able to enforce that constraint using only syntax information.
> >> 
> >> Roughly, a syntax error is a failure to parse the source code in
> >> accordance with a grammar that can be defined, for example, in BNF
> >> (Backus-Naur Form).  (C's treatment of typedefs punches a small hole in
> >> this model.)  Any errors that are not syntax errors are what I think of
> >> as semantic errors.  In C, this is more or less expressed as syntax
> >> rules vs. constraints.
> >
> > I understand syntax, perhaps, like a linguist would. In
> > all the linguistic work I know what fills a slot (a slot
> > like the identifier slot in a switch statement) can be
> > sub-categorized to a subset of all identifiers. In this
> > case the identifier must have the attribute "integer".
> > The token (already identified as an identifier) is further
> > sub-categorized by being declared, for example, an "int".
> 
> I'm not a linguist, but I suspect that you're looking at syntax in a way
> that's not useful for analyzing C.
> 
> There is no "identifier slot" in a switch statement.  What follows the
> "switch" keyword is an expression.  That expression can be arbitrarily
> complex, but it must be of some integer type.  The attribute "integer",
> if you have such a thing, would need to apply to (your internal
> representation of) that expression, not to some identifier.
> 
> > The source of my concern about what a token actually is
> > comes from this accumulation of additional attributes -
> > which would include constant and volatile as well as
> > static/extern. 
> 
> In a typical compiler design, input is split into tokens by the "lexer".
> Each token is derived from a sequence of characters in the source.  One
> kind of token is an identifier.  At the token level, an identifier has
> no type and does not refer to any declaration; those concepts occur
> later in processing.  All you need to know about it at that point is
> that it's an identifier and how it's spelled.
> 
> The stream of tokens is consumed by the parser, which does all the
> syntactic analysis.  The parser might build some data structure that
> reflects the syntax (declarations, statements, function definitions,
> etc.).  That structure may or may not use the token structures built by
> the lexer, but it will at least need to annotate them with extra
> information.
> 
> But as for const/volatile/static/extern, those are attributes that
> should apply to a declaration, not to an identifier.  The parser would
> figure out, for example, which declaration a given occurrence of an
> identifier refers to.
 
I am, of course, quite aware of the usual way of doing things 
and BISON and so on. But I have decided that that is not the
best way to go - best in the sense of being able to grasp what
is going on (and not efficiency or speed or ...).

I do not show the parser any idents. All the parser ever sees
is what I call tokens. In the current case these are integer
indexes into an array of data (of diverse kinds).
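
For readers trying to picture that arrangement, one possible reading is
sketched below.  This is a guess at the general shape, not the actual code:
the entry kinds, the union layout and every name here (struct entry, token,
add_spelled, add_int_const, token_kind) are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical kinds of data an entry can hold. */
enum entry_kind { EK_KEYWORD, EK_IDENT, EK_INT_CONST };

/* One array element; which union member is valid depends on kind. */
struct entry {
    enum entry_kind kind;
    union {
        const char *spelling;   /* EK_KEYWORD, EK_IDENT */
        long value;             /* EK_INT_CONST */
    } u;
};

typedef int token;              /* a token is just an index into table[] */

enum { MAX_ENTRIES = 64 };
static struct entry table[MAX_ENTRIES];
static int entry_count;

/* Add a keyword or identifier entry and return its token. */
static token add_spelled(enum entry_kind kind, const char *spelling)
{
    assert(kind != EK_INT_CONST);
    assert(entry_count < MAX_ENTRIES);
    table[entry_count].kind = kind;
    table[entry_count].u.spelling = spelling;
    return entry_count++;
}

/* Add an integer constant entry and return its token. */
static token add_int_const(long value)
{
    assert(entry_count < MAX_ENTRIES);
    table[entry_count].kind = EK_INT_CONST;
    table[entry_count].u.value = value;
    return entry_count++;
}

/* What the parser would ask about a token before using it. */
static enum entry_kind token_kind(token t)
{
    assert(t >= 0 && t < entry_count);
    return table[t].kind;
}
```

In this reading the parser only ever handles small integers and consults the
table when it needs the kind, spelling or value behind one.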

#120555

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2017-09-29 18:27 +0100
Message-ID	<87efqpfm9d.fsf@bsb.me.uk>
In reply to	#120517

David Kleinecke <dkleinecke@gmail.com> writes:

> On Thursday, September 28, 2017 at 6:40:08 PM UTC-7, Keith Thompson wrote:
>> David Kleinecke <dkleinecke@gmail.com> writes:
>> > On Thursday, September 28, 2017 at 12:38:02 PM UTC-7, Keith Thompson wrote:
>> >> 
>> >> A compiler typically treats syntax and semantics separately.  Syntactic
>> >> analysis, or parsing, is often semi-automated, with the parser generated
>> >> programmatically from a formal grammar -- or it might be written manually.
>> >> If parsing fails, that's a syntax error.  Other errors, like trying to
>> >> apply a shift operator to a pointer value, have to be detected during
>> >> semantic analysis.  As far as typical compiler internals are concerned,
>> >> they're fundamentally different kinds of errors.  Tweaking the grammar
>> >> to treat some syntax errors as semantic errors can enable better
>> >> diagnostics in some cases.
>> >  
>> > My compilers also treat syntax and semantics separately. The
>> > question is whether the constraints are part of syntax or part
>> > of semantics. 
>> 
>> Semantics, I'd say.
>> 
>> > You imply that some are and some aren't.
>> 
>> Did I?
>> 
>> >                                          Quite possible
>> > although I have no examples. I have concluded that what
>> > I called the spirit of C expects them with the syntax but
>> > I had them coded with the semantics. I have only begun
>> > moving the constraints and haven't yet found any that
>> > don't fit rather easily into the syntax. 
>> 
>> I suspect I really don't understand what you're saying.
>> 
>> Here's an example.  C11 6.8.4.2p1 specifies the following constraint:
>> 
>>     The controlling expression of a switch statement shall have integer
>>     type.
>> 
>> Unless you extend the meaning of "syntax" beyond (my) recognition, you
>> won't be able to enforce that constraint using only syntax information.
>> 
>> Roughly, a syntax error is a failure to parse the source code in
>> accordance with a grammar that can be defined, for example, in BNF
>> (Backus-Naur Form).  (C's treatment of typedefs punches a small hole in
>> this model.)  Any errors that are not syntax errors are what I think of
>> as semantic errors.  In C, this is more or less expressed as syntax
>> rules vs. constraints.
>
> I understand syntax, perhaps, like a linguist would. In
> all the linguistic work I know what fills a slot (a slot
> like the identifier slot in a switch statement) can be
> sub-categorized to a subset of all identifiers. In this
> case the identifier must have the attribute "integer".
> The token (already identified as an identifier) is further
> sub-categorized by being declared, for example, an "int".

Perhaps you could show us the grammar you use in which the given
constraint is a syntax error?

Such a grammar does exist as you, as a linguist, will know.  It's not
the grammar in the C standard, and it may well be so much more complex
that it can't be parsed efficiently, but knowing what it is before
embarking on this part of the project will help you immensely.

Keith is correct in that most constraints are not syntax errors, taking
the language syntax to be that described by the grammar in the standard.
You are correct in that any set of static properties of a text (and the
constraints in the C standard are all checkable by inspecting the text
of one translation unit alone) can be turned into a grammar such that
violating one or more of them becomes a matter solely of syntax.

However, there are sound reasons why that approach is not taken in most
programming languages.

What seems more likely is that you are planning to parse the syntax as in
the standard C grammar and maintain extra information as you go that can
be used to diagnose constraint violations.  That's how every C compiler
I have ever seen does it.

> The source of my concern about what a token actually is
> comes from this accumulation of additional attributes -
> which would include constant and volatile as well as
> static/extern.

This sounds like the usual approach.

-- 
Ben.

#120450

From	jameskuyper@verizon.net
Date	2017-09-28 09:13 -0700
Message-ID	<8ae5497b-14f0-477e-bd15-ffe7883b6855@googlegroups.com>
In reply to	#120422

On Thursday, September 28, 2017 at 1:09:24 AM UTC-4, David Kleinecke wrote:
> On Wednesday, September 27, 2017 at 9:19:34 PM UTC-7, James Kuyper wrote:
> > On 09/27/2017 11:33 PM, Pascal J. Bourguignon wrote:
> > > David Kleinecke <dkleinecke@gmail.com> writes:
...
> > >> I observe that 5.1.1.2 says "decomposed". Perhaps - it
> > >> If so the "shall" that every pp-token that becomes a token
> > >> must have one of the approved shapes leaves us with possibly
> > >> many pp-tokens that don't become tokens. I assume that
> > >> what happens to them is more undefined behavior. 
> > > 
> > > Not really undefined: it will provoke a lexical error with the C
> > > compiler.
> > 
> > A constraint violation, to be precise (6.4p2).
> 
> I think C89 has no concept of "constraint violation" or 
> even of "violation". I can be faulted for using C89 of
> course but the C89 section 3 does not make any
> distinction between kinds of "shall". This seems to
> be nit-picking to me.

My copy of C90 doesn't have paragraph or line numbers, so the following
citations are not as specific as the ones I usually provide:

C90 1.6 starts out with:
"   In this Standard, ``shall'' is to be interpreted as a requirement
on an implementation or on a program; conversely, ``shall not'' is to
be interpreted as a prohibition."

Under the definition of "undefined behavior", it says:
"If a ``shall'' or ``shall not'' requirement that appears outside of
   a constraint is violated, the behavior is undefined. ..."

The fact that this rule applies only to a shall "that appears outside of a
constraint" seems pretty clear to me. Farther along in the same section, it
says:

And then it goes ahead and explicitly defines what "Constraints" means:
" * Constraints --- syntactic and semantic restrictions by which the
   exposition of language elements is to be interpreted."

Section 2.1.1.3: "A conforming implementation shall produce at least one
diagnostic message (identified in an implementation-defined manner) for every
translation unit that contains a violation of any syntax rule or constraint.
..."

There are 52 different sections in the C89 standard labelled "Constraints". One
of them is under section 3.1, and says the same thing that the current standard
says about this particular issue:

"Each preprocessing token that is converted to a token shall have
the lexical form of a keyword, an identifier, a constant, a string
literal, an operator, or a punctuator."

If you were under the impression that C89 had no concept of constraints, I can
only conclude that you've never bothered to carefully read the document.

#120438

From	Richard Damon <Richard@Damon-Family.org>
Date	2017-09-28 08:15 -0400
Message-ID	<GJ5zB.186245$OC1.129884@fx06.iad>
In reply to	#120418

On 9/27/17 11:33 PM, Pascal J. Bourguignon wrote:
> David Kleinecke <dkleinecke@gmail.com> writes:
> 
>> I am having trouble reading the standard on this one small
>> point. I will quote the C89 standard but I have checked the
>> C11 standard and seems to have exactly the same problem.
>>
>> The question is - exactly what does the standard mean by
>> "lexical form" and "lexical elements". There is no
>> definition within the standard and (apart from a forward
>> reference in 5.1.1.2) they only appear in 6.4 (6.1
>> in C89).
>>
>> There are expressions like "Each preprocessing token that is
>> converted to a token shall have the lexical form of a
>> keyword, an identifier, ...". This says each pp-token has
>> something called a "lexical form" which seems to have five
>> values (six in C89). The pp-tokens in my preprocessor are
>> unsigned ints (and are unchanged as tokens).
> 
> n1124.pdf mentions seven pp-tokens:
> 
>      header-name
>      identifier
>      pp-number
>      character-constant
>      string-literal
>      punctuator
>      non-white-space that cannot be one of the above
> 
> Assume a pre-processor/compiler that processes UTF-8 sources.  So you
> can have a string such as "Cet été était chaud."  However, été still
> cannot be an identifier. (You would have to write it as \u00e9t\u00e9).
> 
> So été is a pp-token  (non-white-space that cannot be one of the above),
> that is not an acceptable lexical form to be converted into a (C) token.
> 

Small correction, 6.4.2.1 lists for the definition of 
identifier-nondigit (the characters an identifier is allowed to begin 
with):

nondigit (the letters and _)
universal-character-name
other implementation defined characters

because of the last term, an implementation is allowed to define that é
is part of the implementation-defined character set for nondigits, and
thus été is possibly a valid identifier (subject to the above
implementation-defined behavior). There is perhaps a slight suggestion
that an implementation that accepts wide characters (i.e., UTF-8 source
files) allow the UTF-8 codes that are the equivalent of the values
listed in Annex D as identifier characters.
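
A small sketch of how that implementation-defined choice could show up in a
lexer.  Everything here is an assumption for illustration: the function names
are made up, the basic letters are tested with ASCII-style ranges, any byte
of 0x80 or above is crudely treated as an "other implementation-defined
character" when the flag is set (a real implementation would decode UTF-8 and
check the Annex D ranges), and universal-character-names spelled as \uXXXX
are not handled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The letters and underscore; assumes an ASCII-compatible execution set. */
static bool is_basic_nondigit(unsigned char c)
{
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Does s have the lexical form of an identifier?
   accept_extended models the implementation-defined choice to allow
   characters outside the basic set (here: any byte >= 0x80). */
static bool is_identifier(const char *s, bool accept_extended)
{
    assert(s != NULL);
    if (s[0] == '\0')
        return false;
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        bool ok = is_basic_nondigit(c)
               || (i > 0 && c >= '0' && c <= '9')   /* digits, not first */
               || (accept_extended && c >= 0x80);
        if (!ok)
            return false;
    }
    return true;
}
```

With the UTF-8 bytes of été ("\xc3\xa9t\xc3\xa9"), the function rejects the
spelling when accept_extended is false and accepts it when the flag is true,
which is the difference between the two readings discussed above.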
