From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: C vs Haskell for XML parsing
Date: Wed, 30 Aug 2023 11:42:40 -0700
Organization: None to speak of
Lines: 77
Message-ID: <87o7io9xsv.fsf@nosuchdomain.example.com>
References: <576801fa-2842-40dc-bf19-221a5b1cf660n@googlegroups.com> <ucbjph$96fa$1@dont-email.me> <ipdGM.457101$xMqa.238959@fx12.iad> <uccitk$hhuj$1@dont-email.me> <H2oGM.827787$TPw2.680260@fx17.iad> <ucd5kt$kl1p$1@dont-email.me> <20230826123929.770@kylheku.com> <ucdp4i$ot46$1@dont-email.me> <20230826210521.20@kylheku.com> <ucf86t$14b70$1@dont-email.me> <20230827151627.814@kylheku.com> <87edjocbqj.fsf@nosuchdomain.example.com> <86edjnxo81.fsf@linuxsc.com> <87ledubyeh.fsf@nosuchdomain.example.com> <861qfmwwvy.fsf@linuxsc.com> <ucjei2$1tv7s$1@dont-email.me> <20230828182115.305@kylheku.com> <uckeel$26q6m$1@dont-email.me> <uckumo$29oe0$1@dont-email.me> <875y4xboly.fsf@nosuchdomain.example.com> <ucn6hi$2n2kb$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain

David Brown <david.brown@hesbynett.no> writes:
> On 29/08/2023 22:06, Keith Thompson wrote:
>> David Brown <david.brown@hesbynett.no> writes:
>> [...]
>>> Being able to accept $ in identifiers is a convenient extension.
>> Quibble: $ in identifiers is not an extension as specified in section 4
>> of the standard.  Starting in C99, the set of characters accepted in
>> identifiers is implementation-defined.  (I'm not sure what difference
>> that makes.)
>
> Until it was mentioned in this thread, I didn't realise that dollars
> (or other characters) in identifiers were implementation-defined
> options, rather than requiring it to be an extension.  (The
> distinction I would make here is that an "extension" is for something
> that the standard does not cover at all but is documented by the
> compiler - something that would be a syntax error, a constraint
> violation, or undefined behaviour if the compiler did not support it.)
> As you say, I don't think there is a significant difference in
> practice, but I like to understand these things as accurately as I
> can, and I appreciate the bug-fix when I get them wrong.
>
> On a related note, I am looking at 6.4.2.1p3 (in C11) on "universal
> character names", and it appears to say that the characters must come 
> from the list in D.1 (but not D.2, for the initial character), but
> also that implementations can allow other characters.  This would make
> the D.1 list the minimal set allowed, but implementations could allow
> any (or almost any) other Unicode characters.  Do you think that is
> right, or am I missing something?

C23 reworks the definition of "identifier".  See N3096 6.4.2.1.
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf

An identifier may start with any of:
    nondigit
    XID_Start character
    universal-character-name of class XID_Start
followed by zero or more of:
    digit
    nondigit
    XID_Continue character
    universal-character-name of class XID_Continue
where "digit" is one of 0..9 and "nondigit" is underscore or one of the
52 uppercase and lowercase Latin letters.
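
As a rough sketch of the grammar above, restricted to the "digit" and
"nondigit" alternatives: the function below checks whether a string is
an identifier built only from those characters.  The XID_Start and
XID_Continue alternatives are deliberately left out, keywords are not
rejected, and the function names are my own invention, not anything
from the standard.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* "nondigit": underscore or one of the 52 Latin letters.  The letters
   are listed explicitly because C does not guarantee that they are
   contiguous in the execution character set. */
static bool is_nondigit(char c)
{
    static const char nondigits[] =
        "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    return c != '\0' && strchr(nondigits, c) != NULL;
}

/* "digit": one of 0..9.  The decimal digits are guaranteed to be
   contiguous, so a range test is fine here. */
static bool is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Returns true if s is a nondigit followed by zero or more digits or
   nondigits.  Hypothetical helper; ignores XID characters, universal
   character names, and keywords. */
bool is_basic_identifier(const char *s)
{
    if (s == NULL || !is_nondigit(s[0]))
        return false;
    for (size_t i = 1; s[i] != '\0'; i++) {
        if (!is_digit(s[i]) && !is_nondigit(s[i]))
            return false;
    }
    return true;
}
```

With this, is_basic_identifier("_x1") is true, while "1x", "" and
"a$b" are all rejected; an implementation that accepts $ would have to
add it to the nondigit list.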

XID_Start and XID_Continue are specified in "UAX #44"
https://unicode.org/reports/tr44/
I haven't yet read enough of it to figure out what those mean.
Annex D has more information.

Under Semantics, N3096 says:
    An XID_Start character is an implementation-defined character whose
    corresponding code point in ISO/IEC 10646 has the XID_Start
    property. An XID_Continue character is an implementation-defined
    character whose corresponding code point in ISO/IEC 10646 has the
    XID_Continue property.

My understanding is that an implementation will choose and document some
subset of the XID_Start and XID_Continue characters to be valid in
identifiers.
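
To make that concrete, here is a small sketch that uses a universal
character name in an identifier.  It assumes the implementation accepts
U+00E9 there.  That code point falls in the D.1 ranges of N1570 and has
the XID_Start property, but whether a particular compiler takes it is
exactly the implementation-defined part.  The names are made up for
illustration.

```c
/* caf\u00E9 spells "cafe" with an acute accent on the last letter;
   U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
   Assumption: the implementation accepts this UCN in identifiers. */
static int caf\u00E9 = 42;

/* ASCII-named accessor, so callers need not spell the UCN. */
int read_cafe(void)
{
    return caf\u00E9;
}
```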

One odd thing (in both N1570 and N3096) is that the Semantics subsection
uses "shall".  For example, N1570 6.4.2.1p3 says:

    Each universal character name in an identifier shall designate a
    character whose encoding in ISO/IEC 10646 falls into one of the
    ranges specified in D.1. The initial character shall not be a
    universal character name designating a character whose encoding
    falls into one of the ranges specified in D.2.

Since these requirements appear outside a Constraints subsection,
section 4 paragraph 2 implies that violating one of them has undefined
behavior.  I would have expected it to be a syntax error.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */