From: Keith Thompson
Newsgroups: comp.lang.c
Subject: Re: C vs Haskell for XML parsing
Date: Wed, 30 Aug 2023 11:42:40 -0700
Message-ID: <87o7io9xsv.fsf@nosuchdomain.example.com>

David Brown writes:
> On 29/08/2023 22:06, Keith Thompson wrote:
>> David Brown writes:
>> [...]
>>> Being able to accept $ in identifiers is a convenient extension.
>> Quibble: $ in identifiers is not an extension as specified in
>> section 4 of the standard.  Starting in C99, the set of characters
>> accepted in identifiers is implementation-defined.  (I'm not sure
>> what difference that makes.)
>
> Until it was mentioned in this thread, I didn't realise that dollars
> (or other characters) in identifiers were implementation-defined
> options, rather than requiring it to be an extension.
> (The distinction I would make here is that an "extension" is for
> something that the standard does not cover at all but is documented
> by the compiler - something that would be a syntax error, a
> constraint violation, or undefined behaviour if the compiler did not
> support it.)  As you say, I don't think there is a significant
> difference in practice, but I like to understand these things as
> accurately as I can, and I appreciate the bug-fix when I get them
> wrong.
>
> On a related note, I am looking at 6.4.2.1p3 (in C11) on "universal
> character names", and it appears to say that the characters must come
> from the list in D.1 (but not D.2, for the initial character), but
> also that implementations can allow other characters.  This would make
> the D.1 list the minimal set allowed, but implementations could allow
> any (or almost any) other Unicode characters.  Do you think that is
> right, or am I missing something?

C23 reworks the definition of "identifier".  See N3096 6.4.2.1.
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf

An identifier may start with any of:
    nondigit
    XID_Start character
    universal-character-name of class XID_Start
followed by zero or more of:
    digit
    nondigit
    XID_Continue character
    universal-character-name of class XID_Continue
where "digit" is one of 0..9 and "nondigit" is underscore or one of the
52 uppercase and lowercase Latin letters.

XID_Start and XID_Continue are specified in "UAX #44"
https://unicode.org/reports/tr44/
I haven't yet read enough of it to figure out what those mean.  Annex D
has more information.

Under Semantics, N3096 says:

    An XID_Start character is an implementation-defined character whose
    corresponding code point in ISO/IEC 10646 has the XID_Start
    property. An XID_Continue character is an implementation-defined
    character whose corresponding code point in ISO/IEC 10646 has the
    XID_Continue property.
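
As a rough illustration of the grammar summarized above, here is a
minimal sketch of a checker that works on a sequence of code points, so
that a character and a universal-character-name naming the same code
point are treated alike.  The names is_xid_start, is_xid_continue and
is_identifier are hypothetical, not anything from the standard or from
a real compiler.  The two XID hooks are stubs that accept nothing; they
stand in for whatever implementation-defined set a given compiler
documents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the implementation-defined sets of
   XID_Start and XID_Continue characters.  These stubs accept nothing,
   so only the basic "nondigit" and "digit" characters are recognized. */
static bool is_xid_start(unsigned long cp)    { (void)cp; return false; }
static bool is_xid_continue(unsigned long cp) { (void)cp; return false; }

/* "digit": one of 0..9, given as ISO/IEC 10646 code points. */
static bool is_digit(unsigned long cp)
{
    return cp >= 0x30 && cp <= 0x39;
}

/* "nondigit": underscore or one of the 52 Latin letters. */
static bool is_nondigit(unsigned long cp)
{
    return cp == 0x5F
        || (cp >= 0x41 && cp <= 0x5A)
        || (cp >= 0x61 && cp <= 0x7A);
}

/* Returns true if the n code points at cp form an identifier:
   one start character followed by zero or more continue characters. */
bool is_identifier(const unsigned long *cp, size_t n)
{
    assert(cp != NULL || n == 0);
    if (n == 0)
        return false;
    if (!is_nondigit(cp[0]) && !is_xid_start(cp[0]))
        return false;
    for (size_t i = 1; i < n; i++) {
        if (!is_digit(cp[i]) && !is_nondigit(cp[i])
            && !is_xid_continue(cp[i]))
            return false;
    }
    return true;
}
```

With the stubs as written, the code points for "_x1" are accepted,
while "1x" (leading digit) and "x$" are rejected.  Swapping in real
property lookups for the two hooks is where an implementation's
documented choice of characters would come in.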

My understanding is that an implementation will choose and document some
subset of the XID_Start and XID_Continue characters to be valid in
identifiers.

One odd thing (in both N1570 and N3096) is that the Semantics subsection
uses "shall".  For example, N1570 6.4.2.1p3 says:

    Each universal character name in an identifier shall designate a
    character whose encoding in ISO/IEC 10646 falls into one of the
    ranges specified in D.1. The initial character shall not be a
    universal character name designating a character whose encoding
    falls into one of the ranges specified in D.2.

Since this "shall" appears outside a constraint, section 4 implies that
a violation of such a requirement has undefined behavior.  I would have
expected it to be a syntax error.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */