| From | "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Re: Lexing Unicode strings? |
| Date | 2021-05-03 19:58 -0400 |
| Organization | Compilers Central |
| Message-ID | <21-05-001@comp.compilers> |
On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

> Dear c.compilers,
>
> For context, I have been reading the old book Compiler Design in C
> by Allen Holub, available here
>
>   https://holub.com/compiler/
>
> and it goes into the details of the author's own LeX implementation.

Snip.

> [The obvious approach if you're scanning UTF-8 text is to keep
> treating the input as a sequence of bytes. UTF-8 was designed so that
> no character representation is a prefix or suffix of any other
> character, so it should work without having to be clever. -John]

That's not always feasible, nor the right approach. Consider the range of all lowercase Greek letters. In the source file, that range will look something like

  [\xCE\xB1-\xCF\x89]

and clearly the intent is not to match the bytes \xCE, \xB1..\xCF, and \x89.

There is also the question of validating the input. It seems more natural to put the overlong-sequence validator and the legal-code-point validator into the lexer, rather than preprocess the source file. But maybe that's "premature optimization?"

UTF-8 has more problems too. There are bytes that can only appear after other bytes, and detecting that early would be nice[0]; there is also the fact that some glyphs can be constructed with either a single code point or two. Also, Unicode input is not always UTF-8.

When Holub constructs his state machines for the regular expressions, he uses bitmasks for the ranges[1]. ASCII, at 128 bits per mask, is reasonable; even 256 bits for 8-bit characters is fine. But 0x110000 bits in a mask feels excessive to me.

Maybe I was naive when I thought Unicode was considered a "solved problem." I guess everyone is using either libraries like ICU or run-time environments. These can have the following "problem:" invalid input may not be signalled to the higher-level code, but quietly replaced with the replacement character, U+FFFD. Which is, in my opinion, not good for a compiler.
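The validation step described above (overlong sequences, surrogates, out-of-range code points) can be sketched as a small hand-written decoder. This is a minimal illustration only, not Holub's code; the function name and the -1 error convention are made up here:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of a validating UTF-8 decoder for a lexer front end.
 * Decodes one code point, advancing *p, or returns -1 on invalid
 * input: bad lead byte, truncation, bad continuation byte, overlong
 * encoding, surrogate, or a value beyond U+10FFFF. */
static int32_t decode_utf8(const uint8_t **p, const uint8_t *end)
{
    if (*p >= end) return -1;
    uint8_t b = *(*p)++;
    int32_t cp, min;   /* min = smallest code point legal for this length */
    int n;             /* number of continuation bytes expected */

    if (b < 0x80)                { return b; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; n = 1; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; n = 2; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; n = 3; min = 0x10000; }
    else return -1;    /* stray continuation byte or invalid lead byte */

    while (n--) {
        if (*p >= end || ((**p) & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (*(*p)++ & 0x3F);
    }
    if (cp < min) return -1;                      /* overlong encoding */
    if (cp > 0x10FFFF) return -1;                 /* beyond Unicode    */
    if (cp >= 0xD800 && cp <= 0xDFFF) return -1;  /* surrogate         */
    return cp;
}
```

Calling this per character inside the lexer's next-input routine gives the early error detection discussed above, instead of a preprocessing pass over the whole source file.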
[0] This can be part of the generated lexer tables, and is harder to do manually when bootstrapping the lexer by hand.

[1] His implementation of sets.

--
Johann

[I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding is a prefix or suffix of any other UTF-8 encoding, you can lex them the same way you'd lex strings of ASCII. In the example above, \xCE, \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of a multi-byte sequence, so you can put a wildcard . at the end to match bogus bytes and complain about an invalid character. Dunno what you mean about not always UTF-8; I realize there are mislabeled files of UTF-16 that you have to sort out by sniffing the BOM at the front, but you do that and turn whatever you're getting into UTF-8 and then feed it to the lexer. I agree that lexing Unicode is not a solved problem, and I'm not aware of any really good ways to limit the table sizes. -John]
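For what it's worth, the byte-wise expansion John describes can be made concrete for the Greek example: the code-point range [U+03B1, U+03C9] splits into one byte-range alternative per lead byte, rather than the naive byte range from the source file. A hand-rolled sketch (a lexer generator would build the equivalent DFA states automatically):

```c
#include <stddef.h>

/* Byte-wise matching of the code-point range U+03B1..U+03C9
 * (lowercase Greek letters alpha through omega). The range splits
 * into two byte-level alternatives, one per lead byte:
 *   \xCE [\xB1-\xBF]  covers U+03B1..U+03BF
 *   \xCF [\x80-\x89]  covers U+03C0..U+03C9
 * Returns the number of bytes consumed (2) on a match, 0 otherwise. */
static size_t match_greek_lower(const unsigned char *s, size_t len)
{
    if (len < 2) return 0;
    if (s[0] == 0xCE && s[1] >= 0xB1 && s[1] <= 0xBF) return 2;
    if (s[0] == 0xCF && s[1] >= 0x80 && s[1] <= 0x89) return 2;
    return 0;
}
```

This is exactly the transformation a Unicode-aware lexer generator would have to apply internally: rewrite code-point ranges into alternations of byte ranges, so the resulting DFA runs over bytes and its tables stay 256 entries wide instead of 0x110000.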