Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2653 > unrolled thread
| Started by | "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> |
|---|---|
| First post | 2021-05-03 19:58 -0400 |
| Last post | 2021-07-14 15:39 -0400 |
| Articles | 4 — 3 participants |
Back to article view | Back to comp.compilers
Re: Lexing Unicode strings? "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> - 2021-05-03 19:58 -0400
Re: Lexing Unicode strings? gah4 <gah4@u.washington.edu> - 2021-05-04 01:11 -0700
Re: Lexing Unicode strings? gah4 <gah4@u.washington.edu> - 2021-05-04 14:47 -0700
Re: Lexing Unicode strings? Hans Aberg <haberg-news@telia.com> - 2021-07-14 15:39 -0400
| From | "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> |
|---|---|
| Date | 2021-05-03 19:58 -0400 |
| Subject | Re: Lexing Unicode strings? |
| Message-ID | <21-05-001@comp.compilers> |
On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote: > Dear c.compilers, > For context, I have been reading the old book Compiler design in C > by Allen Holub; available here > https://holub.com/compiler/ > and it goes into the details of the author's own LeX implementation. Snip. > [The obvious approach if you're scaning UTF-8 text is to keep treating the input as > a sequence of bytes. UTF-8 was designed so that no character representation is a > prefix or suffix of any other character, so it should work without having to be clever > -John] That's not always feasible, nor the right approach. Let's consider the range of all lowercase greek letters. In the source file, that range will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is not to match the bytes \xCE, \xB1..\xCF, and \x89. There is also the question of validating the input. It seems more natural to put the overlong sequence validator, and legal code point validator into the lexer, rather than preprocess the source file. But maybe that's "premature optimization?" Utf-8 has more problems too. There are bytes that can only appear after other bytes, and detecting that early would be nice[0]; there is also the fact that some glyphs can be constructed with a single code point, or two. Also, Unicode input is not always utf-8. When Holub is constructing his state machines for the regular expressions, he uses bitmasks for the ranges[1]. ASCII, being 128 bits in a mask, is reasonable; even 256 bits for 8bit characters. 0x11000 bits in a mask feels excessive to me, though. I was maybe naive, when I thought Unicode was considered a "solved problem." I guess everyone is using either libraries like ICU or run time environments. These can have the following "problem:" Invalid input may not be signalled to the higher level code, but quietly replaced with the replacement symbol, U+FFFD. Which is, in my opinion, not good for a compiler. [0] This can be part of output lexer tables, and is harder to do manually when bootstrapping the lexer by hand. [1] His implementation of sets. -- Johann [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding is a prefix or suffix of any other UTF-8 encoding, you can lex them the same way you'd lex strings of ASCII. In that example above, \xCE, \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of a multi-byte sequence, so if they do, you can put a wildcard . at the end to match bogus bytes and complain about an invalid character. Dunno what you mean about not always UTF-8; I realize there are mislabeled files of UTF-16 that you have to sort out by sniffing the BOM at the front, but you do that and turn whatever you're getting into UTF-8 and then feed it to the lexer. I agree that lexing Unicode is not a solved problem, and I'm not aware of any really good ways to limit the table sizes. -John]
[toc] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2021-05-04 01:11 -0700 |
| Message-ID | <21-05-002@comp.compilers> |
| In reply to | #2653 |
On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote: > On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote: Snip. > > [The obvious approach if you're scaning UTF-8 text is to keep treating the input as > > a sequence of bytes. UTF-8 was designed so that no character representation is a > > prefix or suffix of any other character, so it should work without having to be clever > > -John] > That's not always feasible, nor the right approach. Let's consider the > range of all lowercase greek letters. In the source file, that range > will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is > not to match the bytes \xCE, \xB1..\xCF, and \x89. Ranges that are ranges of bytes in ASCII won't necessarily be in Unicode. You needs some Boolean logic in your matching, though that shouldn't be so hard. > There is also the question of validating the input. It seems more > natural to put the overlong sequence validator, and legal code point > validator into the lexer, rather than preprocess the source file. This reminds me of the question, which I don't remember where I saw, about applying Boyer-Moore to Unicode. One answer is, as John notes, to apply it to the UTF-8 bytes. But the whole idea behind Boyer-Moore is to use the unequal probability distribution of characters, to quickly skip over impossible matches. But the bytes of UTF-8 don't have the same (im)probabiity of the individual characters. It seems to me that you lose some of the speed advantage of Boyer-Moore, but maybe it is still plenty fast enough. On the other hand, Java uses a 16 bit char, and one should be able to apply Boyer-Moore just as well in that case. The tables will be bigger, but then so are computer memories today. So, as with many problems, there are trade-offs between speed and size, and one has to choose the best case for the specific problem. Note that in addition to have a 16 bit Unicode char, the Java language itself is defined in terms of Unicode. Variable names can be any Unicode letter, followed by Unicode letters and digits. Presumably, then, the designers of Java compilers have figured this out, I suspect using the 16 bit char. One can, for example, have variables named A and Α in the same program. (In case you can't see it, the second one is an Alpha.) Yes, Unicode can be fun! [Remember that Unicode is a 20 bit code and for characters outside the first 64K, Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8 seem clean and beautiful. -John]
[toc] | [prev] | [next] | [standalone]
| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Date | 2021-05-04 14:47 -0700 |
| Message-ID | <21-05-004@comp.compilers> |
| In reply to | #2654 |
On Tuesday, May 4, 2021 at 8:33:50 AM UTC-7, gah4 wrote: (snip, I wrote) > Note that in addition to have a 16 bit Unicode char, the Java language > itself is defined in terms of Unicode. Variable names can be any Unicode > letter, followed by Unicode letters and digits. Presumably, then, the > designers of Java compilers have figured this out, I suspect using the 16 bit char. (snip) > Yes, Unicode can be fun! > [Remember that Unicode is a 20 bit code and for characters outside the first 64K, > Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8 seem clean and beautiful. -John] I did know that Java used 16 bits, but never tried to figure out what they did with the rest of the characters. There should be enough in the first 64K for writing programs. I did once use π for a variable name, with the obvious value. It seems it is \u03c0. I even found an editor that allowed entering such characters, and then would write out the file with \u escapes. As far as I know, that is more usual than UTF-8. I believe that the Java parser converts from \u escapes fairly early, such that you can quote strings with \u0022, and then you should be able to put \uu0022 inside the strings. [If you're only going to allow the lower 64K, your users will be sad when they try to use quoted strings with uncommon Chinese characters or with emoji, or more likely your compiler will barf since they will be encoded as two surrogate characters and your lexer won't know what to do with them. If you're going to deal with Unicode, better bite the bullet and deal with the whole mess. -John]
[toc] | [prev] | [next] | [standalone]
| From | Hans Aberg <haberg-news@telia.com> |
|---|---|
| Date | 2021-07-14 15:39 -0400 |
| Message-ID | <21-07-002@comp.compilers> |
| In reply to | #2653 |
On 2021-05-04 01:58, John Levine wrote: > [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding > is a prefix or suffix of any other UTF-8 encoding, you can lex them > the same way you'd lex strings of ASCII. In that example above, \xCE, > \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of > a multi-byte sequence, so if they do, you can put a wildcard . at the > end to match bogus bytes and complain about an invalid character. Dunno > what you mean about not always UTF-8; I realize there are mislabeled > files of UTF-16 that you have to sort out by sniffing the BOM at the > front, but you do that and turn whatever you're getting into UTF-8 > and then feed it to the lexer. > > I agree that lexing Unicode is not a solved problem, and I'm not > aware of any really good ways to limit the table sizes. -John] I wrote code, in Haskell and C++, that translates Unicode character classes into byte classes. From a theoretical standpoint, a Unicode regular language mapped under UTF-8 is a byte regular language, so it is possible. So the 2^8 = 256 size tables that Flex uses is enough. The Flex manual has an example how to make a regular expression replacing its dot '.' to pick up all legal UTF-8 bytes.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web