| Header | Value |
|---|---|
| Path | csiph.com!tncsrv06.tnetconsulting.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end |
| From | gah4 <gah4@u.washington.edu> |
| Newsgroups | comp.compilers |
| Subject | Re: Lexing Unicode strings? |
| Date | Tue, 4 May 2021 01:11:51 -0700 (PDT) |
| Organization | Compilers Central |
| Lines | 54 |
| Sender | news@iecc.com |
| Approved | comp.compilers@iecc.com |
| Message-ID | <21-05-002@comp.compilers> |
| References | <21-05-001@comp.compilers> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset="UTF-8" |
| Content-Transfer-Encoding | 8bit |
| Injection-Info | gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="48110"; mail-complaints-to="abuse@iecc.com" |
| Keywords | lex, i18n, comment |
| Posted-Date | 04 May 2021 11:33:47 EDT |
| X-submission-address | compilers@iecc.com |
| X-moderator-address | compilers-request@iecc.com |
| X-FAQ-and-archives | http://compilers.iecc.com |
| In-Reply-To | <21-05-001@comp.compilers> |
| Xref | csiph.com comp.compilers:2654 |
On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:
> On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

Snip.

> > [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
> > a sequence of bytes. UTF-8 was designed so that no character representation is a
> > prefix or suffix of any other character, so it should work without having to be clever
> > -John]

> That's not always feasible, nor the right approach. Let's consider the
> range of all lowercase greek letters. In the source file, that range
> will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
> not to match the bytes \xCE, \xB1..\xCF, and \x89.

Ranges that are contiguous ranges of bytes in ASCII won't necessarily
be contiguous byte ranges in UTF-8. You need some Boolean logic in your
matching, though that shouldn't be so hard.

> There is also the question of validating the input. It seems more
> natural to put the overlong sequence validator, and legal code point
> validator into the lexer, rather than preprocess the source file.

This reminds me of the question, which I don't remember where I saw,
about applying Boyer-Moore to Unicode. One answer is, as John notes,
to apply it to the UTF-8 bytes.

But the whole idea behind Boyer-Moore is to skip quickly over
impossible matches, and it skips farthest when the character it looks
at is unlikely to occur in the pattern. The bytes of UTF-8 don't have
the same (im)probability as the individual characters. It seems to me
that you lose some of the speed advantage of Boyer-Moore, but maybe it
is still plenty fast enough.

On the other hand, Java uses a 16 bit char, and one should be able to
apply Boyer-Moore just as well in that case. The tables will be bigger,
but then so are computer memories today.

So, as with many problems, there are trade-offs between speed and size,
and one has to choose the best case for the specific problem.

Note that in addition to having a 16 bit Unicode char, the Java
language itself is defined in terms of Unicode. Variable names start
with any Unicode letter, followed by Unicode letters and digits.
Presumably, then, the designers of Java compilers have figured this
out, I suspect using the 16 bit char.

One can, for example, have variables named A and Α in the same program.
(In case you can't see it, the second one is an Alpha.)

Yes, Unicode can be fun!

[Remember that Unicode is a 21 bit code and for characters outside the first 64K,
Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8 seem
clean and beautiful. -John]
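
For what it's worth, the "Boolean logic" needed for the Greek range above can be sketched as follows. The idea is to split a code point range into pieces whose UTF-8 encodings differ only in ways that per-byte ranges can express, which is similar in spirit to what some byte-oriented regex engines do. This is only a rough sketch in Python, not anything from the thread: the names utf8_sequences and byte_regex are made up for illustration, and it assumes the endpoints are valid code points (at most 0x10FFFF).

```python
import re

SURROGATE_LO, SURROGATE_HI = 0xD800, 0xDFFF


def utf8_sequences(lo, hi):
    """Return byte-range sequences that together match the UTF-8 encodings
    of exactly the code points lo..hi (inclusive, surrogates skipped).

    Each sequence is a list of (low_byte, high_byte) pairs, one per byte.
    Hypothetical helper; assumes lo and hi are at most 0x10FFFF.
    """
    if lo > hi:
        return []
    # Surrogates have no UTF-8 encoding, so cut them out of the range.
    if lo <= SURROGATE_HI and hi >= SURROGATE_LO:
        return (utf8_sequences(lo, SURROGATE_LO - 1)
                + utf8_sequences(SURROGATE_HI + 1, hi))
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            return utf8_sequences(lo, limit) + utf8_sequences(limit + 1, hi)
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Split further until, wherever the endpoints differ above the low
    # 6*i bits, those low bits run from all zeros to all ones. Then the
    # trailing continuation bytes span their full range 0x80..0xBF.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if lo & ~mask != hi & ~mask:
            if lo & mask != 0:
                return (utf8_sequences(lo, lo | mask)
                        + utf8_sequences((lo | mask) + 1, hi))
            if hi & mask != mask:
                return (utf8_sequences(lo, (hi & ~mask) - 1)
                        + utf8_sequences(hi & ~mask, hi))
    # Both endpoints now encode to the same length; pair up their bytes.
    return [list(zip(chr(lo).encode("utf-8"), chr(hi).encode("utf-8")))]


def byte_regex(lo, hi):
    """Build a bytes regex matching the UTF-8 form of code points lo..hi."""
    def atom(a, b):
        return "\\x%02x" % a if a == b else "[\\x%02x-\\x%02x]" % (a, b)

    alternatives = ("".join(atom(a, b) for a, b in seq)
                    for seq in utf8_sequences(lo, hi))
    return "|".join(alternatives).encode("ascii")


# Lowercase Greek: U+03B1 (alpha) through U+03C9 (omega).
greek = byte_regex(0x03B1, 0x03C9)
print(greek.decode("ascii"))  # \xce[\xb1-\xbf]|\xcf[\x80-\x89]
```

So the single character class turns into an alternation of two byte sequences, and a byte-at-a-time lexer can match that without being clever. The naive byte class [\xCE\xB1-\xCF\x89], by contrast, accepts a lone byte such as \xC0 and never matches a whole two-byte letter.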