| Path | csiph.com!xmission!usenet.csail.mit.edu!news.iecc.com!.POSTED.news.iecc.com!nerds-end |
|---|---|
| From | "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> |
| Newsgroups | comp.compilers |
| Subject | Lexing Unicode strings? |
| Date | Wed, 21 Apr 2021 16:20:40 +0000 |
| Organization | Compilers Central |
| Lines | 30 |
| Sender | news@iecc.com |
| Approved | comp.compilers@iecc.com |
| Message-ID | <21-04-010@comp.compilers> (permalink) |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset="UTF-8" |
| Injection-Info | gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="7784"; mail-complaints-to="abuse@iecc.com" |
| Keywords | lex, i18n, question |
| Posted-Date | 21 Apr 2021 12:38:24 EDT |
| X-submission-address | compilers@iecc.com |
| X-moderator-address | compilers-request@iecc.com |
| X-FAQ-and-archives | http://compilers.iecc.com |
| Xref | csiph.com comp.compilers:2649 |
Dear c.compilers,

For context, I have been reading the old book Compiler Design in C by Allen Holub, available here: https://holub.com/compiler/. It goes into the details of the author's own LeX implementation.

Just like the dragon book [which I admit I haven't read for some number of years], it uses lookup tables indexed by individual characters. That is fine for ASCII, but seems excessive for all code points up to 0x10ffff in Unicode.

I am interested in doing this in plain old C, without external tools like ICU, for my own reasons[1]. What data structures are appropriate for this exercise? Are there resources out there I can study, other than the ICU source code? [Which, for other reasons of my own, I'm not too keen on studying.]

[1] Let's leave out the question of whether I'll be successful or not.

Thanks,
-- Johann

[The obvious approach if you're scanning UTF-8 text is to keep treating the input as a sequence of bytes. UTF-8 was designed so that no character representation is a prefix or suffix of any other character, so it should work without having to be clever. -John]
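A minimal sketch of the byte-oriented approach suggested in the moderator's note, in plain C. It is an illustration under stated assumptions, not code from the book or from any existing lexer: the names `byte_class`, `next_state` and `scan_identifier` are hypothetical, and every non-ASCII code point is treated as an identifier character, which is a simplification. Each byte is first mapped to one of a handful of classes, and the DFA table is indexed by state and class, so the tables stay tiny regardless of how many code points Unicode has.

```c
#include <stddef.h>

/* Byte classes: 256 possible bytes collapse into 7 classes. */
enum cls { OTHER, LETTER, DIGIT, CONT, LEAD2, LEAD3, LEAD4, NCLS };

/* DFA states: NEEDn means "n continuation bytes still expected". */
enum st { START, IDENT, NEED1, NEED2, NEED3, DEAD, NST };

/* Could equally be precomputed into a 256-entry array. */
static enum cls byte_class(unsigned char b)
{
    if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || b == '_')
        return LETTER;
    if (b >= '0' && b <= '9') return DIGIT;
    if (b >= 0x80 && b <= 0xBF) return CONT;   /* continuation byte      */
    if (b >= 0xC2 && b <= 0xDF) return LEAD2;  /* starts 2-byte sequence */
    if (b >= 0xE0 && b <= 0xEF) return LEAD3;  /* starts 3-byte sequence */
    if (b >= 0xF0 && b <= 0xF4) return LEAD4;  /* starts 4-byte sequence */
    return OTHER;  /* includes NUL and bytes that never occur in UTF-8 */
}

/* Assumption: any multi-byte sequence counts as an identifier character.
   This sketch does not reject every ill-formed sequence (for example
   overlong 3-byte forms or encoded surrogates). */
static const unsigned char next_state[NST][NCLS] = {
    /*            OTHER LETTER DIGIT  CONT   LEAD2  LEAD3  LEAD4 */
    /* START */ { DEAD, IDENT, DEAD,  DEAD,  NEED1, NEED2, NEED3 },
    /* IDENT */ { DEAD, IDENT, IDENT, DEAD,  NEED1, NEED2, NEED3 },
    /* NEED1 */ { DEAD, DEAD,  DEAD,  IDENT, DEAD,  DEAD,  DEAD  },
    /* NEED2 */ { DEAD, DEAD,  DEAD,  NEED1, DEAD,  DEAD,  DEAD  },
    /* NEED3 */ { DEAD, DEAD,  DEAD,  NEED2, DEAD,  DEAD,  DEAD  },
    /* DEAD  */ { DEAD, DEAD,  DEAD,  DEAD,  DEAD,  DEAD,  DEAD  },
};

/* Returns the length in bytes of the longest identifier at the start of
   the NUL-terminated string s, or 0 if none starts there. */
size_t scan_identifier(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0, last_accept = 0;
    enum st state = START;

    while (state != DEAD) {
        state = (enum st)next_state[state][byte_class(p[i])];
        i++;
        if (state == IDENT)
            last_accept = i;  /* remember the longest match so far */
    }
    return last_accept;
}
```

For instance, `scan_identifier("caf\xC3\xA9+")` consumes the five bytes of "café" and stops at the `+`. A lexer that needs real Unicode properties (letter versus punctuation, say) would have to decode the code point and consult a separate, typically multi-stage, table; that part is outside this sketch.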