| From | "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Lexing Unicode strings? |
| Date | 2021-04-21 16:20 +0000 |
| Organization | Compilers Central |
| Message-ID | <21-04-010@comp.compilers> |
Dear c.compilers,

For context, I have been reading the old book Compiler Design in C by Allen Holub, available here

https://holub.com/compiler/

and it goes into the details of the author's own LeX implementation. Just like the dragon book [which I admit I haven't read for some number of years], it uses lookup tables indexed by individual characters. That is fine for ASCII, but seems excessive for the full Unicode range of code points up to 0x10ffff.

I am interested in doing this in plain old C, without external tools like ICU, for my own reasons[1].

What data structures are appropriate for this exercise? Are there resources out there I can study, other than the ICU source code? [Which, for other reasons of my own, I'm not too keen on studying.]

[1] Let's leave out the question of whether I'll be successful or not.

Thanks,
--
Johann

[The obvious approach if you're scanning UTF-8 text is to keep treating the input as a sequence of bytes. UTF-8 was designed so that no character representation is a prefix or suffix of any other character, so it should work without having to be clever. -John]
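As a rough illustration of the byte-oriented approach in the moderator's note, here is a minimal sketch in plain C. The names (byte_class, classify, scan_identifier) are hypothetical, and the rule that every byte >= 0x80 counts as an identifier byte is an assumption made for this example, not something taken from Holub's LeX or from ICU. It does not validate the UTF-8 or consult Unicode character properties. It only shows that a classifier over 256 byte values, which could equally be stored as a 256-entry lookup table, is enough to carry multi-byte characters through a scanner.

```c
#include <stddef.h>

/* Hypothetical byte classes for a toy lexer. */
enum byte_class { BC_OTHER, BC_SPACE, BC_DIGIT, BC_IDENT };

/* Classify one input byte. In UTF-8, bytes >= 0x80 occur only inside
   multi-byte sequences, so this sketch (by assumption) treats all of
   them as identifier bytes and never decodes a code point. */
static enum byte_class classify(unsigned char b)
{
    if (b >= 0x80)
        return BC_IDENT;
    if (b == ' ' || b == '\t' || b == '\n' || b == '\r')
        return BC_SPACE;
    if (b >= '0' && b <= '9')
        return BC_DIGIT;
    if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || b == '_')
        return BC_IDENT;
    return BC_OTHER;
}

/* Return the length in bytes of the identifier at the start of the
   NUL-terminated string s, or 0 if s does not start with one. */
static size_t scan_identifier(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n;

    if (classify(p[0]) != BC_IDENT)
        return 0;

    n = 1;
    while (p[n] != '\0' &&
           (classify(p[n]) == BC_IDENT || classify(p[n]) == BC_DIGIT))
        n++;
    return n;
}
```

With this sketch, scan_identifier on a string that starts with a two-byte UTF-8 letter followed by ASCII letters returns a byte count, not a character count. A lexer that needs finer distinctions, such as rejecting non-letter code points, would have to decode the sequence and look it up in a sparser structure such as a range list or a multi-stage table.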