From: "Johann 'Myrkraverk' Oskarsson"
Newsgroups: comp.compilers
Subject: Lexing Unicode strings?
Date: Wed, 21 Apr 2021 16:20:40 +0000
Organization: Compilers Central
Message-ID: <21-04-010@comp.compilers>
Keywords: lex, i18n, question
Posted-Date: 21 Apr 2021 12:38:24 EDT

Dear c.compilers,

For context, I have been reading the old book Compiler Design in C by
Allen Holub (available at https://holub.com/compiler/), which goes into
the details of the author's own LeX implementation.

Just like the dragon book [which I admit I haven't read for some number
of years], it uses lookup tables indexed by individual characters. That
is fine for ASCII, but seems excessive for every code point up to
0x10FFFF in Unicode.

I am interested in doing this in plain old C, without external tools
like ICU, for my own reasons[1].

What data structures are appropriate for this exercise? Are there
resources out there I can study, other than the ICU source code? [Which,
for other reasons of my own, I'm not too keen on studying.]

[1] Let's leave aside the question of whether I'll be successful.

Thanks,
--
Johann

[The obvious approach if you're scanning UTF-8 text is to keep treating
the input as a sequence of bytes. UTF-8 was designed so that no
character representation is a prefix or suffix of any other character,
so it should work without having to be clever. -John]
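
For what it's worth, here is a minimal sketch in plain C of one possible answer to the data-structure question: decode UTF-8 into code points, then classify each code point with a sorted table of inclusive ranges and a binary search, rather than indexing a table with one slot per code point. This is not the byte-at-a-time approach from the moderator's note, just an alternative that keeps the tables small. All the names here (utf8_decode, cp_range, letter_ranges, in_ranges, is_letter, scan_letters) are made up for illustration, and the range table is a tiny hand-picked sample (ASCII, Latin-1, basic Greek and Cyrillic letters). A real table would presumably be generated from the Unicode Character Database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point allowed for each sequence length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n < len) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

struct cp_range { uint32_t lo, hi; };          /* inclusive bounds */

/* Illustrative sample only; must stay sorted and non-overlapping. */
const struct cp_range letter_ranges[] = {
    { 0x0041, 0x005A }, { 0x0061, 0x007A },                     /* A-Z, a-z */
    { 0x00C0, 0x00D6 }, { 0x00D8, 0x00F6 }, { 0x00F8, 0x00FF }, /* Latin-1  */
    { 0x0391, 0x03A1 }, { 0x03A3, 0x03A9 }, { 0x03B1, 0x03C9 }, /* Greek    */
    { 0x0410, 0x044F },                                         /* Cyrillic */
};

/* Binary search over a sorted range table: 1 if cp is covered, else 0. */
int in_ranges(const struct cp_range *t, size_t count, uint32_t cp)
{
    size_t lo = 0, hi = count;                 /* search the half-open [lo, hi) */
    assert(t != NULL || count == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < t[mid].lo)      hi = mid;
        else if (cp > t[mid].hi) lo = mid + 1;
        else return 1;
    }
    return 0;
}

int is_letter(uint32_t cp)
{
    return in_ranges(letter_ranges,
                     sizeof letter_ranges / sizeof letter_ranges[0], cp);
}

/* Length in bytes of the run of letters at the start of s[0..n). */
size_t scan_letters(const char *s, size_t n)
{
    size_t pos = 0;
    while (pos < n) {
        uint32_t cp;
        size_t used = utf8_decode((const unsigned char *)s + pos, n - pos, &cp);
        if (used == 0 || !is_letter(cp)) break; /* stop at bad input or non-letter */
        pos += used;
    }
    return pos;
}
```

The lookup costs O(log n) in the number of ranges instead of O(1), but the table grows with the number of ranges rather than with the size of the code space. A two-stage (paged) table is another common trade-off between the two; it is not shown here.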