Lexing Unicode strings?

Path csiph.com!xmission!usenet.csail.mit.edu!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com>
Newsgroups comp.compilers
Subject Lexing Unicode strings?
Date Wed, 21 Apr 2021 16:20:40 +0000
Organization Compilers Central
Lines 30
Sender news@iecc.com
Approved comp.compilers@iecc.com
Message-ID <21-04-010@comp.compilers> (permalink)
Mime-Version 1.0
Content-Type text/plain; charset="UTF-8"
Injection-Info gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="7784"; mail-complaints-to="abuse@iecc.com"
Keywords lex, i18n, question
Posted-Date 21 Apr 2021 12:38:24 EDT
X-submission-address compilers@iecc.com
X-moderator-address compilers-request@iecc.com
X-FAQ-and-archives http://compilers.iecc.com
Xref csiph.com comp.compilers:2649


Dear c.compilers,

For context, I have been reading the old book Compiler Design in C
by Allen Holub, available here:

https://holub.com/compiler/

and it goes into the details of the author's own LeX implementation.

Just like the dragon book [which I admit I haven't read for some number
of years], this uses lookup tables indexed by individual characters, which
is fine for ASCII, but does seem excessive for all 0x110000 code
points in Unicode.

I am interested in this, using plain old C, without using external tools
like ICU, for my own reasons[1].  What data structures are appropriate
for this exercise?  Are there resources out there I can study, other
than the ICU source code?  [Which for other reasons of my own, I'm not
too keen on studying.]

[1] Let's leave out the question of whether I'll be successful or not.
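
One data structure commonly used in place of a flat per-code-point table
is a sorted array of code point ranges searched with binary search.  The
following is only a minimal sketch in plain C: the names char_class,
cp_range and classify are hypothetical, and the sample ranges are a tiny
illustrative subset, not real Unicode property data and not taken from
Holub's book or from ICU.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical character classes a lexer might care about. */
enum char_class { CC_OTHER, CC_LETTER, CC_DIGIT, CC_SPACE };

struct cp_range {
    uint32_t first;        /* first code point in the range, inclusive */
    uint32_t last;         /* last code point in the range, inclusive */
    enum char_class cls;   /* class shared by every code point in the range */
};

/* Must be sorted by 'first' and non-overlapping.
 * Sample data only (assumption); a real table would be generated
 * from the Unicode Character Database. */
static const struct cp_range ranges[] = {
    { 0x0009, 0x000D, CC_SPACE  },
    { 0x0020, 0x0020, CC_SPACE  },
    { 0x0030, 0x0039, CC_DIGIT  },
    { 0x0041, 0x005A, CC_LETTER },
    { 0x0061, 0x007A, CC_LETTER },
    { 0x00C0, 0x00D6, CC_LETTER },
    { 0x00D8, 0x00F6, CC_LETTER },
    { 0x00F8, 0x00FF, CC_LETTER },
    { 0x0391, 0x03A1, CC_LETTER },  /* Greek capital Alpha..Rho */
    { 0x4E00, 0x9FFF, CC_LETTER },  /* CJK Unified Ideographs block */
};

/* Binary search over the range table; code points that fall in no
 * range are reported as CC_OTHER. */
enum char_class classify(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof ranges / sizeof ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ranges[mid].first)
            hi = mid;
        else if (cp > ranges[mid].last)
            lo = mid + 1;
        else
            return ranges[mid].cls;
    }
    return CC_OTHER;
}
```

The table grows with the number of ranges rather than with the number of
code points, at the cost of a logarithmic search per character.  A small
256-entry table for the ASCII and Latin-1 range in front of the search
is one way to keep the common case fast.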


Thanks,
--
Johann
[The obvious approach if you're scanning UTF-8 text is to keep treating
the input as a sequence of bytes.  UTF-8 was designed so that no character
representation is a prefix or suffix of any other character, so it should
work without having to be clever. -John]
