From: Christopher F Clark
Newsgroups: comp.compilers
Subject: Re: Lexing Unicode strings?
Date: Tue, 4 May 2021 14:39:54 +0300
Organization: Compilers Central
Approved: comp.compilers@iecc.com
Message-ID: <21-05-003@comp.compilers>
Keywords: lex, i18n
Posted-Date: 04 May 2021 11:34:37 EDT

I don't have much to personally add on this topic. However, if you are
considering how to compress lexer tables indexed by Unicode code points, I
would recommend you look at this paper by Dencker, Duerre, and Heuft,
"Optimization of Parser Tables for Portable Compilers":

https://dl.acm.org/doi/10.1145/1780.1802

They investigated the main techniques for compressing such tables, with
particular interest in ways of using "coloring" (assigning multiple entries
to the same location by indexing into a color table).

From my experience with lexing Unicode (which is admittedly quite limited),
most grammars have long sequential runs of Unicode code points that all fall
in the same set, e.g. they are alphabetic characters, digits, operators,
punctuation, etc., or disallowed, and once the code points are grouped into
those sets, the actual lexing tables are mostly compact.

Now, that doesn't necessarily help you map your UTF-8 (et al.) input into
those sets, although my guess is that it is simpler than it seems, as there
is regularity there. And, if you look at the techniques the authors present,
you can combine one or two of them and get a method that is efficient in
both space and time.
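[Moderator's aside: the "coloring" idea the post describes can be sketched in
a few lines. The sketch below is a hypothetical illustration, not code from
the paper: code points are mapped to a small class id ("color") through a
sorted range table, and the DFA transition table is then indexed by class
rather than by raw code point, so a state row needs only a handful of entries
instead of 0x110000. The specific classes and states are invented for the
example.]

```python
from bisect import bisect_right

# Hypothetical character classes; class 0 = "other/disallowed".
# Each entry is (first_code_point, class_id); a range runs until the
# next entry's start. Long sequential runs of code points that share a
# class (e.g. the CJK ideograph block) collapse to a single entry.
CLASS_RANGES = [
    (0x0000, 0),   # everything not listed below        -> other
    (0x0030, 1),   # '0'..'9'                           -> digit
    (0x003A, 0),
    (0x0041, 2),   # 'A'..'Z'                           -> alpha
    (0x005B, 0),
    (0x0061, 2),   # 'a'..'z'                           -> alpha
    (0x007B, 0),
    (0x4E00, 2),   # CJK Unified Ideographs, treated as -> alpha
    (0xA000, 0),
]
STARTS = [start for start, _ in CLASS_RANGES]

def char_class(cp: int) -> int:
    """Map a code point to its class id ("color") by binary search."""
    return CLASS_RANGES[bisect_right(STARTS, cp) - 1][1]

# The DFA transition table is indexed by class, not code point:
# 3 columns per state instead of 0x110000. -1 means "no transition".
TRANSITIONS = [
    # other  digit  alpha
    [-1,     2,     1],   # state 0: start
    [-1,     1,     1],   # state 1: in identifier (digits may continue it)
    [-1,     2,    -1],   # state 2: in number (digits only)
]

def step(state: int, cp: int) -> int:
    """One DFA step on a single decoded code point."""
    return TRANSITIONS[state][char_class(cp)]
```

The range table itself is the compressible part: sorted starts plus a class
id per run, searched in O(log n), which is one of the space/time trade-offs
the cited paper's techniques can be combined with.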
If you do some experiments and want review, critique, or suggestions, you may
either post here or email me directly. I would be interested in a space- and
time-efficient solution myself, as I intend to make my next lexer generator
for Yacc++ Unicode-aware and perhaps even Unicode-centric.

--
******************************************************************************
Chris Clark                email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
23 Bailey Rd               voice: (508) 435-5016
Berlin, MA 01503 USA       twitter: @intel_chris