Re: Lexing Unicode strings?

From	Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups	comp.compilers
Subject	Re: Lexing Unicode strings?
Date	2021-05-04 14:39 +0300
Organization	Compilers Central
Message-ID	<21-05-003@comp.compilers>

I don't have much to add personally on this topic.

However, if you are considering how to compress lexer tables indexed
by Unicode code points, I would recommend you look at this paper:
https://dl.acm.org/doi/10.1145/1780.1802
by Dencker, Duerre, and Heuft,
"Optimization of Parser Tables for Portable Compilers".

They investigated the main techniques for compressing said tables,
with particular interest in ways of using "coloring" (assigning
multiple entries to the same location by indexing into a color table).
From my experience with lexing Unicode (which is admittedly quite
limited), most grammars have long sequential runs of Unicode code
points that all fall in the same set (e.g. alphabetic characters,
digits, operators, punctuation, or disallowed characters), and once
the code points are grouped into those sets, the actual lexing tables
are mostly compact.
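
As a minimal sketch of that grouping, assuming a handful of
illustrative ranges and hypothetical class names (OTHER, ALPHA, DIGIT,
PUNCT) that are not taken from any particular grammar, a sorted range
table plus a binary search is enough to map a code point to its set:

```python
from bisect import bisect_right

# Hypothetical class ids; a real lexer would derive these from its grammar.
OTHER, ALPHA, DIGIT, PUNCT = range(4)

# Sorted, non-overlapping (first, last, class) ranges.
# Illustrative only, not a complete Unicode classification.
RANGES = [
    (0x0021, 0x002F, PUNCT),
    (0x0030, 0x0039, DIGIT),
    (0x0041, 0x005A, ALPHA),
    (0x0061, 0x007A, ALPHA),
    (0x03B1, 0x03C9, ALPHA),  # Greek small letters
    (0x4E00, 0x9FFF, ALPHA),  # CJK Unified Ideographs block
]
STARTS = [first for first, _, _ in RANGES]


def char_class(cp):
    """Return the class id of code point cp, or OTHER if no range covers it."""
    i = bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return OTHER
```

The lexer's transition table then needs one column per class rather
than one per code point, which is where the compactness comes from.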

Now, that doesn't necessarily help you map your UTF-8 (et al.) into
those sets, although my guess is that it is simpler than it seems, as
there is regularity there.
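
To illustrate the kind of regularity UTF-8 offers: the lead byte alone
fixes the sequence length, and every continuation byte contributes six
payload bits. The sketch below is a plain hand-written decoder (the
function name decode_utf8 is just an illustration, not part of any
lexer generator) that turns bytes back into code points, which can
then be classified into sets:

```python
def decode_utf8(data, pos):
    """Decode one UTF-8 sequence starting at data[pos].

    Returns (code_point, next_pos); raises ValueError on malformed input.
    """
    b0 = data[pos]
    if b0 < 0x80:                       # 0xxxxxxx: ASCII, one byte
        return b0, pos + 1
    if 0xC2 <= b0 <= 0xDF:              # 110xxxxx: one continuation byte
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:            # 1110xxxx: two continuation bytes
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:            # 11110xxx: three continuation bytes
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:                               # stray continuation or invalid lead
        raise ValueError("invalid lead byte at offset %d" % pos)
    if pos + need >= len(data):
        raise ValueError("truncated sequence at offset %d" % pos)
    for k in range(1, need + 1):
        b = data[pos + k]
        if (b & 0xC0) != 0x80:          # continuation bytes are 10xxxxxx
            raise ValueError("bad continuation byte at offset %d" % (pos + k))
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("ill-formed sequence at offset %d" % pos)
    return cp, pos + need + 1
```

A generated lexer could instead fold these byte patterns directly into
its automaton; decoding first is merely the simpler option to show.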

And, if you look at the techniques the authors presented, you can
combine one or two of them and get a method that is reasonably
efficient in both space and time.
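
One possible combination in that spirit is sketched below. It is a
common two-stage table with shared pages, offered as an illustration
rather than as a scheme taken from the paper: the code point is split
into a page number and an offset, and identical pages are stored only
once. PAGE_BITS, build_two_stage, lookup, and the toy classifier are
all assumptions made for the example.

```python
PAGE_BITS = 8                      # assumed page size of 256 code points
PAGE_SIZE = 1 << PAGE_BITS
MAX_CP = 0x10FFFF


def toy_classify(cp):
    """Toy classifier for the example: 2 = digit, 1 = letter, 0 = other."""
    if 0x30 <= cp <= 0x39:
        return 2
    if 0x41 <= cp <= 0x5A or 0x4E00 <= cp <= 0x9FFF:
        return 1
    return 0


def build_two_stage(classify):
    """Build (index, pages): index maps page number to a shared page id."""
    pages = []        # unique pages, each a tuple of PAGE_SIZE class ids
    page_ids = {}     # page contents -> position in pages
    index = []        # one entry per page of the code space
    for base in range(0, MAX_CP + 1, PAGE_SIZE):
        page = tuple(classify(cp) for cp in range(base, base + PAGE_SIZE))
        pid = page_ids.get(page)
        if pid is None:
            pid = len(pages)
            page_ids[page] = pid
            pages.append(page)
        index.append(pid)
    return index, pages


def lookup(index, pages, cp):
    """Constant-time class lookup: two array accesses, no searching."""
    return pages[index[cp >> PAGE_BITS]][cp & (PAGE_SIZE - 1)]
```

Because long runs of code points share a class, most pages are
identical and collapse into a few shared ones, while lookup stays at
two indexing operations.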

If you do some experiments and want review, critique, or suggestions,
you may either post here or email me directly.  I would be interested
in a space- and time-efficient solution myself, as I intend to make my
next lexer generator for Yacc++ Unicode-aware and perhaps even
Unicode-centric.
--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
