From: Christopher F Clark
Newsgroups: comp.compilers
Subject: Re: Lexing Unicode strings?
Date: Tue, 4 May 2021 14:39:54 +0300
Organization: Compilers Central
Approved: comp.compilers@iecc.com
Message-ID: <21-05-003@comp.compilers>
Keywords: lex, i18n
Posted-Date: 04 May 2021 11:34:37 EDT

I don't have much to personally add on this topic. However, if you are
considering how to compress lexer tables indexed by Unicode code points, I
would recommend you look at this paper by Dencker, Duerre, and Heuft,
"Optimization of Parser Tables for Portable Compilers":

https://dl.acm.org/doi/10.1145/1780.1802

They investigated the main techniques for compressing such tables, with
particular interest in ways of using "coloring" (assigning multiple entries
to the same location by indexing into a color table).

From my experience with lexing Unicode (which is admittedly quite limited),
most grammars have long sequential runs of Unicode code points that all fall
in the same set, e.g. they are alphabetic characters, digits, operators,
punctuation, etc., or disallowed, and once the code points are grouped into
those sets, the actual lexing tables are mostly compact.

Now, that doesn't necessarily help you map your UTF-8 (et al.) input into
those sets, although my guess is that it is simpler than it seems, as there
is regularity there. And, if you look at the techniques the authors present,
you can combine one or two of them and get a method that is efficient in
both space and time.
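[Moderator's aside: the "coloring" idea the post describes can be sketched in
a few lines. The sketch below is a hypothetical illustration, not code from
the paper: code points are mapped to a small class id ("color") through a
sorted range table, and the DFA transition table is then indexed by class
rather than by raw code point, so a state row needs only a handful of entries
instead of 0x110000. The specific classes and states are invented for the
example.]

```python
from bisect import bisect_right

# Hypothetical character classes; class 0 = "other/disallowed".
# Each entry is (first_code_point, class_id); a range runs until the
# next entry's start. Long sequential runs of code points that share a
# class (e.g. the CJK ideograph block) collapse to a single entry.
CLASS_RANGES = [
    (0x0000, 0),   # everything not listed below        -> other
    (0x0030, 1),   # '0'..'9'                           -> digit
    (0x003A, 0),
    (0x0041, 2),   # 'A'..'Z'                           -> alpha
    (0x005B, 0),
    (0x0061, 2),   # 'a'..'z'                           -> alpha
    (0x007B, 0),
    (0x4E00, 2),   # CJK Unified Ideographs, treated as -> alpha
    (0xA000, 0),
]
STARTS = [start for start, _ in CLASS_RANGES]

def char_class(cp: int) -> int:
    """Map a code point to its class id ("color") by binary search."""
    return CLASS_RANGES[bisect_right(STARTS, cp) - 1][1]

# The DFA transition table is indexed by class, not code point:
# 3 columns per state instead of 0x110000. -1 means "no transition".
TRANSITIONS = [
    # other  digit  alpha
    [-1,     2,     1],   # state 0: start
    [-1,     1,     1],   # state 1: in identifier (digits may continue it)
    [-1,     2,    -1],   # state 2: in number (digits only)
]

def step(state: int, cp: int) -> int:
    """One DFA step on a single decoded code point."""
    return TRANSITIONS[state][char_class(cp)]
```

The range table itself is the compressible part: sorted starts plus a class
id per run, searched in O(log n), which is one of the space/time trade-offs
the cited paper's techniques can be combined with.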
If you do some experiments and want review, critique, or suggestions, you may
either post here or email me directly. I would be interested in a space- and
time-efficient solution myself, as I intend to make my next lexer generator
for Yacc++ Unicode-aware and perhaps even Unicode-centric.

--
******************************************************************************
Chris Clark                email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
23 Bailey Rd               voice: (508) 435-5016
Berlin, MA 01503 USA       twitter: @intel_chris