| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Re: Lexing Unicode strings? |
| Date | 2021-05-04 14:39 +0300 |
| Organization | Compilers Central |
| Message-ID | <21-05-003@comp.compilers> |
I don't have much to add personally on this topic. However, if you are considering how to compress lexer tables indexed by Unicode code points, I would recommend you look at this paper:

https://dl.acm.org/doi/10.1145/1780.1802

by Dencker, Duerre, and Heuft on Optimization of Parser Tables for Portable Compilers.

They investigated the main techniques for compressing said tables, with particular interest in ways of using "coloring" (assigning multiple entries to the same location by indexing into a color table).

From my experience with lexing Unicode (which is admittedly quite limited), most grammars have long sequential runs of Unicode code points that all belong to the same set, e.g. they are alphabetic characters, digits, operators, punctuation, etc., or are disallowed. Once the code points are grouped into those sets, the actual lexing tables are mostly compact. Now, that doesn't necessarily help you map your UTF-8 (et al.) input into those sets, although my guess is that it is simpler than it seems, as there is regularity there.

And, if you look at the techniques the authors presented, you can combine one or two of them and get a method that is relatively efficient in both space and time.

If you do some experiments and want review, critique, or suggestions, you may either post here or email me directly. I would be interested in a space- and time-efficient solution myself, as I intend to make my next lexer generator for Yacc++ Unicode aware and perhaps even Unicode centric.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
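The grouping idea in the post (map long runs of code points to a handful of sets, then index the lexer table by set rather than by code point) can be illustrated with a minimal Python sketch. This is not the author's code and not one of the methods from the cited paper. The class names, the `RANGES` table, the toy identifier/number grammar, and the helpers `char_class` and `tokenize` are all assumptions made up for illustration. The sketch also works on an already decoded Python `str`, so it sidesteps the UTF-8 mapping question the post raises.

```python
import bisect

# Hypothetical character classes for a toy grammar (illustrative only).
OTHER, ALPHA, DIGIT, SPACE = range(4)

# Sorted, non-overlapping (first, last, class) code point ranges.
# Any code point not covered by a range falls back to OTHER.
RANGES = [
    (0x09, 0x0D, SPACE),
    (0x20, 0x20, SPACE),
    (0x30, 0x39, DIGIT),
    (0x41, 0x5A, ALPHA),
    (0x5F, 0x5F, ALPHA),
    (0x61, 0x7A, ALPHA),
    (0x370, 0x3FF, ALPHA),    # Greek block, treated wholesale as alphabetic (a simplification)
    (0x4E00, 0x9FFF, ALPHA),  # CJK Unified Ideographs
]
STARTS = [first for first, _, _ in RANGES]


def char_class(cp):
    """Map a code point to its class by binary search over the range table."""
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return OTHER


# DFA states; DEAD means "no transition".
START, IDENT, NUMBER = range(3)
DEAD = -1

# TRANSITIONS[state][class]: 3 states x 4 classes instead of
# 3 states x 0x110000 code points.
TRANSITIONS = [
    # OTHER  ALPHA  DIGIT   SPACE
    [DEAD,  IDENT, NUMBER, DEAD],  # START
    [DEAD,  IDENT, IDENT,  DEAD],  # IDENT
    [DEAD,  DEAD,  NUMBER, DEAD],  # NUMBER
]
TOKEN_NAMES = {IDENT: "IDENT", NUMBER: "NUMBER"}


def tokenize(text):
    """Longest-match tokenizer driven by the class-indexed table."""
    tokens = []
    pos = 0
    while pos < len(text):
        if char_class(ord(text[pos])) == SPACE:
            pos += 1
            continue
        state, end = START, pos
        while end < len(text):
            nxt = TRANSITIONS[state][char_class(ord(text[end]))]
            if nxt == DEAD:
                break
            state, end = nxt, end + 1
        if state == START:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((TOKEN_NAMES[state], text[pos:end]))
        pos = end
    return tokens


print(tokenize("x1 42 αβγ 漢字"))
```

With only eight ranges, the table covers ASCII, Greek, and roughly 21,000 CJK ideographs, and the transition table stays at four columns. A real grammar would presumably derive the ranges from the Unicode Character Database rather than hard-coding blocks, and could trade the binary search for a two-level lookup table if the per-character cost matters.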