From: Hans Aberg
Newsgroups: comp.compilers
Subject: Re: Lexing Unicode strings?
Date: Wed, 14 Jul 2021 15:39:25 -0400 (EDT)
Approved: comp.compilers@iecc.com
Message-ID: <21-07-002@comp.compilers>
References: <21-05-001@comp.compilers>
Keywords: lex, i18n

On 2021-05-04 01:58, John Levine wrote:
> [I still think doing UTF-8 as bytes would work fine. Since no UTF-8
> encoding is a prefix or suffix of any other UTF-8 encoding, you can
> lex them the same way you'd lex strings of ASCII. In that example
> above, \xCE, \xB1..\xCF, and \x89 can never appear alone in UTF-8,
> only as part of a multi-byte sequence, so if they do, you can put a
> wildcard . at the end to match bogus bytes and complain about an
> invalid character. Dunno what you mean about not always UTF-8; I
> realize there are mislabeled files of UTF-16 that you have to sort
> out by sniffing the BOM at the front, but you do that and turn
> whatever you're getting into UTF-8 and then feed it to the lexer.
>
> I agree that lexing Unicode is not a solved problem, and I'm not
> aware of any really good ways to limit the table sizes. -John]

I wrote code, in Haskell and C++, that translates Unicode character
classes into byte classes.
From a theoretical standpoint, a Unicode regular language mapped through UTF-8 is still a regular language over bytes, so the translation is always possible. In particular, the 2^8 = 256-entry byte tables that Flex uses are enough. The Flex manual has an example of how to adapt a scanner for this, replacing the dot '.' with a pattern that matches any legal UTF-8 byte sequence.