Re: Lexing Unicode strings?

Path	csiph.com!tncsrv06.tnetconsulting.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From	gah4 <gah4@u.washington.edu>
Newsgroups	comp.compilers
Subject	Re: Lexing Unicode strings?
Date	Tue, 4 May 2021 01:11:51 -0700 (PDT)
Organization	Compilers Central
Lines	54
Sender	news@iecc.com
Approved	comp.compilers@iecc.com
Message-ID	<21-05-002@comp.compilers> (permalink)
References	<21-05-001@comp.compilers>
Mime-Version	1.0
Content-Type	text/plain; charset="UTF-8"
Content-Transfer-Encoding	8bit
Injection-Info	gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="48110"; mail-complaints-to="abuse@iecc.com"
Keywords	lex, i18n, comment
Posted-Date	04 May 2021 11:33:47 EDT
X-submission-address	compilers@iecc.com
X-moderator-address	compilers-request@iecc.com
X-FAQ-and-archives	http://compilers.iecc.com
In-Reply-To	<21-05-001@comp.compilers>
Xref	csiph.com comp.compilers:2654

Show key headers only | View raw

On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson
wrote:
> On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

Snip.

> > [The obvious approach if you're scaning UTF-8 text is to keep treating the input as
> > a sequence of bytes. UTF-8 was designed so that no character representation is a
> > prefix or suffix of any other character, so it should work without having to be clever
> > -John]

> That's not always feasible, nor the right approach. Let's consider the
> range of all lowercase greek letters. In the source file, that range
> will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
> not to match the bytes \xCE, \xB1..\xCF, and \x89.

Ranges that are ranges of bytes in ASCII won't necessarily be in Unicode.
You needs some Boolean logic in your matching, though that shouldn't be so
hard.

> There is also the question of validating the input. It seems more
> natural to put the overlong sequence validator, and legal code point
> validator into the lexer, rather than preprocess the source file.

This reminds me of the question, which I don't remember where I saw,
about applying Boyer-Moore to Unicode.  One answer is, as John notes,
to apply it to the UTF-8 bytes.  But the whole idea behind Boyer-Moore
is to use the unequal probability distribution of characters, to quickly
skip over impossible matches.  But the bytes of UTF-8 don't have the
same (im)probabiity of the individual characters.

It seems to me that you lose some of the speed advantage of Boyer-Moore,
but maybe it is still plenty fast enough.

On the other hand, Java uses a 16 bit char, and one should be able to apply
Boyer-Moore just as well in that case.  The tables will be bigger, but then
so are computer memories today.

So, as with many problems, there are trade-offs between speed and size,
and one has to choose the best case for the specific problem.

Note that in addition to have a 16 bit Unicode char, the Java language
itself is defined in terms of Unicode. Variable names can be any Unicode
letter, followed by Unicode letters and digits.  Presumably, then, the
designers
of Java compilers have figured this out, I suspect using the 16 bit char.

One can, for example, have variables named A and Α in the same program.
(In case you can't see it, the second one is an Alpha.)

Yes, Unicode can be fun!
[Remember that Unicode is a 20 bit code and for characters outside the first 64K,
Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8 seem clean and beautiful. -John]

Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

Re: Lexing Unicode strings? "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> - 2021-05-03 19:58 -0400
  Re: Lexing Unicode strings? gah4 <gah4@u.washington.edu> - 2021-05-04 01:11 -0700
    Re: Lexing Unicode strings? gah4 <gah4@u.washington.edu> - 2021-05-04 14:47 -0700
  Re: Lexing Unicode strings? Hans Aberg <haberg-news@telia.com> - 2021-07-14 15:39 -0400

csiph-web