


Re: Lexing Unicode strings?

Path csiph.com!tncsrv06.tnetconsulting.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From gah4 <gah4@u.washington.edu>
Newsgroups comp.compilers
Subject Re: Lexing Unicode strings?
Date Tue, 4 May 2021 01:11:51 -0700 (PDT)
Organization Compilers Central
Lines 54
Sender news@iecc.com
Approved comp.compilers@iecc.com
Message-ID <21-05-002@comp.compilers> (permalink)
References <21-05-001@comp.compilers>
Mime-Version 1.0
Content-Type text/plain; charset="UTF-8"
Content-Transfer-Encoding 8bit
Injection-Info gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="48110"; mail-complaints-to="abuse@iecc.com"
Keywords lex, i18n, comment
Posted-Date 04 May 2021 11:33:47 EDT
X-submission-address compilers@iecc.com
X-moderator-address compilers-request@iecc.com
X-FAQ-and-archives http://compilers.iecc.com
In-Reply-To <21-05-001@comp.compilers>
Xref csiph.com comp.compilers:2654



On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson
wrote:
> On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:


Snip.

> > [The obvious approach if you're scanning UTF-8 text is to keep treating the input as
> > a sequence of bytes. UTF-8 was designed so that no character representation is a
> > prefix or suffix of any other character, so it should work without having to be clever
> > -John]

> That's not always feasible, nor the right approach. Let's consider the
> range of all lowercase greek letters. In the source file, that range
> will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
> not to match the bytes \xCE, \xB1..\xCF, and \x89.

A range that is a contiguous run of bytes in ASCII won't necessarily be
one in UTF-8. You need some Boolean logic in your matching, though that
shouldn't be so hard.
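
As a rough sketch of that Boolean logic (in Python, with names made up
for illustration), the lowercase Greek range U+03B1..U+03C9 splits into
two byte-level alternatives, one per lead byte:

```python
import re

# Lowercase Greek alpha..omega as a code point range (U+03B1..U+03C9).
CODE_POINT_RANGE = re.compile("[\u03b1-\u03c9]")

# The same set over UTF-8 bytes: each lead byte gets its own
# continuation-byte range, and the two are joined by alternation.
#   U+03B1..U+03BF  ->  CE B1 .. CE BF
#   U+03C0..U+03C9  ->  CF 80 .. CF 89
BYTE_RANGE = re.compile(rb"\xCE[\xB1-\xBF]|\xCF[\x80-\x89]")


def matches_as_code_point(ch):
    """True if the single character ch lies in the code point range."""
    return CODE_POINT_RANGE.fullmatch(ch) is not None


def matches_as_bytes(ch):
    """True if the UTF-8 encoding of ch matches the byte-level pattern."""
    return BYTE_RANGE.fullmatch(ch.encode("utf-8")) is not None
```

Here matches_as_bytes("λ") is True and matches_as_bytes("Α") (capital
Alpha, CE 91) is False, the same answers the code point range gives.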

> There is also the question of validating the input. It seems more
> natural to put the overlong sequence validator, and legal code point
> validator into the lexer, rather than preprocess the source file.
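
For reference, the two checks mentioned in the quoted paragraph could
look something like this when done while decoding. This is only a sketch
in Python; decode_one is a hypothetical helper, not anything from an
existing lexer generator:

```python
def decode_one(buf, i=0):
    """Decode one code point from the bytes object buf starting at index i.

    Returns (code_point, next_index). Raises ValueError on malformed input.
    Assumes i is a valid index into buf.
    """
    b0 = buf[i]
    if b0 < 0x80:
        return b0, i + 1
    # Sequence length, payload bits of the lead byte, and the smallest
    # code point that legitimately needs a sequence of this length.
    if b0 & 0xE0 == 0xC0:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif b0 & 0xF0 == 0xE0:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    elif b0 & 0xF8 == 0xF0:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if i + length > len(buf):
        raise ValueError("truncated sequence")
    for b in buf[i + 1:i + length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    # Overlong check: the value would have fit in a shorter sequence.
    if cp < minimum:
        raise ValueError("overlong encoding")
    # Legal code point check: no surrogates, nothing past U+10FFFF.
    if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        raise ValueError("illegal code point")
    return cp, i + length
```

With this, b"\xC0\xAF" (an overlong "/") and b"\xED\xA0\x80" (an encoded
surrogate) are both rejected, while "α" decodes to (0x3B1, 2).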

This reminds me of the question, which I don't remember where I saw,
about applying Boyer-Moore to Unicode.  One answer is, as John notes,
to apply it to the UTF-8 bytes.  But the whole idea behind Boyer-Moore
is to use the unequal probability distribution of characters, to quickly
skip over impossible matches.  But the bytes of UTF-8 don't have the
same (im)probability as the individual characters.

It seems to me that you lose some of the speed advantage of Boyer-Moore,
but maybe it is still plenty fast enough.
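
To make the byte-level version concrete, here is a small sketch in
Python. Note that it is the Horspool simplification (bad-character table
only), not full Boyer-Moore, and horspool_find is just an illustrative
name:

```python
def horspool_find(haystack, needle):
    """Return the byte offset of the first match of needle in haystack, or -1.

    Both arguments are bytes objects, e.g. UTF-8 encoded text.
    """
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Bad-character table over the 256 possible byte values: how far the
    # window may move when a given byte is aligned with its last position.
    shift = [m] * 256
    for k in range(m - 1):
        shift[needle[k]] = m - 1 - k
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        i += shift[haystack[i + m - 1]]
    return -1


text = "scan the Greek letters αβγ then ωμεγα".encode("utf-8")
pos = horspool_find(text, "ωμεγα".encode("utf-8"))
# pos is a byte offset; decoding the prefix gives the character index.
char_index = len(text[:pos].decode("utf-8"))
```

The table has only 256 entries, but the shifts are counted in bytes, and
many different characters share the same lead and continuation bytes
(every Greek letter here starts with CE or CF), so I would expect fewer
long skips than with a per-character table.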

On the other hand, Java uses a 16 bit char, and one should be able to apply
Boyer-Moore just as well in that case.  The tables will be bigger, but then
so are computer memories today.
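
The 16 bit variant might be sketched like this, again as a Horspool-style
simplification in Python rather than anything Java actually does. The
helper names are made up, and a dict stands in for what would otherwise
be a 65536-entry shift table:

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a tuple of ints."""
    data = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)


def horspool_units(haystack, needle):
    """Return the code-unit index of the first match of needle, or -1.

    Both arguments are tuples of 16 bit code units.
    """
    m = len(needle)
    if m == 0:
        return 0
    # Sparse bad-character table: units not in the needle shift by m.
    shift = {}
    for k in range(m - 1):
        shift[needle[k]] = m - 1 - k
    i = 0
    while i + m <= len(haystack):
        if haystack[i:i + m] == needle:
            return i
        i += shift.get(haystack[i + m - 1], m)
    return -1
```

Storing only the units that occur in the pattern is one way to trade a
little lookup time for table size; a flat 65536-entry array is the other
obvious choice.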

So, as with many problems, there are trade-offs between speed and size,
and one has to choose the best case for the specific problem.

Note that in addition to having a 16 bit Unicode char, the Java language
itself is defined in terms of Unicode. Variable names can be any Unicode
letter, followed by Unicode letters and digits.  Presumably, then, the
designers of Java compilers have figured this out, I suspect using the
16 bit char.

One can, for example, have variables named A and Α in the same program.
(In case you can't see it, the second one is an Alpha.)
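
The same thing can be checked in Python, used here only as a stand-in
since its identifiers are also defined in terms of Unicode (describe is
a throwaway helper for this sketch):

```python
import unicodedata

latin_a = "A"            # U+0041
greek_alpha = "\u0391"   # U+0391, looks the same in most fonts


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


# Both are legal identifiers, and they are different identifiers.
both_legal = latin_a.isidentifier() and greek_alpha.isidentifier()
same_name = latin_a == greek_alpha

print(describe(latin_a))      # U+0041 LATIN CAPITAL LETTER A
print(describe(greek_alpha))  # U+0391 GREEK CAPITAL LETTER ALPHA
```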

Yes, Unicode can be fun!
[Remember that Unicode is a 21 bit code and for characters outside the first 64K,
Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8 seem clean and beautiful. -John]
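
For anyone who has not met surrogates, the arithmetic is roughly as
follows. This is a Python sketch of the standard UTF-16 encoding rule;
to_utf16_units is a name invented for the example:

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units (one or two ints) for code point cp."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        # Characters in the first 64K are a single 16 bit unit.
        return [cp]
    # Everything else: subtract 0x10000, leaving 20 bits, and split them
    # into a high surrogate (top 10 bits) and a low surrogate (low 10 bits).
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

For example, U+1F600 comes out as the pair D83D DE00, while U+0391
(Alpha) stays a single unit.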


