Groups > comp.compilers > #2653 > unrolled thread

Re: Lexing Unicode strings?

Started by	"Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com>
First post	2021-05-03 19:58 -0400
Last post	2021-07-14 15:39 -0400
Articles	4 — 3 participants

Back to article view | Back to comp.compilers

  Re: Lexing Unicode strings? "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> - 2021-05-03 19:58 -0400
    Re: Lexing Unicode strings? gah4 <gah4@u.washington.edu> - 2021-05-04 01:11 -0700
      Re: Lexing Unicode strings? gah4 <gah4@u.washington.edu> - 2021-05-04 14:47 -0700
    Re: Lexing Unicode strings? Hans Aberg <haberg-news@telia.com> - 2021-07-14 15:39 -0400

#2653 — Re: Lexing Unicode strings?

From	"Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com>
Date	2021-05-03 19:58 -0400
Subject	Re: Lexing Unicode strings?
Message-ID	<21-05-001@comp.compilers>

On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

> Dear c.compilers,

> For context, I have been reading the old book Compiler design in C
> by Allen Holub; available here

> https://holub.com/compiler/

> and it goes into the details of the author's own LeX implementation.

Snip.

> [The obvious approach if you're scaning UTF-8 text is to keep treating the input as
> a sequence of bytes.  UTF-8 was designed so that no character representation is a
> prefix or suffix of any other character, so it should work without having to be clever
> -John]

That's not always feasible, nor the right approach.  Let's consider the
range of all lowercase greek letters.  In the source file, that range
will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
not to match the bytes \xCE, \xB1..\xCF, and \x89.

There is also the question of validating the input.  It seems more
natural to put the overlong sequence validator, and legal code point
validator into the lexer, rather than preprocess the source file.

But maybe that's "premature optimization?"

Utf-8 has more problems too.  There are bytes that can only appear after
other bytes, and detecting that early would be nice[0]; there is also the fact
that some glyphs can be constructed with a single code point, or two.

Also, Unicode input is not always utf-8.

When Holub is constructing his state machines for the regular
expressions, he uses bitmasks for the ranges[1].  ASCII, being 128
bits in a mask, is reasonable; even 256 bits for 8bit characters.

0x11000 bits in a mask feels excessive to me, though.

I was maybe naive, when I thought Unicode was considered a
"solved problem."  I guess everyone is using either libraries like
ICU or run time environments.  These can have the following
"problem:"  Invalid input may not be signalled to the higher level
code, but quietly replaced with the replacement symbol, U+FFFD.
Which is, in my opinion, not good for a compiler.

[0] This can be part of output lexer tables, and is harder to
do manually when bootstrapping the lexer by hand.

[1] His implementation of sets.
--
Johann
[I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
is a prefix or suffix of any other UTF-8 encoding, you can lex them
the same way you'd lex strings of ASCII. In that example above, \xCE,
\xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
a multi-byte sequence, so if they do, you can put a wildcard . at the
end to match bogus bytes and complain about an invalid character.  Dunno
what you mean about not always UTF-8; I realize there are mislabeled
files of UTF-16 that you have to sort out by sniffing the BOM at the
front, but you do that and turn whatever you're getting into UTF-8
and then feed it to the lexer.

I agree that lexing Unicode is not a solved problem, and I'm not
aware of any really good ways to limit the table sizes. -John]

[toc] | [next] | [standalone]

#2654

From	gah4 <gah4@u.washington.edu>
Date	2021-05-04 01:11 -0700
Message-ID	<21-05-002@comp.compilers>
In reply to	#2653

On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson
wrote:
> On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:

Snip.

> > [The obvious approach if you're scaning UTF-8 text is to keep treating the input as
> > a sequence of bytes. UTF-8 was designed so that no character representation is a
> > prefix or suffix of any other character, so it should work without having to be clever
> > -John]

> That's not always feasible, nor the right approach. Let's consider the
> range of all lowercase greek letters. In the source file, that range
> will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
> not to match the bytes \xCE, \xB1..\xCF, and \x89.

Ranges that are ranges of bytes in ASCII won't necessarily be in Unicode.
You needs some Boolean logic in your matching, though that shouldn't be so
hard.

> There is also the question of validating the input. It seems more
> natural to put the overlong sequence validator, and legal code point
> validator into the lexer, rather than preprocess the source file.

This reminds me of the question, which I don't remember where I saw,
about applying Boyer-Moore to Unicode.  One answer is, as John notes,
to apply it to the UTF-8 bytes.  But the whole idea behind Boyer-Moore
is to use the unequal probability distribution of characters, to quickly
skip over impossible matches.  But the bytes of UTF-8 don't have the
same (im)probabiity of the individual characters.

It seems to me that you lose some of the speed advantage of Boyer-Moore,
but maybe it is still plenty fast enough.

On the other hand, Java uses a 16 bit char, and one should be able to apply
Boyer-Moore just as well in that case.  The tables will be bigger, but then
so are computer memories today.

So, as with many problems, there are trade-offs between speed and size,
and one has to choose the best case for the specific problem.

Note that in addition to have a 16 bit Unicode char, the Java language
itself is defined in terms of Unicode. Variable names can be any Unicode
letter, followed by Unicode letters and digits.  Presumably, then, the
designers
of Java compilers have figured this out, I suspect using the 16 bit char.

One can, for example, have variables named A and Α in the same program.
(In case you can't see it, the second one is an Alpha.)

Yes, Unicode can be fun!
[Remember that Unicode is a 20 bit code and for characters outside the first 64K,
Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8 seem clean and beautiful. -John]

[toc] | [prev] | [next] | [standalone]

#2656

From	gah4 <gah4@u.washington.edu>
Date	2021-05-04 14:47 -0700
Message-ID	<21-05-004@comp.compilers>
In reply to	#2654

On Tuesday, May 4, 2021 at 8:33:50 AM UTC-7, gah4 wrote:

(snip, I wrote)

> Note that in addition to have a 16 bit Unicode char, the Java language
> itself is defined in terms of Unicode. Variable names can be any Unicode
> letter, followed by Unicode letters and digits. Presumably, then, the
> designers of Java compilers have figured this out, I suspect using the 16 bit char.

(snip)

> Yes, Unicode can be fun!
> [Remember that Unicode is a 20 bit code and for characters outside the first 64K,
> Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8
seem clean and beautiful. -John]

I did know that Java used 16 bits, but never tried to figure out what they did
with the rest of the characters.  There should be enough in the first 64K for writing
programs.

I did once use π for a variable name, with the obvious value.  It seems it is \u03c0.
I even found an editor that allowed entering such characters, and then would write
out the file with \u escapes.  As far as I know, that is more usual than UTF-8.

I believe that the Java parser converts from \u escapes fairly
early, such that you can quote strings with \u0022, and then you
should be able to put \uu0022 inside the strings.

[If you're only going to allow the lower 64K, your users will be sad
when they try to use quoted strings with uncommon Chinese characters
or with emoji, or more likely your compiler will barf since they will
be encoded as two surrogate characters and your lexer won't know what
to do with them. If you're going to deal with Unicode, better bite the
bullet and deal with the whole mess. -John]

[toc] | [prev] | [next] | [standalone]

#2677

From	Hans Aberg <haberg-news@telia.com>
Date	2021-07-14 15:39 -0400
Message-ID	<21-07-002@comp.compilers>
In reply to	#2653

On 2021-05-04 01:58, John Levine wrote:
> [I still think doing UTF-8 as bytes would work fine. Since no UTF-8 encoding
> is a prefix or suffix of any other UTF-8 encoding, you can lex them
> the same way you'd lex strings of ASCII. In that example above, \xCE,
> \xB1..\xCF, and \x89 can never appear alone in UTF-8, only as part of
> a multi-byte sequence, so if they do, you can put a wildcard . at the
> end to match bogus bytes and complain about an invalid character.  Dunno
> what you mean about not always UTF-8; I realize there are mislabeled
> files of UTF-16 that you have to sort out by sniffing the BOM at the
> front, but you do that and turn whatever you're getting into UTF-8
> and then feed it to the lexer.
>
> I agree that lexing Unicode is not a solved problem, and I'm not
> aware of any really good ways to limit the table sizes. -John]

I wrote code, in Haskell and C++, that translates Unicode character
classes into byte classes. From a theoretical standpoint, a Unicode
regular language mapped under UTF-8 is a byte regular language, so it is
possible. So the 2^8 = 256 size tables that Flex uses is enough. The
Flex manual has an example how to make a regular expression replacing
its dot '.' to pick up all legal UTF-8 bytes.

[toc] | [prev] | [standalone]

csiph-web

Re: Lexing Unicode strings?

Contents

#2653 — Re: Lexing Unicode strings?

#2654

#2656

#2677