Groups > comp.compilers > #2394 > unrolled thread

Fragments

Started by	Andy <borucki.andrzej@gmail.com>
First post	2019-12-21 01:52 -0800
Last post	2019-12-22 11:08 -0800
Articles	3 — 3 participants

Back to article view | Back to comp.compilers

  Fragments Andy <borucki.andrzej@gmail.com> - 2019-12-21 01:52 -0800
    Re: Fragments Kaz Kylheku <493-878-3164@kylheku.com> - 2019-12-21 20:07 +0000
    Re: Fragments Ben Hanson <jamin.hanson@googlemail.com> - 2019-12-22 11:08 -0800

#2394 — Fragments

From	Andy <borucki.andrzej@gmail.com>
Date	2019-12-21 01:52 -0800
Subject	Fragments
Message-ID	<19-12-013@comp.compilers>

In examples is usually used very small alphabet: 3 to 5 letters but in
lexical analysing is not only Ascii but many thousands of Unicode.
Many chars are grouped by the same action: for example digits->a
letter->b whitepsaces->c
We can use "fragments" [A-Za-z], [0-9] instead of alone letters.
Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc.

How to handle with not disjoint fragments? on input we get regular
expression in Posix standard and we want make DFA with a few
transitions.

[toc] | [next] | [standalone]

#2396

From	Kaz Kylheku <493-878-3164@kylheku.com>
Date	2019-12-21 20:07 +0000
Message-ID	<19-12-015@comp.compilers>
In reply to	#2394

On 2019-12-21, Andy <borucki.andrzej@gmail.com> wrote:
> In examples is usually used very small alphabet: 3 to 5 letters but in
> lexical analysing is not only Ascii but many thousands of Unicode.
> Many chars are grouped by the same action: for example digits->a
> letter->b whitepsaces->c
> We can use "fragments" [A-Za-z], [0-9] instead of alone letters.
> Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc.
>
> How to handle with not disjoint fragments? on input we get regular
> expression in Posix standard and we want make DFA with a few
> transitions.

In the TXR language, I wrote a regex engine to handle the full Unicode
range of characters.

To represent character ranges, I wrote a polymorphic "char set" data
type in C. See:

http://www.kylheku.com/cgit/txr/tree/regex.c

The main declaration looks like:

  typedef union char_set {
    struct any_char_set any;
    struct small_char_set s;
    struct displaced_char_set d;
    struct large_char_set l;
  #ifdef FULL_UNICODE
    struct xlarge_char_set xl;
  #endif
  } char_set_t;

Each of the types included in the union has a type field in the same
place which indicates which type of set it is.

The different set types are geared toward different situations.

The small_char_set is a 256 entry bitmask; it handles the ASCII/ISO
Latin 8 bit range.

The large_char_set has a two-level tree structure: basically a sparsely
populated array of bitmasks.

The xlarge_char_set has a three-level structure.

the displaced_char_set is also a small bitmask, but with an offset,
so it can represent characters anywhere in the map, if they fall into
the same 256-wide window.

[toc] | [prev] | [next] | [standalone]

#2398

From	Ben Hanson <jamin.hanson@googlemail.com>
Date	2019-12-22 11:08 -0800
Message-ID	<19-12-017@comp.compilers>
In reply to	#2394

Have a string token type that auto normalises character ranges.

See https://github.com/BenHanson/lexertl14/blob/master/include/lexertl/string_token.hpp
Regards,

Ben

On Saturday, 21 December 2019 18:10:26 UTC, Andy  wrote:
> We can use "fragments" [A-Za-z], [0-9] instead of alone letters.
> Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc.

[toc] | [prev] | [standalone]

csiph-web

Fragments

Contents

#2394 — Fragments

#2396

#2398