Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2396
| From | Kaz Kylheku <493-878-3164@kylheku.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Re: Fragments |
| Date | 2019-12-21 20:07 +0000 |
| Organization | Aioe.org NNTP Server |
| Message-ID | <19-12-015@comp.compilers> (permalink) |
| References | <19-12-013@comp.compilers> |
On 2019-12-21, Andy <borucki.andrzej@gmail.com> wrote:
> In examples is usually used very small alphabet: 3 to 5 letters but in
> lexical analysing is not only Ascii but many thousands of Unicode.
> Many chars are grouped by the same action: for example digits->a
> letter->b whitepsaces->c
> We can use "fragments" [A-Za-z], [0-9] instead of alone letters.
> Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc.
>
> How to handle with not disjoint fragments? on input we get regular
> expression in Posix standard and we want make DFA with a few
> transitions.
In the TXR language, I wrote a regex engine to handle the full Unicode
range of characters.
To represent character ranges, I wrote a polymorphic "char set" data
type in C. See:
http://www.kylheku.com/cgit/txr/tree/regex.c
The main declaration looks like:
typedef union char_set {
struct any_char_set any;
struct small_char_set s;
struct displaced_char_set d;
struct large_char_set l;
#ifdef FULL_UNICODE
struct xlarge_char_set xl;
#endif
} char_set_t;
Each of the types included in the union has a type field in the same
place which indicates which type of set it is.
The different set types are geared toward different situations.
The small_char_set is a 256 entry bitmask; it handles the ASCII/ISO
Latin 8 bit range.
The large_char_set has a two-level tree structure: basically a sparsely
populated array of bitmasks.
The xlarge_char_set has a three-level structure.
the displaced_char_set is also a small bitmask, but with an offset,
so it can represent characters anywhere in the map, if they fall into
the same 256-wide window.
Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar
Fragments Andy <borucki.andrzej@gmail.com> - 2019-12-21 01:52 -0800 Re: Fragments Kaz Kylheku <493-878-3164@kylheku.com> - 2019-12-21 20:07 +0000 Re: Fragments Ben Hanson <jamin.hanson@googlemail.com> - 2019-12-22 11:08 -0800
csiph-web