Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2394 > unrolled thread
| Started by | Andy <borucki.andrzej@gmail.com> |
|---|---|
| First post | 2019-12-21 01:52 -0800 |
| Last post | 2019-12-22 11:08 -0800 |
| Articles | 3 — 3 participants |
Back to article view | Back to comp.compilers
Fragments Andy <borucki.andrzej@gmail.com> - 2019-12-21 01:52 -0800
Re: Fragments Kaz Kylheku <493-878-3164@kylheku.com> - 2019-12-21 20:07 +0000
Re: Fragments Ben Hanson <jamin.hanson@googlemail.com> - 2019-12-22 11:08 -0800
| From | Andy <borucki.andrzej@gmail.com> |
|---|---|
| Date | 2019-12-21 01:52 -0800 |
| Subject | Fragments |
| Message-ID | <19-12-013@comp.compilers> |
In examples is usually used very small alphabet: 3 to 5 letters but in lexical analysing is not only Ascii but many thousands of Unicode. Many chars are grouped by the same action: for example digits->a letter->b whitepsaces->c We can use "fragments" [A-Za-z], [0-9] instead of alone letters. Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc. How to handle with not disjoint fragments? on input we get regular expression in Posix standard and we want make DFA with a few transitions.
[toc] | [next] | [standalone]
| From | Kaz Kylheku <493-878-3164@kylheku.com> |
|---|---|
| Date | 2019-12-21 20:07 +0000 |
| Message-ID | <19-12-015@comp.compilers> |
| In reply to | #2394 |
On 2019-12-21, Andy <borucki.andrzej@gmail.com> wrote:
> In examples is usually used very small alphabet: 3 to 5 letters but in
> lexical analysing is not only Ascii but many thousands of Unicode.
> Many chars are grouped by the same action: for example digits->a
> letter->b whitepsaces->c
> We can use "fragments" [A-Za-z], [0-9] instead of alone letters.
> Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc.
>
> How to handle with not disjoint fragments? on input we get regular
> expression in Posix standard and we want make DFA with a few
> transitions.
In the TXR language, I wrote a regex engine to handle the full Unicode
range of characters.
To represent character ranges, I wrote a polymorphic "char set" data
type in C. See:
http://www.kylheku.com/cgit/txr/tree/regex.c
The main declaration looks like:
typedef union char_set {
struct any_char_set any;
struct small_char_set s;
struct displaced_char_set d;
struct large_char_set l;
#ifdef FULL_UNICODE
struct xlarge_char_set xl;
#endif
} char_set_t;
Each of the types included in the union has a type field in the same
place which indicates which type of set it is.
The different set types are geared toward different situations.
The small_char_set is a 256 entry bitmask; it handles the ASCII/ISO
Latin 8 bit range.
The large_char_set has a two-level tree structure: basically a sparsely
populated array of bitmasks.
The xlarge_char_set has a three-level structure.
the displaced_char_set is also a small bitmask, but with an offset,
so it can represent characters anywhere in the map, if they fall into
the same 256-wide window.
[toc] | [prev] | [next] | [standalone]
| From | Ben Hanson <jamin.hanson@googlemail.com> |
|---|---|
| Date | 2019-12-22 11:08 -0800 |
| Message-ID | <19-12-017@comp.compilers> |
| In reply to | #2394 |
Have a string token type that auto normalises character ranges. See https://github.com/BenHanson/lexertl14/blob/master/include/lexertl/string_token.hpp Regards, Ben On Saturday, 21 December 2019 18:10:26 UTC, Andy wrote: > We can use "fragments" [A-Za-z], [0-9] instead of alone letters. > Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.compilers
csiph-web