Path: csiph.com!xmission!news.snarked.org!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end From: Kaz Kylheku <493-878-3164@kylheku.com> Newsgroups: comp.compilers Subject: Re: Fragments Date: Sat, 21 Dec 2019 20:07:49 +0000 (UTC) Organization: Aioe.org NNTP Server Lines: 48 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <19-12-015@comp.compilers> References: <19-12-013@comp.compilers> Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="6270"; mail-complaints-to="abuse@iecc.com" Keywords: lex, i18n Posted-Date: 21 Dec 2019 15:28:25 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: csiph.com comp.compilers:2396 On 2019-12-21, Andy wrote: > In examples is usually used very small alphabet: 3 to 5 letters but in > lexical analysing is not only Ascii but many thousands of Unicode. > Many chars are grouped by the same action: for example digits->a > letter->b whitepsaces->c > We can use "fragments" [A-Za-z], [0-9] instead of alone letters. > Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc. > > How to handle with not disjoint fragments? on input we get regular > expression in Posix standard and we want make DFA with a few > transitions. In the TXR language, I wrote a regex engine to handle the full Unicode range of characters. To represent character ranges, I wrote a polymorphic "char set" data type in C. See: http://www.kylheku.com/cgit/txr/tree/regex.c The main declaration looks like: typedef union char_set { struct any_char_set any; struct small_char_set s; struct displaced_char_set d; struct large_char_set l; #ifdef FULL_UNICODE struct xlarge_char_set xl; #endif } char_set_t; Each of the types included in the union has a type field in the same place which indicates which type of set it is. The different set types are geared toward different situations. The small_char_set is a 256 entry bitmask; it handles the ASCII/ISO Latin 8 bit range. The large_char_set has a two-level tree structure: basically a sparsely populated array of bitmasks. The xlarge_char_set has a three-level structure. the displaced_char_set is also a small bitmask, but with an offset, so it can represent characters anywhere in the map, if they fall into the same 256-wide window.