Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.compilers > #2970
| From | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | RE: Integer sizes and DFAs |
| Date | 2022-03-27 15:02 +0300 |
| Organization | Compilers Central |
| Message-ID | <22-03-076@comp.compilers> (permalink) |
| References | <22-03-073@comp.compilers> <22-03-074@comp.compilers> |
@gah4: Ok, The state space for the whole GenBank is larger. And, if we ever get to the point where we can deal with epigenomic factors, we probably need extra coding space for that too. But, I still think it is likely that we will be in the range where we can make "disks" (or "disk arrays") large enough for them and use 64 bit integers to address them. And, wandering off into the space of hardware implementations. At Intel we were considering doing that circa 2010, so a bit later than you were. In particular, we were looking to see if we could adapt our technology to do the "FASTA" algorithm. Like you, we never brought that to market. But, I want to make a couple of slight points on the encoding of transitions. If one is using byte addressing and one always puts the states on aligned boundaries, one has a few low order bits in the address that will always be 0. You can tweak one of those to be your "match complete/final" bit and then mask that off when using it to address the next state. Alternately, one keeps a separate list of "final states". That is more common in implementations. And, the third choice is to have a state header with extra data, like whether the state is final or not. The last two alternatives expand your memory footprint somewhat, but by a relatively minimal amount. Now, if you don't use byte addressing, but use "word" addressing (a la most computers in the 1960s), you actually have shrunk your address requirement. You've moved those 0 bits from the bottom of the address to the top. Even more so, if you use "state" addressing. Of course, if your hardware has byte addressable memory, you may have to convert those addresses into byte address before going to actual memory. But all of that doesn't involve extra memory lookups, just some small amount of integer/bit arithmetic which on modern computers is essentially irrelevant and gets swamped by pipeline stalls waiting for the cache. Per your idea of keeping them on disk, you can "split" a 64 bit address into a "page address" and an "address within a page" and use the paging mechanism to bring in the relevant pages. I believe some "huge page" implementations actually do that to work well with TLBs and caches. Anyway, what you proposed is not far fetched. Beyond that there are lots of tricks one can play. We thought for a while of Huffman encoding the addresses and using relative pointers. If you are doing something like the Aho-Corasick algorithm, pointers that "match" and lead further on in the machine, so they are always positive numbers, while "failure" pointers are the ones that point backward and those do so to a limited number of states, where absolute addressing makes sense. There is more you can do along those lines, but those are the basic steps. Now, in actual genes there are "network expressions" and "spacers". The network expressions can be segments that are repeated, and that leads to some extra backward links, but the number of them is also small (and those are relative and not failure links). --------- Still, none of this impacts the fact, that when doing analysis, you can still treat those memory references as O(1) and say your lexer is linear in terms of bytes processed per second (and that having more states doesn't impact that), because fundamentally reading from the memory is the bottleneck, and we are always talking fixed memory images (RAM or "disks", that is it is a FINITE state machine) and not reading from a Turing Machine tape. -- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------
Back to comp.compilers | Previous | Next — Previous in thread | Next in thread | Find similar
RE: Integer sizes and DFAs Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-27 00:54 +0200
Re: Integer sizes and DFAs gah4 <gah4@u.washington.edu> - 2022-03-26 19:32 -0700
RE: Integer sizes and DFAs Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-27 15:02 +0300
Re: Integer sizes and DFAs gah4 <gah4@u.washington.edu> - 2022-03-26 19:45 -0700
csiph-web