Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.compilers > #2965 > unrolled thread

Integer sizes and DFAs

Started byChristopher F Clark <christopher.f.clark@compiler-resources.com>
First post2022-03-26 17:16 +0200
Last post2022-03-26 18:39 +0000
Articles 2 — 2 participants

Back to article view | Back to comp.compilers


Contents

  Integer sizes and DFAs Christopher F Clark <christopher.f.clark@compiler-resources.com> - 2022-03-26 17:16 +0200
    Re: Integer sizes and DFAs Kaz Kylheku <480-992-1380@kylheku.com> - 2022-03-26 18:39 +0000

#2965 — Integer sizes and DFAs

FromChristopher F Clark <christopher.f.clark@compiler-resources.com>
Date2022-03-26 17:16 +0200
SubjectInteger sizes and DFAs
Message-ID<22-03-071@comp.compilers>
One of the posters mentioned that ignoring the size of integers used
in the DFAs was "not computer science". Actually, it is.  It's called
the RAM model of computing.  And, in it you assume each "integer"
occupies one location and is also sufficient to address any memory
location.  Sa, accessing any integer any where in memory is an O(1)
operation.

Note how that even works for the DNA DFAs, assuming 32 bit integers
and a byte addressed machine.  With that model, you can fit 10 million
(1E7) states in the 2 billion byte (2GB) address space (each state has
4, 4 byte entries or 16 bytes).  You might even fit something over 100
million (1E8) states there.  Thus, your memory access time is linear
in the worst case (assuming every request is a cache miss).

With 64 bit integers, one is essentially only limited by total memory
(including disk) and it is still linear although the worst case time
is every access is a page fault.  Probably not particularly practical,
but CS is about theory and how you model it.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA  01503 USA      twitter: @intel_chris
------------------------------------------------------------------------------

[toc] | [next] | [standalone]


#2966

FromKaz Kylheku <480-992-1380@kylheku.com>
Date2022-03-26 18:39 +0000
Message-ID<22-03-072@comp.compilers>
In reply to#2965
On 2022-03-26, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
> One of the posters mentioned that ignoring the size of integers used
> in the DFAs was "not computer science". Actually, it is.  It's called
> the RAM model of computing.  And, in it you assume each "integer"
> occupies one location and is also sufficient to address any memory
> location.  Sa, accessing any integer any where in memory is an O(1)
> operation.

The O notation is based on the independent variable(s) like n being able
to taken on unlimited, arbitrarily large values.

In certain common situations we model elementary operations as being
O(1), with certain justifications. That simplifies the analysis, while
leaving its results still relevant and true for practical applications
running on present hardware.

The justification needs to be articulated though.

> Note how that even works for the DNA DFAs, assuming 32 bit integers
> and a byte addressed machine.  With that model, you can fit 10 million
> (1E7) states in the 2 billion byte (2GB) address space (each state has
> 4, 4 byte entries or 16 bytes).

Each 32 bit state has 4 byte entries only if it doesn't express that
it branches to other states for certain input symbols.

A DFA state is a row in a table, not just the number which
identifies/enumerates that row.

If the input symbols are bytes, then the abstract row size capped to 256
entries.

(Larger symbol sets can exist (e.g. the example of Unicode) but
it's a reasonabl assumption that the symbol set is finite. Still, it can
be large.)

If we have a DFA built from certain lexical rules, and we extend those
rules, it could be the case that existing states blow up in size,
because of having to branch to many more other states on new characters
that were not included in the old rules.

If a machine initially recongnizes the set of strings { "foo", "bar" }
then its state zero might have a condensed representation of two
transitions on the sybmols 'f' and 'b', such as a tiny two-element
array that is subject to a linear search.

If we extend the recognized set of strings, state 0 may have to branch
on more symbols than just 'f' and 'b'. Depending on how we are
representing states, the size of the object may change.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[No existing computer is actually Turing equivalent because every real
computer has a finite memory, but we manage to do O(whatever) calculations
anyway.  I'm not sure how productive this line of argument is beyond noting
what sorts of tasks are or aren't likely to fit in real computers. -John]

[toc] | [prev] | [standalone]


Back to top | Article view | comp.compilers


csiph-web