Path: csiph.com!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups: comp.compilers
Subject: Re: Integer sizes and DFAs
Date: Sat, 26 Mar 2022 18:39:53 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 57
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-03-072@comp.compilers>
References: <22-03-071@comp.compilers>
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="19106"; mail-complaints-to="abuse@iecc.com"
Keywords: performance, architecture, comment
Posted-Date: 26 Mar 2022 14:49:17 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Xref: csiph.com comp.compilers:2966

On 2022-03-26, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
> One of the posters mentioned that ignoring the size of integers used
> in the DFAs was "not computer science". Actually, it is.  It's called
> the RAM model of computing.  And, in it you assume each "integer"
> occupies one location and is also sufficient to address any memory
> location.  Sa, accessing any integer any where in memory is an O(1)
> operation.

The O notation is based on the independent variable(s) like n being able
to taken on unlimited, arbitrarily large values.

In certain common situations we model elementary operations as being
O(1), with certain justifications. That simplifies the analysis, while
leaving its results still relevant and true for practical applications
running on present hardware.

The justification needs to be articulated though.

> Note how that even works for the DNA DFAs, assuming 32 bit integers
> and a byte addressed machine.  With that model, you can fit 10 million
> (1E7) states in the 2 billion byte (2GB) address space (each state has
> 4, 4 byte entries or 16 bytes).

Each 32 bit state has 4 byte entries only if it doesn't express that
it branches to other states for certain input symbols.

A DFA state is a row in a table, not just the number which
identifies/enumerates that row.

If the input symbols are bytes, then the abstract row size capped to 256
entries.

(Larger symbol sets can exist (e.g. the example of Unicode) but
it's a reasonabl assumption that the symbol set is finite. Still, it can
be large.)

If we have a DFA built from certain lexical rules, and we extend those
rules, it could be the case that existing states blow up in size,
because of having to branch to many more other states on new characters
that were not included in the old rules.

If a machine initially recongnizes the set of strings { "foo", "bar" }
then its state zero might have a condensed representation of two
transitions on the sybmols 'f' and 'b', such as a tiny two-element
array that is subject to a linear search.

If we extend the recognized set of strings, state 0 may have to branch
on more symbols than just 'f' and 'b'. Depending on how we are
representing states, the size of the object may change.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[No existing computer is actually Turing equivalent because every real
computer has a finite memory, but we manage to do O(whatever) calculations
anyway.  I'm not sure how productive this line of argument is beyond noting
what sorts of tasks are or aren't likely to fit in real computers. -John]