Groups > comp.compilers > #387 > unrolled thread

Expected Token Density in Random Stream

Started by	Andrew Tomazos <andrew@tomazos.com>
First post	2011-12-07 15:06 -0800
Last post	2011-12-19 17:48 -0800
Articles	4 — 3 participants

Back to article view | Back to comp.compilers

  Expected Token Density in Random Stream Andrew Tomazos <andrew@tomazos.com> - 2011-12-07 15:06 -0800
    Re: Expected Token Density in Random Stream Kaz Kylheku <kaz@kylheku.com> - 2011-12-11 17:56 +0000
      Re: Expected Token Density in Random Stream Andrew Tomazos <andrew@tomazos.com> - 2011-12-13 06:00 -0800
    Re: Expected Token Density in Random Stream Gene <gene.ressler@gmail.com> - 2011-12-19 17:48 -0800

#387 — Expected Token Density in Random Stream

From	Andrew Tomazos <andrew@tomazos.com>
Date	2011-12-07 15:06 -0800
Subject	Expected Token Density in Random Stream
Message-ID	<11-12-015@comp.compilers>

Summary: We want to find out how often a given token appears in a
random stream formed by concatenating randomly chosen strings from a
given set of strings.

Details:

Given S, an array of n (non-empty) strings; and T, a string of length
m;

We create a random stream of characters by the following process:

    1. assign i = a (uniformly) random integer between 1 and n
inclusive
    2. write the characters of the ith element of S to the stream
    3. goto 1

(The elements of S are not necessarily the same length)

For example:

    if S = {"a", "bug", "ug"}

then the stream may start as follows:

    augbugugugaabug...

Each time T appears in the stream we call that a hit.  More precisely:

   1. Initialize a queue Q with m null elements

   2. If Q equals T record a hit
   3. Pop front of Q
   4. Push to back of Q next char from stream
   5. Goto 2

For example:

   If T = "ugu"...

then:

    augbugugugaabug...
        1 2

We get two hits so far.

(Note hits can overlap each other)

What is the expected frequency of hits in terms of S and T?

More precisely:

    let y be the number of chars read from the stream so far
    let x be the number of hits

    What is the expected limit of x/y as y approaches infinity?

We can approximate this empirically as follows in C++:

   $ cat > HitFinder.cpp
     <paste below code>
   $ g++ -o HitFinder HitFinder.cpp
   $ ./HitFinder ugu a bug ug
T = ugu
A = {a, bug, ug}

(x, y, x/y) = (0, 1, 0)
(x, y, x/y) = (0, 2, 0)
(x, y, x/y) = (0, 4, 0)
(x, y, x/y) = (0, 8, 0)
(x, y, x/y) = (1, 16, 1/16)
(x, y, x/y) = (3, 32, 1/10.6667)
(x, y, x/y) = (8, 64, 1/8)
(x, y, x/y) = (16, 128, 1/8)
 ...
(x, y, x/y) = (29821708, 268435456, 1/9.00134)
(x, y, x/y) = (59648899, 536870912, 1/9.00052)
(x, y, x/y) = (119304725, 1073741824, 1/8.99999)
(x, y, x/y) = (238593314, 2147483648, 1/9.0006)
(x, y, x/y) = (477198099, 4294967296, 1/9.00039)

As you can see the answer for this converges on 1/9

How can we calculate this figure by direct computation?

ie How would you implement a function with signature...

   double ExpectedHitLimit(string T, vector<string> A)

...that quickly calculates this limit for any given T and A?

Thanks and have fun,
Andrew.

--
Andrew Tomazos <andrew@tomazos.com> <http://www.tomazos.com>


// ======================= HitFinder.cpp Cut Here ========================
// (C) 2011, Andrew Tomazos <andrew@tomazos.com>.  All Rights Reserved.

#include <queue>
#include <string>
#include <iostream>
#include <limits>
#include <cstdlib>

using namespace std;

typedef long long int64;
bool IsPow2(int64 i)
{
	return i != 0 && !(i & (i - 1));
}

int64 x = 0;

void Hit()
{
	x++;
}

int Rand(int n)
{
	return rand() % n;
}

class RandCharStream
{
public:
	vector<string> S;
	size_t n, istr, ichar;

	RandCharStream(vector<string> S_) :
		S(S_),
		n(S.size()),
		istr(Rand(n)),
		ichar(0)
	{
	}

	char nextChar()
	{
		if (ichar == S[istr].size())
		{
			istr = Rand(n);
			ichar = 0;
		}

		return S[istr][ichar++];
	}
};

class HitFinder
{
public:
	RandCharStream& stream;
	queue<char> T;
	int m;
	queue<char> Q;

	HitFinder(RandCharStream& stream_, string T_) :
		stream(stream_),
		m(T_.size())
	{
		for (int i = 0; i < m; i++)
		{
			T.push(T_[i]);
			Q.push('\0');
		}
	}

	void nextChar()
	{
		Q.pop();
		Q.push(stream.nextChar());

		if (Q == T)
			Hit();
	}
};

int main(int argc, char** argv)
{
	if (argc < 3)
	{
		cerr << "Usage: " << argv[0] << " <T> <S1> <S2> ... <Sn>" << endl;
		return -1;
	}

	string T = argv[1];
	cout << "T = " << T << endl;

	cout << "A = {";
	vector<string> A;
	for (int i = 2; i < argc; i++)
	{
		string s = argv[i];
		cout << s << (i < argc - 1 ? ", " : "");
		A.push_back(s);
		if (s.empty())
		{
			cerr << "Element S" << i - 1 << " is empty" << endl;
			return -1;
		}
	}
	cout << "}" << endl;
	cout << endl;

	RandCharStream stream(A);
	HitFinder finder(stream, T);

	for (int64 y = 1; y < __LONG_LONG_MAX__; y++)
	{
		finder.nextChar();

		if (IsPow2(y))
		{
			cout << "(x, y, x/y) = (" << x << ", " << y << ", ";
			if (x == 0)
				cout << "0";
			else
				cout << "1/" << double(y) / double(x);
			cout << ")" << endl;
		}
	}

	return 0;
}
//======================= HitFinder.cpp Cut Here ========================

[toc] | [next] | [standalone]

#388

From	Kaz Kylheku <kaz@kylheku.com>
Date	2011-12-11 17:56 +0000
Message-ID	<11-12-016@comp.compilers>
In reply to	#387

On 2011-12-07, Andrew Tomazos <andrew@tomazos.com> wrote:
> Summary: We want to find out how often a given token appears in a
> random stream formed by concatenating randomly chosen strings from a
> given set of strings.

> (Note hits can overlap each other)

But tokens do not overlap, so you're not actually extracting tokens.  Using C
tokens as an example, the C token >>= is one hit, not four.  The longest match
calls for extracting three characters and moving on.

[toc] | [prev] | [next] | [standalone]

#391

From	Andrew Tomazos <andrew@tomazos.com>
Date	2011-12-13 06:00 -0800
Message-ID	<11-12-019@comp.compilers>
In reply to	#388

On Dec 11, 6:56 pm, Kaz Kylheku <k...@kylheku.com> wrote:
> On 2011-12-07, Andrew Tomazos <and...@tomazos.com> wrote:
>
> > Summary: We want to find out how often a given token appears in a
> > random stream formed by concatenating randomly chosen strings from a
> > given set of strings.
> > (Note hits can overlap each other)
>
> But tokens do not overlap, so you're not actually extracting tokens.  Using
> C tokens as an example, the C token >>= is one hit, not four.  The longest
> match calls for extracting three characters and moving on.

Substitute occurrences of the word "token" in my post for "key
string" (or just "string") and reinterpret.
  -Andrew.
[I suppose, but finding tokens would be more interesting. -John]

[toc] | [prev] | [next] | [standalone]

#399

From	Gene <gene.ressler@gmail.com>
Date	2011-12-19 17:48 -0800
Message-ID	<11-12-027@comp.compilers>
In reply to	#387

On Dec 8, 12:06 am, Andrew Tomazos
> Summary: We want to find out how often a given token appears in a
> random stream formed by concatenating randomly chosen strings from a
> given set of strings.
>
> Details:
>
> Given S, an array of n (non-empty) strings; and T, a string of length
> m;
>
> We create a random stream of characters by the following process:
>
>     1. assign i = a (uniformly) random integer between 1 and n
> inclusive
>     2. write the characters of the ith element of S to the stream
>     3. goto 1
>
> (The elements of S are not necessarily the same length)
>
> For example:
>
>     if S = {"a", "bug", "ug"}
>
> then the stream may start as follows:
>
>     augbugugugaabug...
>
> Each time T appears in the stream we call that a hit.  More precisely:
>
>    1. Initialize a queue Q with m null elements
>
>    2. If Q equals T record a hit
>    3. Pop front of Q
>    4. Push to back of Q next char from stream
>    5. Goto 2
>
> For example:
>
>    If T = "ugu"...
>
> then:
>
>     augbugugugaabug...
>         1 2
>
> We get two hits so far.
>
> (Note hits can overlap each other)
>
> What is the expected frequency of hits in terms of S and T?

Some half-baked ideas.  No warranty expressed or implied:

I think you'll need to analyze S exhaustively to enumerate all S index
sequences that can produce hits.  This seems something like genome
matching: string search to find covers. Prefix matching algorithms,
tries and suffix trees might be helpful if T and S members can be
long.  Use the index sequences to build a Mealy machine (DFA) that
consumes a stream of indices and produces respective numbers of hits.
You should be able to analyze this machine as a Markov chain to get
how much time is spent in each state, consequently how many hits are
being generated per index of input.  You'll have to adjust this to get
hits per character rather than hits per index, but that should be
straightforward.  At worst you'd have to split states outputing non-
zero so that all incoming transitions are on indexes of strings having
the same length.  But there might be an easier way.

[toc] | [prev] | [standalone]

csiph-web

Expected Token Density in Random Stream

Contents

#387 — Expected Token Density in Random Stream

#388

#391

#399