Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.compilers > #387

Expected Token Density in Random Stream

From Andrew Tomazos <andrew@tomazos.com>
Newsgroups comp.compilers
Subject Expected Token Density in Random Stream
Date 2011-12-07 15:06 -0800
Organization Compilers Central
Message-ID <11-12-015@comp.compilers> (permalink)

Show all headers | View raw


Summary: We want to find out how often a given token appears in a
random stream formed by concatenating randomly chosen strings from a
given set of strings.

Details:

Given S, an array of n (non-empty) strings; and T, a string of length
m;

We create a random stream of characters by the following process:

    1. assign i = a (uniformly) random integer between 1 and n
inclusive
    2. write the characters of the ith element of S to the stream
    3. goto 1

(The elements of S are not necessarily the same length)

For example:

    if S = {"a", "bug", "ug"}

then the stream may start as follows:

    augbugugugaabug...

Each time T appears in the stream we call that a hit.  More precisely:

   1. Initialize a queue Q with m null elements

   2. If Q equals T record a hit
   3. Pop front of Q
   4. Push to back of Q next char from stream
   5. Goto 2

For example:

   If T = "ugu"...

then:

    augbugugugaabug...
        1 2

We get two hits so far.

(Note hits can overlap each other)

What is the expected frequency of hits in terms of S and T?

More precisely:

    let y be the number of chars read from the stream so far
    let x be the number of hits

    What is the expected limit of x/y as y approaches infinity?

We can approximate this empirically as follows in C++:

   $ cat > HitFinder.cpp
     <paste below code>
   $ g++ -o HitFinder HitFinder.cpp
   $ ./HitFinder ugu a bug ug
T = ugu
A = {a, bug, ug}

(x, y, x/y) = (0, 1, 0)
(x, y, x/y) = (0, 2, 0)
(x, y, x/y) = (0, 4, 0)
(x, y, x/y) = (0, 8, 0)
(x, y, x/y) = (1, 16, 1/16)
(x, y, x/y) = (3, 32, 1/10.6667)
(x, y, x/y) = (8, 64, 1/8)
(x, y, x/y) = (16, 128, 1/8)
 ...
(x, y, x/y) = (29821708, 268435456, 1/9.00134)
(x, y, x/y) = (59648899, 536870912, 1/9.00052)
(x, y, x/y) = (119304725, 1073741824, 1/8.99999)
(x, y, x/y) = (238593314, 2147483648, 1/9.0006)
(x, y, x/y) = (477198099, 4294967296, 1/9.00039)

As you can see the answer for this converges on 1/9

How can we calculate this figure by direct computation?

ie How would you implement a function with signature...

   double ExpectedHitLimit(string T, vector<string> A)

...that quickly calculates this limit for any given T and A?

Thanks and have fun,
Andrew.

--
Andrew Tomazos <andrew@tomazos.com> <http://www.tomazos.com>


// ======================= HitFinder.cpp Cut Here ========================
// (C) 2011, Andrew Tomazos <andrew@tomazos.com>.  All Rights Reserved.

#include <queue>
#include <string>
#include <iostream>
#include <limits>
#include <cstdlib>

using namespace std;

typedef long long int64;
bool IsPow2(int64 i)
{
	return i != 0 && !(i & (i - 1));
}

int64 x = 0;

void Hit()
{
	x++;
}

int Rand(int n)
{
	return rand() % n;
}

class RandCharStream
{
public:
	vector<string> S;
	size_t n, istr, ichar;

	RandCharStream(vector<string> S_) :
		S(S_),
		n(S.size()),
		istr(Rand(n)),
		ichar(0)
	{
	}

	char nextChar()
	{
		if (ichar == S[istr].size())
		{
			istr = Rand(n);
			ichar = 0;
		}

		return S[istr][ichar++];
	}
};

class HitFinder
{
public:
	RandCharStream& stream;
	queue<char> T;
	int m;
	queue<char> Q;

	HitFinder(RandCharStream& stream_, string T_) :
		stream(stream_),
		m(T_.size())
	{
		for (int i = 0; i < m; i++)
		{
			T.push(T_[i]);
			Q.push('\0');
		}
	}

	void nextChar()
	{
		Q.pop();
		Q.push(stream.nextChar());

		if (Q == T)
			Hit();
	}
};

int main(int argc, char** argv)
{
	if (argc < 3)
	{
		cerr << "Usage: " << argv[0] << " <T> <S1> <S2> ... <Sn>" << endl;
		return -1;
	}

	string T = argv[1];
	cout << "T = " << T << endl;

	cout << "A = {";
	vector<string> A;
	for (int i = 2; i < argc; i++)
	{
		string s = argv[i];
		cout << s << (i < argc - 1 ? ", " : "");
		A.push_back(s);
		if (s.empty())
		{
			cerr << "Element S" << i - 1 << " is empty" << endl;
			return -1;
		}
	}
	cout << "}" << endl;
	cout << endl;

	RandCharStream stream(A);
	HitFinder finder(stream, T);

	for (int64 y = 1; y < __LONG_LONG_MAX__; y++)
	{
		finder.nextChar();

		if (IsPow2(y))
		{
			cout << "(x, y, x/y) = (" << x << ", " << y << ", ";
			if (x == 0)
				cout << "0";
			else
				cout << "1/" << double(y) / double(x);
			cout << ")" << endl;
		}
	}

	return 0;
}
//======================= HitFinder.cpp Cut Here ========================

Back to comp.compilers | Previous | NextNext in thread | Find similar


Thread

Expected Token Density in Random Stream Andrew Tomazos <andrew@tomazos.com> - 2011-12-07 15:06 -0800
  Re: Expected Token Density in Random Stream Kaz Kylheku <kaz@kylheku.com> - 2011-12-11 17:56 +0000
    Re: Expected Token Density in Random Stream Andrew Tomazos <andrew@tomazos.com> - 2011-12-13 06:00 -0800
  Re: Expected Token Density in Random Stream Gene <gene.ressler@gmail.com> - 2011-12-19 17:48 -0800

csiph-web