Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!news.chainon-marquant.org!news-transit.tcx.org.uk!rt.uk.eu.org!tr22g12.aset.psu.edu!usenet.stanford.edu!news.iecc.com!nerds-end From: Andrew Tomazos Newsgroups: comp.compilers Subject: Expected Token Density in Random Stream Date: Wed, 7 Dec 2011 15:06:23 -0800 (PST) Organization: Compilers Central Lines: 230 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <11-12-015@comp.compilers> NNTP-Posting-Host: news.iecc.com X-Trace: leila.iecc.com 1323609083 76892 64.57.183.58 (11 Dec 2011 13:11:23 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Sun, 11 Dec 2011 13:11:23 +0000 (UTC) Keywords: parse Posted-Date: 11 Dec 2011 08:11:23 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: x330-a1.tempe.blueboxinc.net comp.compilers:387 Summary: We want to find out how often a given token appears in a random stream formed by concatenating randomly chosen strings from a given set of strings. Details: Given S, an array of n (non-empty) strings; and T, a string of length m; We create a random stream of characters by the following process: 1. assign i = a (uniformly) random integer between 1 and n inclusive 2. write the characters of the ith element of S to the stream 3. goto 1 (The elements of S are not necessarily the same length) For example: if S = {"a", "bug", "ug"} then the stream may start as follows: augbugugugaabug... Each time T appears in the stream we call that a hit. More precisely: 1. Initialize a queue Q with m null elements 2. If Q equals T record a hit 3. Pop front of Q 4. Push to back of Q next char from stream 5. Goto 2 For example: If T = "ugu"... then: augbugugugaabug... 1 2 We get two hits so far. (Note hits can overlap each other) What is the expected frequency of hits in terms of S and T? More precisely: let y be the number of chars read from the stream so far let x be the number of hits What is the expected limit of x/y as y approaches infinity? We can approximate this empirically as follows in C++: $ cat > HitFinder.cpp $ g++ -o HitFinder HitFinder.cpp $ ./HitFinder ugu a bug ug T = ugu A = {a, bug, ug} (x, y, x/y) = (0, 1, 0) (x, y, x/y) = (0, 2, 0) (x, y, x/y) = (0, 4, 0) (x, y, x/y) = (0, 8, 0) (x, y, x/y) = (1, 16, 1/16) (x, y, x/y) = (3, 32, 1/10.6667) (x, y, x/y) = (8, 64, 1/8) (x, y, x/y) = (16, 128, 1/8) ... (x, y, x/y) = (29821708, 268435456, 1/9.00134) (x, y, x/y) = (59648899, 536870912, 1/9.00052) (x, y, x/y) = (119304725, 1073741824, 1/8.99999) (x, y, x/y) = (238593314, 2147483648, 1/9.0006) (x, y, x/y) = (477198099, 4294967296, 1/9.00039) As you can see the answer for this converges on 1/9 How can we calculate this figure by direct computation? ie How would you implement a function with signature... double ExpectedHitLimit(string T, vector A) ...that quickly calculates this limit for any given T and A? Thanks and have fun, Andrew. -- Andrew Tomazos // ======================= HitFinder.cpp Cut Here ======================== // (C) 2011, Andrew Tomazos . All Rights Reserved. #include #include #include #include #include using namespace std; typedef long long int64; bool IsPow2(int64 i) { return i != 0 && !(i & (i - 1)); } int64 x = 0; void Hit() { x++; } int Rand(int n) { return rand() % n; } class RandCharStream { public: vector S; size_t n, istr, ichar; RandCharStream(vector S_) : S(S_), n(S.size()), istr(Rand(n)), ichar(0) { } char nextChar() { if (ichar == S[istr].size()) { istr = Rand(n); ichar = 0; } return S[istr][ichar++]; } }; class HitFinder { public: RandCharStream& stream; queue T; int m; queue Q; HitFinder(RandCharStream& stream_, string T_) : stream(stream_), m(T_.size()) { for (int i = 0; i < m; i++) { T.push(T_[i]); Q.push('\0'); } } void nextChar() { Q.pop(); Q.push(stream.nextChar()); if (Q == T) Hit(); } }; int main(int argc, char** argv) { if (argc < 3) { cerr << "Usage: " << argv[0] << " ... " << endl; return -1; } string T = argv[1]; cout << "T = " << T << endl; cout << "A = {"; vector A; for (int i = 2; i < argc; i++) { string s = argv[i]; cout << s << (i < argc - 1 ? ", " : ""); A.push_back(s); if (s.empty()) { cerr << "Element S" << i - 1 << " is empty" << endl; return -1; } } cout << "}" << endl; cout << endl; RandCharStream stream(A); HitFinder finder(stream, T); for (int64 y = 1; y < __LONG_LONG_MAX__; y++) { finder.nextChar(); if (IsPow2(y)) { cout << "(x, y, x/y) = (" << x << ", " << y << ", "; if (x == 0) cout << "0"; else cout << "1/" << double(y) / double(x); cout << ")" << endl; } } return 0; } //======================= HitFinder.cpp Cut Here ========================