Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!border3.nntp.dca.giganews.com!border1.nntp.dca.giganews.com!nntp.giganews.com!novia!news-out.readnews.com!news-xxxfer.readnews.com!news.misty.com!news.iecc.com!nerds-end From: Gene Newsgroups: comp.compilers Subject: Re: Expected Token Density in Random Stream Date: Mon, 19 Dec 2011 17:48:29 -0800 (PST) Organization: Compilers Central Lines: 67 Sender: news@iecc.com Approved: comp.compilers@iecc.com Message-ID: <11-12-027@comp.compilers> References: <11-12-015@comp.compilers> NNTP-Posting-Host: news.iecc.com X-Trace: leila.iecc.com 1324360430 27771 64.57.183.58 (20 Dec 2011 05:53:50 GMT) X-Complaints-To: abuse@iecc.com NNTP-Posting-Date: Tue, 20 Dec 2011 05:53:50 +0000 (UTC) Keywords: theory Posted-Date: 20 Dec 2011 00:53:50 EST X-submission-address: compilers@iecc.com X-moderator-address: compilers-request@iecc.com X-FAQ-and-archives: http://compilers.iecc.com Xref: x330-a1.tempe.blueboxinc.net comp.compilers:399 On Dec 8, 12:06 am, Andrew Tomazos > Summary: We want to find out how often a given token appears in a > random stream formed by concatenating randomly chosen strings from a > given set of strings. > > Details: > > Given S, an array of n (non-empty) strings; and T, a string of length > m; > > We create a random stream of characters by the following process: > > 1. assign i = a (uniformly) random integer between 1 and n > inclusive > 2. write the characters of the ith element of S to the stream > 3. goto 1 > > (The elements of S are not necessarily the same length) > > For example: > > if S = {"a", "bug", "ug"} > > then the stream may start as follows: > > augbugugugaabug... > > Each time T appears in the stream we call that a hit. More precisely: > > 1. Initialize a queue Q with m null elements > > 2. If Q equals T record a hit > 3. Pop front of Q > 4. Push to back of Q next char from stream > 5. Goto 2 > > For example: > > If T = "ugu"... > > then: > > augbugugugaabug... > 1 2 > > We get two hits so far. > > (Note hits can overlap each other) > > What is the expected frequency of hits in terms of S and T? Some half-baked ideas. No warranty expressed or implied: I think you'll need to analyze S exhaustively to enumerate all S index sequences that can produce hits. This seems something like genome matching: string search to find covers. Prefix matching algorithms, tries and suffix trees might be helpful if T and S members can be long. Use the index sequences to build a Mealy machine (DFA) that consumes a stream of indices and produces respective numbers of hits. You should be able to analyze this machine as a Markov chain to get how much time is spent in each state, consequently how many hits are being generated per index of input. You'll have to adjust this to get hits per character rather than hits per index, but that should be straightforward. At worst you'd have to split states outputing non- zero so that all incoming transitions are on indexes of strings having the same length. But there might be an easier way.