Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Michael Selik <michael.selik@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Detecting repeated subsequences of identical items
Date: Thu, 21 Apr 2016 06:49:52 +0000
Lines: 39
Message-ID: <mailman.4.1461221408.23626.python-list@python.org>
References: <571843f9$0$1585$c3e8da3$5496439d@news.astraweb.com> <CAGgTfkMFqsiqfb-bV7e10D+FNXCh-TLeSr5vP37xdffNy-a0aw@mail.gmail.com> <CAGgTfkPRaq3+s3kuSvGa7k_-tFhYSuimjcm6mfhEWqbd2vY3oQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <CAGgTfkMFqsiqfb-bV7e10D+FNXCh-TLeSr5vP37xdffNy-a0aw@mail.gmail.com>
Precedence: list
Xref: csiph.com comp.lang.python:107438

On Thu, Apr 21, 2016 at 2:35 AM Michael Selik <michael.selik@gmail.com>
wrote:

> On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano <steve@pearwood.info>
> wrote:
>
>> I want to group [repeated] subsequences. For example, I have:
>> "ABCABCABCDEABCDEFABCABCABCB"
>> and I want to group it into repeating subsequences. I can see two
>> ways... How can I do this? Does this problem have a standard name and/or
>> solution?
>>
>
> I'm not aware of a standard name. This sounds like an unsupervised
> learning problem. There's no objectively correct answer unless you add more
> specificity to the problem statement.
>
> Regexes may sound tempting at first, but because a repeating subsequence
> may have nested repeating subsequences and this can go on infinitely, I
> think we at least need a push-down automata.
>
> I checked out some links for clustering algorithms that work on series
> subsequences and I found some fun results.
>
> Clustering is meaningless!
> http://www.cs.ucr.edu/~eamonn/meaningless.pdf
>
> I think you're in "no free lunch" territory. "Clustering of subsequence
> time series remains an open issue in time series clustering"
> http://www.hindawi.com/journals/tswj/2014/312521/
>
> Any more detail on the problem to add constraints?
>

Some light reading suggests that you can improve your problem by defining a
minimum size for a subsequence to qualify. One paper suggests calling these
more interesting repetitions a "motif" to use a music metaphor. Looking for
any repetitions results in too many trivial results. Is that valid for your
usage?