Re: Detecting repeated subsequences of identical items

From Michael Selik <michael.selik@gmail.com>
Newsgroups comp.lang.python
Subject Re: Detecting repeated subsequences of identical items
Date Thu, 21 Apr 2016 06:49:52 +0000
Lines 39
Message-ID <mailman.4.1461221408.23626.python-list@python.org> (permalink)
References <571843f9$0$1585$c3e8da3$5496439d@news.astraweb.com> <CAGgTfkMFqsiqfb-bV7e10D+FNXCh-TLeSr5vP37xdffNy-a0aw@mail.gmail.com> <CAGgTfkPRaq3+s3kuSvGa7k_-tFhYSuimjcm6mfhEWqbd2vY3oQ@mail.gmail.com>



On Thu, Apr 21, 2016 at 2:35 AM Michael Selik <michael.selik@gmail.com>
wrote:

> On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano <steve@pearwood.info>
> wrote:
>
>> I want to group [repeated] subsequences. For example, I have:
>> "ABCABCABCDEABCDEFABCABCABCB"
>> and I want to group it into repeating subsequences. I can see two
>> ways... How can I do this? Does this problem have a standard name and/or
>> solution?
>>
>
> I'm not aware of a standard name. This sounds like an unsupervised
> learning problem. There's no objectively correct answer unless you add more
> specificity to the problem statement.
>
> Regexes may sound tempting at first, but because a repeating subsequence
> may itself contain nested repeating subsequences, to arbitrary depth, I
> think we need at least a pushdown automaton.
>
> I checked out some links for clustering algorithms that work on series
> subsequences and I found some fun results.
>
> Clustering is meaningless!
> http://www.cs.ucr.edu/~eamonn/meaningless.pdf
>
> I think you're in "no free lunch" territory. "Clustering of subsequence
> time series remains an open issue in time series clustering"
> http://www.hindawi.com/journals/tswj/2014/312521/
>
> Any more detail on the problem to add constraints?
>

Some light reading suggests that you can constrain your problem by defining
a minimum length for a subsequence to qualify. One paper suggests calling
these more interesting repetitions "motifs", borrowing a metaphor from
music. Looking for every repetition yields too many trivial results. Would
a minimum length be a valid constraint for your use case?
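
To illustrate the minimum-size idea, here is a minimal brute-force sketch. It
is not taken from either paper; the name find_motifs and its parameters are
hypothetical, and it assumes the input is a string (or any sequence whose
slices are hashable). It counts every contiguous subsequence of at least
min_len items and keeps those that occur at least min_count times, counting
overlapping occurrences.

```python
from collections import Counter


def find_motifs(seq, min_len=3, min_count=2):
    """Return {subsequence: count} for repeated contiguous subsequences.

    Hypothetical helper for illustration only. It is brute force (roughly
    cubic in len(seq)), so it only suits short inputs.
    """
    counts = Counter()
    n = len(seq)
    # A subsequence that occurs twice can be at most n - 1 items long.
    for length in range(min_len, n):
        for start in range(n - length + 1):
            counts[seq[start:start + length]] += 1
    # Keep only the subsequences that repeat often enough.
    return {sub: c for sub, c in counts.items() if c >= min_count}


# With a minimum length of 3, only the "interesting" repetition is left.
print(find_motifs("ABCABCX", min_len=3))        # {'ABC': 2}

# With no effective minimum, trivial repetitions crowd the result.
print(len(find_motifs("ABCABCX", min_len=1)))   # 6

motifs = find_motifs("ABCABCABCDEABCDEFABCABCABCB", min_len=3)
```

This only reports which subsequences repeat. It does not decide how to group
them, which is the part that still needs a more specific problem statement.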
