Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail From: Michael Selik Newsgroups: comp.lang.python Subject: Re: Detecting repeated subsequences of identical items Date: Thu, 21 Apr 2016 07:05:29 +0000 Lines: 24 Message-ID: References: <571843f9$0$1585$c3e8da3$5496439d@news.astraweb.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 X-Trace: news.uni-berlin.de kNGN27XfkYjnRAGaFFQLYAAFDFT+EJ+hhpthPJHtpeRA== Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.007 X-Spam-Evidence: '*H*': 0.99; '*S*': 0.00; 'patterns': 0.04; '21,': 0.07; 'repeated': 0.07; 'thu,': 0.15; '2016': 0.16; 'received:io': 0.16; 'received:psf.io': 0.16; 'repetitions': 0.16; 'subject:Detecting': 0.16; 'subsequences': 0.16; 'wrote:': 0.16; 'obviously': 0.16; 'nested': 0.18; '>': 0.18; 'email addr:gmail.com>': 0.18; 'to:2**1': 0.21; 'context.': 0.22; 'latter': 0.22; 'seems': 0.23; 'header:In-Reply-To:1': 0.24; 'skip:" 20': 0.26; 'right.': 0.27; 'message-id:@mail.gmail.com': 0.27; 'grouping': 0.29; 'skip:& 30': 0.30; 'maybe': 0.33; 'problem': 0.33; "d'aprano": 0.33; 'steven': 0.33; 'skip:& 20': 0.35; 'received:google.com': 0.35; 'received:74.125.82': 0.35; 'but': 0.36; 'should': 0.36; 'depends': 0.36; 'to:addr:python- list': 0.36; 'subject:: ': 0.37; 'say': 0.37; 'missing': 0.37; 'rather': 0.39; 'to:addr:python.org': 0.40; 'more': 0.63; 'biggest': 0.67; 'or:': 0.84; 'to:name:python': 0.84; 'abc': 0.91 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=ml/gewk3bIli0eIUPsxcgwpnnkdBniF64fYn+Bus9Kw=; b=UdnEpDrzWVusmuf6eIrwan469WcQ1Yl0rQgeO6vHQFeR8hifwVgpu2ql2ulaH7XQY+ 7LVuN6uxME/d3Kl7Biq3/HMazgSPfXjfWDdLWy7B1tWNno6TRiBJHrRIUhLxurCiqIP1 F8UFzAB8nkZM5dqHmfy4QM/jz1YwgXHDMj6QsK19H8Yan5Po+xqBqq5f7KWxhdiW7SJ7 CVzaxTACG2VuWIJJxuPAbuTXGXLcSHngqMGTNcUgdza5dVFAvQpDOLYPJ3s6e55yWqlU rAQ5JVatBcsXWDYlZWEr3JFH3ILGZ6+QP0Px1Te2r8rBAH0nDWsPRr6dqAUEXMBxw8ra SfFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=ml/gewk3bIli0eIUPsxcgwpnnkdBniF64fYn+Bus9Kw=; b=OTl6GgsEjqI6S6Eha5Gt5AiWb6EGfIr/BvkXxY7+T98q1FRXvnJUv6m6Liv0eyKubS jzlLQwCFoMNnnjv0zpucaym6/PuozwiJB5kvdKHlPiOdZUpTDux5OwOF/pvpMS1nOACm Fjce+E9jpV0paZgJYsnX4DYTpCKTZxvVTW9St0a0l0DRH0x2USEuqB0b1zO2C2aqWHD+ RrfLgH7H68EIPUtw8KOZjruz98LqNxmA5wb0pA5T3Q0xxS9ChB05UZd9REOIgcwTBcoR ix9m1diBDB8rqLreU5sEE9Dmpe2cBoKOBxjKK44Tkj+YE2s2Z0CrqxmtXIvLC92QhW4j Eb5w== X-Gm-Message-State: AOPr4FVwrtXen148lSpuaTLvJaxJTpVUX7YqlXuea+qvHEDbKM4cP0Sp9C/rKeXRS3uXNTlb4XQqzCTVfkVcrw== X-Received: by 10.194.184.38 with SMTP id er6mr11935306wjc.93.1461222339548; Thu, 21 Apr 2016 00:05:39 -0700 (PDT) In-Reply-To: X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Mailman-Original-Message-ID: X-Mailman-Original-References: <571843f9$0$1585$c3e8da3$5496439d@news.astraweb.com> Xref: csiph.com comp.lang.python:107440 On Thu, Apr 21, 2016 at 2:55 AM Vlastimil Brom wrote: > 2016-04-21 5:07 GMT+02:00 Steven D'Aprano : > > I want to group subsequences. > > "ABCABCABCDEABCDEFABCABCABCB" > > ABC ABC ABCDE ABCDE F ABC ABC ABC B > > or: > > ABC ABC ABC D E A B C D E F ABC ABC ABC B > > if I am not missing something, the latter form of grouping might be > achieved with the following regex: [snip] > The former one seems to be more tricky... > Right. If the problem is constrained to say that repeated subsequences can have no nested repeated subsequences, it's much easier to solve. If you had "ABCABCABCABC" should that result in ABC ABC ABC ABC, with 4 repetitions or ABCABC ABCABC with 2 repetitions? In this example, one might say the higher count is obviously better, but I think it depends on the context. Maybe the user is looking for the biggest patterns rather than the biggest counts.