Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Vlastimil Brom <vlastimil.brom@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Detecting repeated subsequences of identical items
Date: Thu, 21 Apr 2016 08:54:04 +0200
Lines: 105
Message-ID: <mailman.5.1461221647.23626.python-list@python.org>
References: <571843f9$0$1585$c3e8da3$5496439d@news.astraweb.com> <CAHzaPEPGvxoMAO7f0=DLBDf=dS1AhTP55Pzm_5ANbkOQ-sZAUA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <571843f9$0$1585$c3e8da3$5496439d@news.astraweb.com>
Precedence: list
Xref: csiph.com comp.lang.python:107439

2016-04-21 5:07 GMT+02:00 Steven D'Aprano <steve@pearwood.info>:
> I want to group repeated items in a sequence. For example, I can group
> repeated sequences of a single item at a time using groupby:
>
>
> from itertools import groupby
> for key, group in groupby("AAAABBCDDEEEFFFF"):
>     group = list(group)
>     print(key, "count =", len(group))
>
>
> outputs:
>
> A count = 4
> B count = 2
> C count = 1
> D count = 2
> E count = 3
> F count = 4
>
>
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 2
> (ABCDE) count = 2
> F count = 1
> (ABC) count = 3
> B count = 1
>
>
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 3
> D count = 1
> E count = 1
> A count = 1
> B count = 1
> C count = 1
> D count = 1
> E count = 1
> F count = 1
> (ABC) count = 3
> B count = 1
>
>
>
> How can I do this? Does this problem have a standard name and/or solution?
>
>
>
>
> --
> Steven

Hi,
if I am not missing something, the latter form of grouping can be
achieved with the following regex (it assumes that the items are
characters matched by \w):

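import re

t = "ABCABCABCDEABCDEFABCABCABCB"
# Group 1 is the whole run; group 2 is the repeated unit (empty for a single item).
grouped = re.findall(r"((?:(\w+?)\2+)|\w+?)", t)
print(grouped)
for grp, subseq in grouped:
    if subseq:
        # Number of times the unit occurs within the run.
        print(subseq, grp.count(subseq))
    else:
        print(grp, 1)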


The printed output is:

[('ABCABCABC', 'ABC'), ('D', ''), ('E', ''), ('A', ''), ('B', ''),
('C', ''), ('D', ''), ('E', ''), ('F', ''), ('ABCABCABC', 'ABC'),
('B', '')]
ABC 3
D 1
E 1
A 1
B 1
C 1
D 1
E 1
F 1
ABC 3
B 1
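
If the items are not word characters (e.g. a list of numbers), the same
greedy idea can be written without a regex. The following is only a
rough sketch; the name group_repeats is made up for illustration. At
each position it takes the shortest unit that immediately repeats, with
as many repeats as possible, and otherwise emits a single item:

```python
def group_repeats(seq):
    """Greedy left-to-right grouping into (unit, count) pairs."""
    result = []
    i, n = 0, len(seq)
    while i < n:
        # Try the shortest candidate unit first, like the lazy \w+? above.
        for size in range(1, (n - i) // 2 + 1):
            unit = seq[i:i + size]
            count = 1
            # Extend the run while the next slice equals the unit.
            while seq[i + count * size:i + (count + 1) * size] == unit:
                count += 1
            if count > 1:
                result.append((unit, count))
                i += count * size
                break
        else:
            # No unit repeats at this position: emit a single item.
            result.append((seq[i:i + 1], 1))
            i += 1
    return result


for unit, count in group_repeats("ABCABCABCDEABCDEFABCABCABCB"):
    print(unit, count)
```

For the example string this should give the same units and counts as
the regex; it also accepts lists, e.g. group_repeats([1, 2, 1, 2, 3]).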

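The former grouping seems trickier: the regex above always takes the
shortest repeating unit at the earliest position, so it consumes all
three ABC repeats before ABCDE can be considered.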
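
One possible approach to the former grouping, under an assumption that
the original question does not state: prefer the partition with the
fewest groups, where each group is either a single item or a unit
repeated at least twice. A rough dynamic-programming sketch follows
(fewest_groups is a made-up name, and this is not claimed to be a
standard solution):

```python
def fewest_groups(seq):
    """Partition seq into the fewest (unit, count) groups.

    Assumption: a group is a single item (count 1) or a unit
    repeated at least twice.
    """
    n = len(seq)
    # best[i] holds a smallest list of groups covering seq[i:].
    best = [None] * (n + 1)
    best[n] = []
    for i in range(n - 1, -1, -1):
        # Option 1: a single item followed by the best tail.
        candidate = [(seq[i:i + 1], 1)] + best[i + 1]
        # Option 2: every unit size and every repeat count >= 2.
        for size in range(1, (n - i) // 2 + 1):
            unit = seq[i:i + size]
            count = 1
            while seq[i + count * size:i + (count + 1) * size] == unit:
                count += 1
                option = [(unit, count)] + best[i + count * size]
                if len(option) < len(candidate):
                    candidate = option
        best[i] = candidate
    return best[0]


for unit, count in fewest_groups("ABCABCABCDEABCDEFABCABCABCB"):
    print(unit, count)
```

Unlike the regex, this also considers stopping a run early (ABC twice
rather than three times), which is what the former grouping needs. It
is roughly cubic in the length of the input, so it is only practical
for short sequences.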

hth,
   vbr