Groups > comp.lang.python > #107432 > unrolled thread

Detecting repeated subsequences of identical items

Started by	Steven D'Aprano <steve@pearwood.info>
First post	2016-04-21 13:07 +1000
Last post	2016-04-21 14:56 +0300
Articles	19 — 10 participants

Back to article view | Back to comp.lang.python

  Detecting repeated subsequences of identical items Steven D'Aprano <steve@pearwood.info> - 2016-04-21 13:07 +1000
    Re: Detecting repeated subsequences of identical items Ethan Furman <ethan@stoneleaf.us> - 2016-04-20 20:57 -0700
    Re: Detecting repeated subsequences of identical items Ethan Furman <ethan@stoneleaf.us> - 2016-04-20 21:15 -0700
    Re: Detecting repeated subsequences of identical items Chris Angelico <rosuav@gmail.com> - 2016-04-21 15:37 +1000
    Re: Detecting repeated subsequences of identical items Michael Selik <michael.selik@gmail.com> - 2016-04-21 06:35 +0000
      Re: Detecting repeated subsequences of identical items Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-04-21 18:05 +1000
        Re: Detecting repeated subsequences of identical items Nobody <nobody@nowhere.invalid> - 2016-04-21 13:02 +0100
    Re: Detecting repeated subsequences of identical items Michael Selik <michael.selik@gmail.com> - 2016-04-21 06:49 +0000
    Re: Detecting repeated subsequences of identical items Vlastimil Brom <vlastimil.brom@gmail.com> - 2016-04-21 08:54 +0200
    Re: Detecting repeated subsequences of identical items Michael Selik <michael.selik@gmail.com> - 2016-04-21 07:05 +0000
    Re: Detecting repeated subsequences of identical items Alain Ketterlin <alain@universite-de-strasbourg.fr.invalid> - 2016-04-21 09:25 +0200
    Re: Detecting repeated subsequences of identical items Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-04-21 09:53 +0100
      Re: Detecting repeated subsequences of identical items Steven D'Aprano <steve@pearwood.info> - 2016-04-21 22:15 +1000
        Re: Detecting repeated subsequences of identical items Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-04-21 15:01 +0100
        Re: Detecting repeated subsequences of identical items Chris Angelico <rosuav@gmail.com> - 2016-04-22 00:12 +1000
          Re: Detecting repeated subsequences of identical items Steven D'Aprano <steve@pearwood.info> - 2016-04-23 01:00 +1000
        Re: Detecting repeated subsequences of identical items Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-04-21 15:30 +0100
        Re: Detecting repeated subsequences of identical items Chris Angelico <rosuav@gmail.com> - 2016-04-22 01:02 +1000
    Re: Detecting repeated subsequences of identical items Serhiy Storchaka <storchaka@gmail.com> - 2016-04-21 14:56 +0300

#107432 — Detecting repeated subsequences of identical items

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-04-21 13:07 +1000
Subject	Detecting repeated subsequences of identical items
Message-ID	<571843f9$0$1585$c3e8da3$5496439d@news.astraweb.com>

I want to group repeated items in a sequence. For example, I can group
repeated sequences of a single item at a time using groupby:


from itertools import groupby
for key, group in groupby("AAAABBCDDEEEFFFF"):
    group = list(group)
    print(key, "count =", len(group))


outputs:

A count = 4
B count = 2
C count = 1
D count = 2
E count = 3
F count = 4


Now I want to group subsequences. For example, I have:

"ABCABCABCDEABCDEFABCABCABCB"

and I want to group it into repeating subsequences. I can see two ways to
group it:

ABC ABC ABCDE ABCDE F ABC ABC ABC B

giving counts:

(ABC) count = 2
(ABCDE) count = 2
F count = 1
(ABC) count = 3
B repeats 1 time


or:

ABC ABC ABC D E A B C D E F ABC ABC ABC B

giving counts:

(ABC) count = 3
D count = 1
E count = 1
A count = 1
B count = 1
C count = 1
D count = 1
E count = 1
F count = 1
(ABC) count = 3
B count = 1



How can I do this? Does this problem have a standard name and/or solution?




-- 
Steven

[toc] | [next] | [standalone]

#107434

From	Ethan Furman <ethan@stoneleaf.us>
Date	2016-04-20 20:57 -0700
Message-ID	<mailman.0.1461210969.23626.python-list@python.org>
In reply to	#107432

On 04/20/2016 08:07 PM, Steven D'Aprano wrote:

> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 2
> (ABCDE) count = 2
> F count = 1
> (ABC) count = 3
> B repeats 1 time

or

ABC ABC ABC D E A B C D E F ABC ABC B

--
~Ethan~

[toc] | [prev] | [next] | [standalone]

#107435

From	Ethan Furman <ethan@stoneleaf.us>
Date	2016-04-20 21:15 -0700
Message-ID	<mailman.1.1461212038.23626.python-list@python.org>
In reply to	#107432

On 04/20/2016 08:57 PM, Ethan Furman wrote:

 > [snip same pattern as Steven wrote]

Nevermind.  It's obviously time for me to go to bed.  :/

--
~Ethan~

[toc] | [prev] | [next] | [standalone]

#107436

From	Chris Angelico <rosuav@gmail.com>
Date	2016-04-21 15:37 +1000
Message-ID	<mailman.2.1461217068.23626.python-list@python.org>
In reply to	#107432

On Thu, Apr 21, 2016 at 1:07 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B

Interesting. I've *almost* managed to (ab)use re.split for this
purpose. A one-step solution can be done with re.match:

>>> txt = "ABCABCABCDEABCDEFABCABCABCB"
>>> re.match(r'(.+)\1+', txt)
<_sre.SRE_Match object; span=(0, 9), match='ABCABCABC'>

But split then returns only the grouped part:

>>> re.split(r'(.+)\1+', txt)
['', 'ABC', 'DEABCDEF', 'ABC', 'B']

or *all* the grouped parts:

>>> re.split(r'((.+)\2+)', txt)
['', 'ABCABCABC', 'ABC', 'DEABCDEF', 'ABCABCABC', 'ABC', 'B']

There's definitely a partial solution happening here, but I can't
quite make it work.

And no, I don't know if there's a standard name for it.

ChrisA

[toc] | [prev] | [next] | [standalone]

#107437

From	Michael Selik <michael.selik@gmail.com>
Date	2016-04-21 06:35 +0000
Message-ID	<mailman.3.1461220545.23626.python-list@python.org>
In reply to	#107432

On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano <steve@pearwood.info>
wrote:

> I want to group [repeated] subsequences. For example, I have:
> "ABCABCABCDEABCDEFABCABCABCB"
> and I want to group it into repeating subsequences. I can see two
> ways... How can I do this? Does this problem have a standard name and/or
> solution?
>

I'm not aware of a standard name. This sounds like an unsupervised learning
problem. There's no objectively correct answer unless you add more
specificity to the problem statement.

Regexes may sound tempting at first, but because a repeating subsequence
may have nested repeating subsequences and this can go on infinitely, I
think we at least need a push-down automata.

I checked out some links for clustering algorithms that work on series
subsequences and I found some fun results.

Clustering is meaningless!
http://www.cs.ucr.edu/~eamonn/meaningless.pdf

I think you're in "no free lunch" territory. "Clustering of subsequence
time series remains an open issue in time series clustering"
http://www.hindawi.com/journals/tswj/2014/312521/

Any more detail on the problem to add constraints?

[toc] | [prev] | [next] | [standalone]

#107443

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2016-04-21 18:05 +1000
Message-ID	<571889d8$0$2820$c3e8da3$76491128@news.astraweb.com>
In reply to	#107437

On Thursday 21 April 2016 16:35, Michael Selik wrote:

> On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano <steve@pearwood.info>
> wrote:
> 
>> I want to group [repeated] subsequences. For example, I have:
>> "ABCABCABCDEABCDEFABCABCABCB"
>> and I want to group it into repeating subsequences. I can see two
>> ways... How can I do this? Does this problem have a standard name and/or
>> solution?
>>
> 
> I'm not aware of a standard name. This sounds like an unsupervised
> learning problem. There's no objectively correct answer unless you add
> more specificity to the problem statement.
> 
> Regexes may sound tempting at first, 

Ah, I may have mislead you all. I cannot use regexes, since the *actual* 
sequences I'm working on are sequences (lists, tuples, etc) or iterators of 
arbitrary items. The items themselves may be strings, but the sequence 
itself is definitely not a string. I just showed a string for convenience. 
Sorry about that.

So no regexes.

> but because a repeating subsequence
> may have nested repeating subsequences and this can go on infinitely, I
> think we at least need a push-down automata.

Fortunately, for my *immediate* problem, I would be good with some fairly 
arbitrary restrictions on subsequence detection.

> I checked out some links for clustering algorithms that work on series
> subsequences and I found some fun results.
> 
> Clustering is meaningless!
> http://www.cs.ucr.edu/~eamonn/meaningless.pdf
> 
> I think you're in "no free lunch" territory. "Clustering of subsequence
> time series remains an open issue in time series clustering"
> http://www.hindawi.com/journals/tswj/2014/312521/
> 
> Any more detail on the problem to add constraints?

The specific problem I am trying to solve is that I have a sequence of 
strings (in this case, error messages from a Python traceback) and I'm 
looking for repeated groups that may indicate mutually recursive calls. E.g. 
suppose I have a function f which calls g, and g calls h, and h calls f 
again, and there's an exception, you will see a traceback in part:

  File "<stdin>", line 2, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h
  File "<stdin>", line 2, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h
  File "<stdin>", line 2, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h
  File "<stdin>", line 7, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h

etc. Note that I only care about lines which are identical, e.g. if the line 
numbers differ (as in the last call to f), they will be treated as distinct 
items. So I'd like to group the above as:

  File "<stdin>", line 2, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h
  *** above 3 calls repeated 3 times ***
  File "<stdin>", line 7, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h

-- 
Steve

[toc] | [prev] | [next] | [standalone]

#107449

From	Nobody <nobody@nowhere.invalid>
Date	2016-04-21 13:02 +0100
Message-ID	<pan.2016.04.21.12.03.44.178000@nowhere.invalid>
In reply to	#107443

On Thu, 21 Apr 2016 18:05:40 +1000, Steven D'Aprano wrote:

> The specific problem I am trying to solve is that I have a sequence of 
> strings (in this case, error messages from a Python traceback) and I'm 
> looking for repeated groups that may indicate mutually recursive calls. E.g. 
> suppose I have a function f which calls g, and g calls h, and h calls f 
> again, and there's an exception, you will see a traceback in part:

This is a specific case of finding cycles in a directed graph. But
treating it as such probably isn't useful here, because you're interested
in a specific traversal of that graph rather than the graph itself.

One way to approach it is:

sofar = []
for line in traceback:
    if line in sofar:
        j = sofar.index(line)
        if sofar[:j] == sofar[j:j*2]:
            # found repeat
    sofar = [line] + sofar

Note that sofar needs to be in reverse order, because list doesn't have
.rindex() or .rfind().

Detecting nested cycles is somewhat harder because given e.g.

	ababxabababababxababab

you'd want the five repeats of ab in the middle to be treated as two
repeats of ab followed by three repeats of ab, but there's no way to
spot that until later.

[toc] | [prev] | [next] | [standalone]

#107438

From	Michael Selik <michael.selik@gmail.com>
Date	2016-04-21 06:49 +0000
Message-ID	<mailman.4.1461221408.23626.python-list@python.org>
In reply to	#107432

On Thu, Apr 21, 2016 at 2:35 AM Michael Selik <michael.selik@gmail.com>
wrote:

> On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano <steve@pearwood.info>
> wrote:
>
>> I want to group [repeated] subsequences. For example, I have:
>> "ABCABCABCDEABCDEFABCABCABCB"
>> and I want to group it into repeating subsequences. I can see two
>> ways... How can I do this? Does this problem have a standard name and/or
>> solution?
>>
>
> I'm not aware of a standard name. This sounds like an unsupervised
> learning problem. There's no objectively correct answer unless you add more
> specificity to the problem statement.
>
> Regexes may sound tempting at first, but because a repeating subsequence
> may have nested repeating subsequences and this can go on infinitely, I
> think we at least need a push-down automata.
>
> I checked out some links for clustering algorithms that work on series
> subsequences and I found some fun results.
>
> Clustering is meaningless!
> http://www.cs.ucr.edu/~eamonn/meaningless.pdf
>
> I think you're in "no free lunch" territory. "Clustering of subsequence
> time series remains an open issue in time series clustering"
> http://www.hindawi.com/journals/tswj/2014/312521/
>
> Any more detail on the problem to add constraints?
>

Some light reading suggests that you can improve your problem by defining a
minimum size for a subsequence to qualify. One paper suggests calling these
more interesting repetitions a "motif" to use a music metaphor. Looking for
any repetitions results in too many trivial results. Is that valid for your
usage?

[toc] | [prev] | [next] | [standalone]

#107439

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2016-04-21 08:54 +0200
Message-ID	<mailman.5.1461221647.23626.python-list@python.org>
In reply to	#107432

2016-04-21 5:07 GMT+02:00 Steven D'Aprano <steve@pearwood.info>:
> I want to group repeated items in a sequence. For example, I can group
> repeated sequences of a single item at a time using groupby:
>
>
> from itertools import groupby
> for key, group in groupby("AAAABBCDDEEEFFFF"):
>     group = list(group)
>     print(key, "count =", len(group))
>
>
> outputs:
>
> A count = 4
> B count = 2
> C count = 1
> D count = 2
> E count = 3
> F count = 4
>
>
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 2
> (ABCDE) count = 2
> F count = 1
> (ABC) count = 3
> B repeats 1 time
>
>
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 3
> D count = 1
> E count = 1
> A count = 1
> B count = 1
> C count = 1
> D count = 1
> E count = 1
> F count = 1
> (ABC) count = 3
> B count = 1
>
>
>
> How can I do this? Does this problem have a standard name and/or solution?
>
>
>
>
> --
> Steven
>
> --
> https://mail.python.org/mailman/listinfo/python-list

Hi,
if I am not missing something, the latter form of grouping might be
achieved with the following regex:

t="ABCABCABCDEABCDEFABCABCABCB"
grouped = re.findall(r"((?:(\w+?)\2+)|\w+?)", t)
print(grouped)
for grp, subseq in grouped:
    if subseq:
        print(subseq, grp.count(subseq))
    else:
        print(grp, "1")


the printed output is:

[('ABCABCABC', 'ABC'), ('D', ''), ('E', ''), ('A', ''), ('B', ''),
('C', ''), ('D', ''), ('E', ''), ('F', ''), ('ABCABCABC', 'ABC'),
('B', '')]
ABC 3
D 1
E 1
A 1
B 1
C 1
D 1
E 1
F 1
ABC 3
B 1

The former one seems to be more tricky...

hth,
   vbr

[toc] | [prev] | [next] | [standalone]

#107440

From	Michael Selik <michael.selik@gmail.com>
Date	2016-04-21 07:05 +0000
Message-ID	<mailman.6.1461222341.23626.python-list@python.org>
In reply to	#107432

On Thu, Apr 21, 2016 at 2:55 AM Vlastimil Brom <vlastimil.brom@gmail.com>
wrote:

> 2016-04-21 5:07 GMT+02:00 Steven D'Aprano <steve@pearwood.info>:
> > I want to group subsequences.
> > "ABCABCABCDEABCDEFABCABCABCB"
> > ABC ABC ABCDE ABCDE F ABC ABC ABC B
> > or:
> > ABC ABC ABC D E A B C D E F ABC ABC ABC B
>
> if I am not missing something, the latter form of grouping might be
> achieved with the following regex: [snip]
> The former one seems to be more tricky...
>

Right. If the problem is constrained to say that repeated subsequences can
have no nested repeated subsequences, it's much easier to solve.

If you had "ABCABCABCABC" should that result in
ABC ABC ABC ABC, with 4 repetitions
or ABCABC ABCABC with 2 repetitions?
In this example, one might say the higher count is obviously better, but I
think it depends on the context. Maybe the user is looking for the biggest
patterns rather than the biggest counts.

[toc] | [prev] | [next] | [standalone]

#107441

From	Alain Ketterlin <alain@universite-de-strasbourg.fr.invalid>
Date	2016-04-21 09:25 +0200
Message-ID	<87inzbjueb.fsf@universite-de-strasbourg.fr.invalid>
In reply to	#107432

Steven D'Aprano <steve@pearwood.info> writes:

> I want to group repeated items in a sequence. For example, I can group
> repeated sequences of a single item at a time using groupby:
[...]
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
[...]
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B
[...]
> How can I do this? Does this problem have a standard name and/or solution?

Looks like a tough one. I don't think it has a name (I'm not even sure
to be able to formally define it). Lets say it is an instance of
"longest repeating substring"
(https://en.wikipedia.org/wiki/Longest_repeated_substring_problem --
which really does not say much).

Anyway, it looks like a job for a suffix trees.

Depending on what you are after, you may also be interested in the
sequitur algorithm (http://www.sequitur.info/).

-- Alain.

[toc] | [prev] | [next] | [standalone]

#107444

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2016-04-21 09:53 +0100
Message-ID	<mailman.7.1461228807.23626.python-list@python.org>
In reply to	#107432

On 21 April 2016 at 04:07, Steven D'Aprano <steve@pearwood.info> wrote:
> I want to group repeated items in a sequence. For example, I can group
> repeated sequences of a single item at a time using groupby:
>
>
> from itertools import groupby
> for key, group in groupby("AAAABBCDDEEEFFFF"):
>     group = list(group)
>     print(key, "count =", len(group))
>
>
> outputs:
>
> A count = 4
> B count = 2
> C count = 1
> D count = 2
> E count = 3
> F count = 4
>
>
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B

There are some algorithms (helpfully shown in Python) here:

https://en.wikipedia.org/wiki/Cycle_detection

Note that those are for a sequence made as x[n+1] = f(x[n]) for some
function f. In your case that's just the function that gets the next
frame up/down in the call stack.

--
Oscar

--
Oscar

[toc] | [prev] | [next] | [standalone]

#107450

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-04-21 22:15 +1000
Message-ID	<5718c457$0$1605$c3e8da3$5496439d@news.astraweb.com>
In reply to	#107444

On Thu, 21 Apr 2016 06:53 pm, Oscar Benjamin wrote:

> On 21 April 2016 at 04:07, Steven D'Aprano <steve@pearwood.info> wrote:
>> I want to group repeated items in a sequence. For example, I can group
>> repeated sequences of a single item at a time using groupby:
>>
>>
>> from itertools import groupby
>> for key, group in groupby("AAAABBCDDEEEFFFF"):
>>     group = list(group)
>>     print(key, "count =", len(group))
>>
>>
>> outputs:
>>
>> A count = 4
>> B count = 2
>> C count = 1
>> D count = 2
>> E count = 3
>> F count = 4
>>
>>
>> Now I want to group subsequences. For example, I have:
>>
>> "ABCABCABCDEABCDEFABCABCABCB"
>>
>> and I want to group it into repeating subsequences. I can see two ways to
>> group it:
>>
>> ABC ABC ABCDE ABCDE F ABC ABC ABC B
> 
> There are some algorithms (helpfully shown in Python) here:
> 
> https://en.wikipedia.org/wiki/Cycle_detection

It's not necessarily a cycle though. Consider a sequence of function calls:

def f(x):
    return g(x)

def g(x):
    if x < 7:
        return h(x)
    elif x < 50:
        return g(x//2)
    else:
        return x+f(x-1)

def h(x):
    raise ValueError  # oops, a bug


and a function call f(54). That will result in the chain of calls:

f(54) -> g(54) -> f(53) -> g(53) -> f(52) -> g(52) -> f(51) -> g(51) ->
f(50) -> g(50) -> g(25) -> g(12) -> g(6) -> h(6) raises

So you have that almost-cycle f calls g calls f, but it isn't periodic
because you break out of it. I'd still like to detect the repeated f->g
calls.




-- 
Steven

[toc] | [prev] | [next] | [standalone]

#107452

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2016-04-21 15:01 +0100
Message-ID	<mailman.11.1461247315.23626.python-list@python.org>
In reply to	#107450

On 21 April 2016 at 13:15, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 21 Apr 2016 06:53 pm, Oscar Benjamin wrote:
>
>> On 21 April 2016 at 04:07, Steven D'Aprano <steve@pearwood.info> wrote:
>>> I want to group repeated items in a sequence. For example, I can group
>>> repeated sequences of a single item at a time using groupby:
>>>
>>>
>>> from itertools import groupby
>>> for key, group in groupby("AAAABBCDDEEEFFFF"):
>>>     group = list(group)
>>>     print(key, "count =", len(group))
>>>
>>>
>>> outputs:
>>>
>>> A count = 4
>>> B count = 2
>>> C count = 1
>>> D count = 2
>>> E count = 3
>>> F count = 4
>>>
>>>
>>> Now I want to group subsequences. For example, I have:
>>>
>>> "ABCABCABCDEABCDEFABCABCABCB"
>>>
>>> and I want to group it into repeating subsequences. I can see two ways to
>>> group it:
>>>
>>> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>>
>> There are some algorithms (helpfully shown in Python) here:
>>
>> https://en.wikipedia.org/wiki/Cycle_detection
>
> It's not necessarily a cycle though. Consider a sequence of function calls:
>
> def f(x):
>     return g(x)
>
> def g(x):
>     if x < 7:
>         return h(x)
>     elif x < 50:
>         return g(x//2)
>     else:
>         return x+f(x-1)
>
> def h(x):
>     raise ValueError  # oops, a bug
>
>
> and a function call f(54). That will result in the chain of calls:
>
> f(54) -> g(54) -> f(53) -> g(53) -> f(52) -> g(52) -> f(51) -> g(51) ->
> f(50) -> g(50) -> g(25) -> g(12) -> g(6) -> h(6) raises
>
> So you have that almost-cycle f calls g calls f, but it isn't periodic
> because you break out of it. I'd still like to detect the repeated f->g
> calls.

It doesn't matter that you break out of it. It's periodic for some
part and the algorithms listed on that page can find the cycle. In the
recursive stack overflow case what you'll usually have is

1) A few frames leading up to the start of recursion
2) A long repetitive sequence of frames
3) A few frames at the end showing how the exception was ultimately triggered.

You just need to find the cycle that makes that big long sequence.

--
Oscar

[toc] | [prev] | [next] | [standalone]

#107453

From	Chris Angelico <rosuav@gmail.com>
Date	2016-04-22 00:12 +1000
Message-ID	<mailman.12.1461247946.23626.python-list@python.org>
In reply to	#107450

On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
> In the recursive stack overflow case what you'll usually have is
>
> 1) A few frames leading up to the start of recursion
> 2) A long repetitive sequence of frames
> 3) A few frames at the end showing how the exception was ultimately triggered.
>
> You just need to find the cycle that makes that big long sequence.

If the stack got overflowed, there won't usually be a part 3, as part
2 is the bit that hits sys.recursionlimit (unless increasing the
recursion limit by a finite number would solve the problem). For other
exceptions, yes, this is what you'd see.

ChrisA

[toc] | [prev] | [next] | [standalone]

#107494

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-04-23 01:00 +1000
Message-ID	<571a3c78$0$1597$c3e8da3$5496439d@news.astraweb.com>
In reply to	#107453

On Fri, 22 Apr 2016 12:12 am, Chris Angelico wrote:

> On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin
> <oscar.j.benjamin@gmail.com> wrote:
>> In the recursive stack overflow case what you'll usually have is
>>
>> 1) A few frames leading up to the start of recursion
>> 2) A long repetitive sequence of frames
>> 3) A few frames at the end showing how the exception was ultimately
>> triggered.
>>
>> You just need to find the cycle that makes that big long sequence.

I am not convinced that it is a cycle in the technical sense, or that the
two algorithms that Oscar linked to will necessarily find them. They may,
by chance, happen to find them sometimes, but the way the algorithms work
is by detecting periodic behaviour, and this is not periodic.

Look at the comment for Floyd's algorithm here:

https://en.wikipedia.org/wiki/Cycle_detection

    # The hare moves twice as quickly as the tortoise and
    # the distance between them increases by 1 at each step.
    # Eventually they will both be inside the cycle and then, ...

but that is violated in this scenario. The hare can escape the (non-)cycle
before the tortoise even enters it. Even if both enter the "cycle", there's
no guarantee that the termination condition `tortoise == hare` will ever be
true because the cycle doesn't loop and doesn't go back to the beginning.
Hence it is not a cycle.

> If the stack got overflowed, there won't usually be a part 3, as part 
> 2 is the bit that hits sys.recursionlimit (unless increasing the
> recursion limit by a finite number would solve the problem). 

Don't just think of breaking the recursion limit. You might be 100 calls
deep in some complex chain of function calls when something raises
ValueError. In principle, there might not even be any recursive calls at
all.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#107454

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2016-04-21 15:30 +0100
Message-ID	<mailman.13.1461249058.23626.python-list@python.org>
In reply to	#107450

On 21 April 2016 at 15:12, Chris Angelico <rosuav@gmail.com> wrote:
> On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin
> <oscar.j.benjamin@gmail.com> wrote:
>> In the recursive stack overflow case what you'll usually have is
>>
>> 1) A few frames leading up to the start of recursion
>> 2) A long repetitive sequence of frames
>> 3) A few frames at the end showing how the exception was ultimately triggered.
>>
>> You just need to find the cycle that makes that big long sequence.
>
> If the stack got overflowed, there won't usually be a part 3, as part
> 2 is the bit that hits sys.recursionlimit (unless increasing the
> recursion limit by a finite number would solve the problem). For other
> exceptions, yes, this is what you'd see.

If you have:

def f(x):
    return g(x+1)

def g(x):
    x = h(x)  # <-- stack can overflow inside here
    return f(x+1)

# etc.

So you have a long sequence that goes f, g, f, g but at the end the
stack can overflow while (not recursively) calling h leaving a small
non-cyclic part at the end.

--
Oscar

[toc] | [prev] | [next] | [standalone]

#107455

From	Chris Angelico <rosuav@gmail.com>
Date	2016-04-22 01:02 +1000
Message-ID	<mailman.16.1461250956.23626.python-list@python.org>
In reply to	#107450

On Fri, Apr 22, 2016 at 12:30 AM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
> On 21 April 2016 at 15:12, Chris Angelico <rosuav@gmail.com> wrote:
>> On Fri, Apr 22, 2016 at 12:01 AM, Oscar Benjamin
>> <oscar.j.benjamin@gmail.com> wrote:
>>> In the recursive stack overflow case what you'll usually have is
>>>
>>> 1) A few frames leading up to the start of recursion
>>> 2) A long repetitive sequence of frames
>>> 3) A few frames at the end showing how the exception was ultimately triggered.
>>>
>>> You just need to find the cycle that makes that big long sequence.
>>
>> If the stack got overflowed, there won't usually be a part 3, as part
>> 2 is the bit that hits sys.recursionlimit (unless increasing the
>> recursion limit by a finite number would solve the problem). For other
>> exceptions, yes, this is what you'd see.
>
> If you have:
>
> def f(x):
>     return g(x+1)
>
> def g(x):
>     x = h(x)  # <-- stack can overflow inside here
>     return f(x+1)
>
> # etc.
>
> So you have a long sequence that goes f, g, f, g but at the end the
> stack can overflow while (not recursively) calling h leaving a small
> non-cyclic part at the end.

Right, good point. Forgot about that. So this situation can be
triggered by anywhere up to <cycle length> lines up. Not as helpful an
optimization now. Oh well.

ChrisA

[toc] | [prev] | [next] | [standalone]

#107448

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2016-04-21 14:56 +0300
Message-ID	<mailman.10.1461239778.23626.python-list@python.org>
In reply to	#107432

On 21.04.16 06:07, Steven D'Aprano wrote:
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences.

[...]

> How can I do this? Does this problem have a standard name and/or solution?

This is a part of lossless data compression algorithms. See for example 
LZ, LZW.

[toc] | [prev] | [standalone]

csiph-web

Detecting repeated subsequences of identical items

Contents

#107432 — Detecting repeated subsequences of identical items

#107434

#107435

#107436

#107437

#107443

#107449

#107438

#107439

#107440

#107441

#107444

#107450

#107452

#107453

#107494

#107454

#107455

#107448