
Extracting subsequences composed of the same character

Started by	candide <candide@free.invalid>
First post	2011-04-01 02:43 +0200
Last post	2011-04-01 21:39 +0200
Articles	7 — 5 participants


  Extracting subsequences composed of the same character candide <candide@free.invalid> - 2011-04-01 02:43 +0200
    Re: Extracting subsequences composed of the same character MRAB <python@mrabarnett.plus.com> - 2011-04-01 02:16 +0100
    Re: Extracting subsequences composed of the same character Roy Smith <roy@panix.com> - 2011-03-31 21:40 -0400
    Re: Extracting subsequences composed of the same character Tim Chase <python.list@tim.thechases.com> - 2011-03-31 20:58 -0500
    Re: Extracting subsequences composed of the same character Tim Chase <python.list@tim.thechases.com> - 2011-03-31 21:20 -0500
    Re: Extracting subsequences composed of the same character Terry Reedy <tjreedy@udel.edu> - 2011-04-01 00:18 -0400
    Re: Extracting subsequences composed of the same character candide <candide@free.invalid> - 2011-04-01 21:39 +0200

#2322 — Extracting subsequences composed of the same character

From	candide <candide@free.invalid>
Date	2011-04-01 02:43 +0200
Subject	Extracting subsequences composed of the same character
Message-ID	<4d952008$0$3943$426a74cc@news.free.fr>

Suppose you have a string, for instance

"pyyythhooonnn ---> ++++"

and you search for the subsequences composed of the same character; here
you get:

'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

It's not difficult to write Python code that solves the problem, for
instance:

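def f(text):
    r = []          # runs found so far
    if not text:
        return r
    x = text[0]     # character of the current run
    i = 0           # length of the current run
    for c in text:
        if c != x:
            # The run of x has ended; keep it only if it is longer than 1.
            if i > 1:
                r.append(x * i)
            x = c
            i = 1
        else:
            i += 1
    # Flush the final run.
    if i > 1:
        r.append(x * i)
    return r

print(f("pyyythhooonnn ---> ++++"))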


I must confess that this code is rather cumbersome, so I was looking
for an alternative. I imagine that a regular-expression approach could
provide a better method. Does such code exist? Note that the string
is not restricted to the ASCII charset.


#2326

From	MRAB <python@mrabarnett.plus.com>
Date	2011-04-01 02:16 +0100
Message-ID	<mailman.59.1301620676.2990.python-list@python.org>
In reply to	#2322

On 01/04/2011 01:43, candide wrote:
> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subsequences composed of the same character; here
> you get:
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'
>
> It's not difficult to write Python code that solves the problem, for
> instance:
>
[snip]
>
> I must confess that this code is rather cumbersome, so I was looking
> for an alternative. I imagine that a regular-expression approach could
> provide a better method. Does such code exist? Note that the string
> is not restricted to the ASCII charset.

 >>> import re
 >>> s = "pyyythhooonnn ---> ++++"
 >>> re.findall(r"((.)\2+)", s)
[('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---', '-'), ('++++', '+')]
 >>> [m[0] for m in re.findall(r"((.)\2+)", s)]
['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
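
The outer parentheses matter because `re.findall` returns the captured groups rather than the whole match whenever the pattern contains groups. A minimal sketch of the difference, assuming Python 3 (the variable names are only illustrative):

```python
import re

s = "pyyythhooonnn ---> ++++"

# With a single group, findall returns only that group: the repeated character.
chars_only = re.findall(r"(.)\1+", s)
print(chars_only)   # ['y', 'h', 'o', 'n', '-', '+']

# Wrapping the whole pattern in an outer group shifts the inner group to
# number 2 and makes findall return (run, character) tuples.
runs = [run for run, char in re.findall(r"((.)\2+)", s)]
print(runs)         # ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
```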


#2327

From	Roy Smith <roy@panix.com>
Date	2011-03-31 21:40 -0400
Message-ID	<roy-7F3220.21403831032011@news.panix.com>
In reply to	#2322

In article <4d952008$0$3943$426a74cc@news.free.fr>,
 candide <candide@free.invalid> wrote:

> Suppose you have a string, for instance
> 
> "pyyythhooonnn ---> ++++"
> 
> and you search for the subsequences composed of the same character; here
> you get:
> 
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

I got the following. It's O(n) (with the minor exception that the string 
addition isn't, but that's trivial to fix, and in practice, the bunches 
are short enough it hardly matters).

#!/usr/bin/env python                                                                               

s = "pyyythhooonnn ---> ++++"
answer = ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

last = None
bunches = []
bunch = ''
for c in s:
    if c == last:
        bunch += c
    else:
        if bunch:
            bunches.append(bunch)
        bunch = c
        last = c
bunches.append(bunch)

multiples = [bunch for bunch in bunches if len(bunch) > 1]
print(multiples)
assert multiples == answer
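
The repeated string addition mentioned above can be avoided by remembering where the current bunch starts and slicing once per bunch. A sketch of that fix, assuming Python 3; the function name `multiples_of` is hypothetical and not part of the original post:

```python
def multiples_of(s):
    """Return runs of length > 1, slicing each run once instead of growing a string."""
    multiples = []
    start = 0  # index where the current run begins
    for i in range(1, len(s) + 1):
        # A run ends at the end of the string or where the character changes.
        if i == len(s) or s[i] != s[start]:
            if i - start > 1:
                multiples.append(s[start:i])
            start = i
    return multiples

print(multiples_of("pyyythhooonnn ---> ++++"))
```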


[eagerly awaiting a PEP for collections.bunch and 
collections.frozenbunch]


#2329

From	Tim Chase <python.list@tim.thechases.com>
Date	2011-03-31 20:58 -0500
Message-ID	<mailman.61.1301623137.2990.python-list@python.org>
In reply to	#2322

On 03/31/2011 07:43 PM, candide wrote:
> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subsequences composed of the same character; here
> you get:
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

 >>> import re
 >>> s = "pyyythhooonnn ---> ++++"
 >>> [m.group(0) for m in re.finditer(r"(.)\1+", s)]
['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
 >>> [(m.group(0),m.group(1)) for m in re.finditer(r"(.)\1+", s)]
[('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---', '-'), ('++++', '+')]
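
On the original remark that the string is not restricted to ASCII: under Python 3, where `str` holds Unicode code points, the same pattern should work unchanged (in Python 2 both the pattern and the string would need to be `unicode` objects). A small check, assuming Python 3; note that the comparison is per code point, so a letter followed by a combining accent counts as two different characters:

```python
import re

# \u00e9 is "é" and \u2192 is "→", written as escapes to keep the source ASCII-only.
s = "\u00e9\u00e9\u00e9t\u00e9 \u2192\u2192 ab"
runs = [m.group(0) for m in re.finditer(r"(.)\1+", s)]
print(runs)
```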

-tkc


#2330

From	Tim Chase <python.list@tim.thechases.com>
Date	2011-03-31 21:20 -0500
Message-ID	<mailman.62.1301624436.2990.python-list@python.org>
In reply to	#2322

On 03/31/2011 07:43 PM, candide wrote:
> "pyyythhooonnn ---> ++++"
>
> and you search for the subsequences composed of the same character; here
> you get:
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

Or, if you want to do it with itertools instead of the "re" module:

 >>> s = "pyyythhooonnn ---> ++++"
 >>> from itertools import groupby
 >>> [c*length for c, length in ((k, len(list(g))) for k, g in
 ...  groupby(s)) if length > 1]
['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
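
With no key function, `groupby` groups consecutive equal items, so each group is exactly one run. A sketch showing the intermediate (character, length) pairs, assuming Python 3 and a shorter, made-up input:

```python
from itertools import groupby

s = "aabccc"  # illustrative input, not from the thread
# Each g is an iterator over one run; it has to be consumed before
# groupby advances to the next group, hence list(g) right away.
pairs = [(k, len(list(g))) for k, g in groupby(s)]
print(pairs)  # [('a', 2), ('b', 1), ('c', 3)]

runs = [k * n for k, n in pairs if n > 1]
print(runs)   # ['aa', 'ccc']
```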


-tkc


#2333

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-04-01 00:18 -0400
Message-ID	<mailman.63.1301631548.2990.python-list@python.org>
In reply to	#2322

On 3/31/2011 10:20 PM, Tim Chase wrote:
> On 03/31/2011 07:43 PM, candide wrote:
>> "pyyythhooonnn ---> ++++"
>>
>> and you search for the subsequences composed of the same character; here
>> you get:
>>
>> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'
>
> Or, if you want to do it with itertools instead of the "re" module:
>
>  >>> s = "pyyythhooonnn ---> ++++"
>  >>> from itertools import groupby
>  >>> [c*length for c, length in ((k, len(list(g))) for k, g in
> groupby(s)) if length > 1]
> ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

Slightly shorter:
[r for r in (''.join(g) for k, g in groupby(s)) if len(r) > 1]
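
One difference between the `re` and `itertools` answers is worth knowing: `.` in a regular expression does not match a newline unless `re.DOTALL` is given, whereas `groupby` treats every character alike. A quick comparison, assuming Python 3; the function names `runs_groupby` and `runs_regex` are hypothetical:

```python
import re
from itertools import groupby

def runs_groupby(s):
    # Join each group back into a string and keep those longer than 1.
    return [r for r in ("".join(g) for k, g in groupby(s)) if len(r) > 1]

def runs_regex(s, flags=0):
    # (.) captures one character, \1+ requires at least one more copy of it.
    return [m.group(0) for m in re.finditer(r"(.)\1+", s, flags)]

sample = "pyyythhooonnn ---> ++++"
print(runs_groupby(sample) == runs_regex(sample))   # True

text = "aa\n\nbb"
print(runs_groupby(text))            # ['aa', '\n\n', 'bb']
print(runs_regex(text))              # ['aa', 'bb']
print(runs_regex(text, re.DOTALL))   # ['aa', '\n\n', 'bb']
```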

-- 
Terry Jan Reedy


#2389

From	candide <candide@free.invalid>
Date	2011-04-01 21:39 +0200
Message-ID	<4d962a3e$0$14990$426a34cc@news.free.fr>
In reply to	#2322

Thanks, your responses gave me the opportunity to understand the
"backreference" feature; it had not been clear to me in spite of my
intensive study of the well-known RE HOWTO.

