Extracting repeated words

Started by	candide <candide@free.invalid>
First post	2011-04-01 22:54 +0200
Last post	2011-04-02 15:18 +0200
Articles	3 — 2 participants

  Extracting repeated words candide <candide@free.invalid> - 2011-04-01 22:54 +0200
    Re: Extracting repeated words Ian Kelly <ian.g.kelly@gmail.com> - 2011-04-01 16:42 -0600
      Re: Extracting repeated words candide <candide@free.invalid> - 2011-04-02 15:18 +0200

#2400 — Extracting repeated words

From	candide <candide@free.invalid>
Date	2011-04-01 22:54 +0200
Subject	Extracting repeated words
Message-ID	<4d963bfa$0$1584$426a34cc@news.free.fr>

Another question related to regular expressions.

How can I extract all duplicated words in a given text using regular
expression methods? To make the question concrete, if the text is

------------------
Now is better than never.
Although never is often better than *right* now.
------------------

the duplicates are:

------------------------
better is now than never
------------------------

Some code can solve the question, for instance

# ------------------
import re

regexp=r"\w+"

c=re.compile(regexp, re.IGNORECASE)

text="""
Now is better than never.
Although never is often better than *right* now."""

z=[s.lower() for s in c.findall(text)]

for d in set([s for s in z if z.count(s)>1]):
     print d,
# ------------------

but I'm in search of "plain" re code.

#2412

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-04-01 16:42 -0600
Message-ID	<mailman.108.1301697810.2990.python-list@python.org>
In reply to	#2400

On Fri, Apr 1, 2011 at 2:54 PM, candide <candide@free.invalid> wrote:
> Another question related to regular expressions.
>
> How can I extract all duplicated words in a given text using regular
> expression methods? To make the question concrete, if the text is
>
> ------------------
> Now is better than never.
> Although never is often better than *right* now.
> ------------------
>
> the duplicates are:
>
> ------------------------
> better is now than never
> ------------------------
>
> Some code can solve the question, for instance
>
> # ------------------
> import re
>
> regexp=r"\w+"
>
> c=re.compile(regexp, re.IGNORECASE)
>
> text="""
> Now is better than never.
> Although never is often better than *right* now."""
>
> z=[s.lower() for s in c.findall(text)]
>
> for d in set([s for s in z if z.count(s)>1]):
>    print d,
> # ------------------
>
> but I'm in search of "plain" re code.

You could use a look-ahead assertion with a captured group:

>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>> c.findall(text)

But note that this is computationally expensive, since the look-ahead
rescans the rest of the text for every word.  The approach that you
posted is probably more efficient, especially if you use a
collections.Counter object instead of z.count.
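
As an illustration of the Counter suggestion, here is a minimal sketch in Python 3 syntax (the thread's code is Python 2). The variable names `words`, `counts` and `duplicates` are not from the original post, and the first-appearance ordering assumes Python 3.7 or later, where Counter keeps insertion order.

```python
import re
from collections import Counter

text = """
Now is better than never.
Although never is often better than *right* now."""

# Lowercase once so that "Now" and "now" count as the same word.
words = re.findall(r"\w+", text.lower())

# Counter tallies every word in a single pass over the list,
# instead of running one z.count() scan per word.
counts = Counter(words)

# Keep the words seen more than once, in order of first appearance.
duplicates = [word for word, n in counts.items() if n > 1]

print(" ".join(duplicates))  # now is better than never
```

The counting step is linear in the number of words, whereas calling z.count inside the comprehension rescans the whole list for each word.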

Cheers,
Ian

#2452

From	candide <candide@free.invalid>
Date	2011-04-02 15:18 +0200
Message-ID	<4d972274$0$4785$426a74cc@news.free.fr>
In reply to	#2412

On 02/04/2011 00:42, Ian Kelly wrote:

> You could use a look-ahead assertion with a captured group:
>
>>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>>> c.findall(text)

It works fine; lookahead assertions in action are exactly what I was
looking for. Many thanks.
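
For readers who want to see what the look-ahead pattern returns, here is a small sketch in Python 3 syntax. The `sample` string and the `unique` post-processing step are assumptions added for illustration; they are not part of the thread.

```python
import re

text = """
Now is better than never.
Although never is often better than *right* now."""

# A word is reported when the same word (ignoring case) occurs again
# somewhere later in the text; DOTALL lets ".+" cross line breaks.
pattern = re.compile(
    r"\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)",
    re.IGNORECASE | re.DOTALL,
)

matches = pattern.findall(text)
print(matches)  # ['Now', 'is', 'better', 'than', 'never']

# Caveat: a word that occurs three times is reported at each occurrence
# that still has a later copy, so it shows up twice here.
sample = "spam eggs spam ham spam"  # hypothetical input
repeated = pattern.findall(sample)
print(repeated)  # ['spam', 'spam']

# Lowercasing into a set removes those repeats if each duplicated
# word should be listed only once.
unique = sorted({word.lower() for word in repeated})
print(unique)  # ['spam']
```

Each match keeps the case of the earlier occurrence, which is why the first result contains 'Now' and not 'now'.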
