


Re: Extracting repeated words

References <4d963bfa$0$1584$426a34cc@news.free.fr>
From Ian Kelly <ian.g.kelly@gmail.com>
Date 2011-04-01 16:42 -0600
Subject Re: Extracting repeated words
Newsgroups comp.lang.python
Message-ID <mailman.108.1301697810.2990.python-list@python.org>



On Fri, Apr 1, 2011 at 2:54 PM, candide <candide@free.invalid> wrote:
> Another question about regular expressions.
>
> How can I extract all duplicated words in a given text using regular
> expression methods?  To make the question concrete, if the text is
>
> ------------------
> Now is better than never.
> Although never is often better than *right* now.
> ------------------
>
> the duplicates are:
>
> ------------------------
> better is now than never
> ------------------------
>
> Some code can solve the question, for instance
>
> # ------------------
> import re
>
> regexp=r"\w+"
>
> c=re.compile(regexp, re.IGNORECASE)
>
> text="""
> Now is better than never.
> Although never is often better than *right* now."""
>
> z=[s.lower() for s in c.findall(text)]
>
> for d in set([s for s in z if z.count(s)>1]):
>    print d,
> # ------------------
>
> but I'm in search of "plain" re code.

You could use a look-ahead assertion with a captured group:

>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>> c.findall(text)
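
For reference, here is a self-contained sketch of the same look-ahead idea. It assumes Python 3 (the quoted code is Python 2) and the sample text from the question; the names pattern, matches and duplicates are only illustrative. Since a word that occurs three or more times would be matched more than once, the sketch also lower-cases and de-duplicates the matches.

```python
import re

text = """
Now is better than never.
Although never is often better than *right* now."""

# Look-ahead pattern from the post: a word matches only if the same word
# (compared case-insensitively) appears again somewhere later in the text.
pattern = re.compile(r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)',
                     re.IGNORECASE | re.DOTALL)

# With a single group in the pattern, findall returns the captured words.
matches = pattern.findall(text)

# A word occurring three or more times would be reported more than once,
# so normalise case and de-duplicate while keeping first-seen order.
duplicates = list(dict.fromkeys(m.lower() for m in matches))
print(duplicates)
```

On the sample text this should print ['now', 'is', 'better', 'than', 'never'].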

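But note that this is computationally expensive: for every word, the
look-ahead has to rescan the rest of the text.  The regex that you
posted is probably more efficient, particularly if you use a
collections.Counter object instead of z.count, which rescans the whole
list for every word.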
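
A minimal sketch of the Counter variant, assuming Python 3 and the sample text from the question (the names words, counts and duplicates are only illustrative). It counts every word in a single pass and then keeps those seen more than once:

```python
import re
from collections import Counter

text = """
Now is better than never.
Although never is often better than *right* now."""

# Same tokenisation as the quoted code: runs of word characters, lower-cased.
words = [w.lower() for w in re.findall(r"\w+", text)]

# One pass over the list to tally every word.
counts = Counter(words)

# Keep only the words that occur more than once.
duplicates = [word for word, n in counts.items() if n > 1]
print(duplicates)
```

This is not "plain" re code, but it avoids calling z.count once per word.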

Cheers,
Ian


