Groups > comp.lang.python > #2400 > unrolled thread
| Started by | candide <candide@free.invalid> |
|---|---|
| First post | 2011-04-01 22:54 +0200 |
| Last post | 2011-04-02 15:18 +0200 |
| Articles | 3 — 2 participants |
Extracting repeated words candide <candide@free.invalid> - 2011-04-01 22:54 +0200
Re: Extracting repeated words Ian Kelly <ian.g.kelly@gmail.com> - 2011-04-01 16:42 -0600
Re: Extracting repeated words candide <candide@free.invalid> - 2011-04-02 15:18 +0200
| From | candide <candide@free.invalid> |
|---|---|
| Date | 2011-04-01 22:54 +0200 |
| Subject | Extracting repeated words |
| Message-ID | <4d963bfa$0$1584$426a34cc@news.free.fr> |
Another question about regular expressions.
How can I extract all duplicated words from a given text using regular
expression methods? To make the question concrete, if the text is
------------------
Now is better than never.
Although never is often better than *right* now.
------------------
duplicates are :
------------------------
better is now than never
------------------------
Some code can solve the question, for instance
# ------------------
import re
regexp=r"\w+"
c=re.compile(regexp, re.IGNORECASE)
text="""
Now is better than never.
Although never is often better than *right* now."""
z=[s.lower() for s in c.findall(text)]
for d in set([s for s in z if z.count(s)>1]):
    print d,
# ------------------
but I'm in search of "plain" re code.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-04-01 16:42 -0600 |
| Message-ID | <mailman.108.1301697810.2990.python-list@python.org> |
| In reply to | #2400 |
On Fri, Apr 1, 2011 at 2:54 PM, candide <candide@free.invalid> wrote:
> Another question relative to regular expressions.
>
> How to extract all word duplicates in a given text by use of regular
> expression methods ? To make the question concrete, if the text is
>
> ------------------
> Now is better than never.
> Although never is often better than *right* now.
> ------------------
>
> duplicates are :
>
> ------------------------
> better is now than never
> ------------------------
>
> Some code can solve the question, for instance
>
> # ------------------
> import re
>
> regexp=r"\w+"
>
> c=re.compile(regexp, re.IGNORECASE)
>
> text="""
> Now is better than never.
> Although never is often better than *right* now."""
>
> z=[s.lower() for s in c.findall(text)]
>
> for d in set([s for s in z if z.count(s)>1]):
>     print d,
> # ------------------
>
> but I'm in search of "plain" re code.

You could use a look-ahead assertion with a captured group:

>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>> c.findall(text)

But note that this is computationally expensive. The regex that you
posted is probably more efficient if you use a collections.Counter
object instead of z.count.

Cheers,
Ian
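The Counter suggestion above could look roughly like the following. This is only a sketch, assuming Python 3; the names `words`, `counts` and `duplicates` are illustrative and do not come from the thread.

```python
import re
from collections import Counter

text = """
Now is better than never.
Although never is often better than *right* now."""

# Same tokenisation as the original post: every run of word characters,
# lower-cased so that "Now" and "now" count as the same word.
words = [w.lower() for w in re.findall(r"\w+", text)]

# Counter tallies all words in a single pass, whereas calling z.count(s)
# inside the comprehension rescans the whole list for every word.
counts = Counter(words)

# Keep only the words seen more than once; sorted for a stable order.
duplicates = sorted(w for w, n in counts.items() if n > 1)
print(" ".join(duplicates))  # better is never now than
```

The regular expression still only splits the text into words; the counting itself is done by Counter.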
| From | candide <candide@free.invalid> |
|---|---|
| Date | 2011-04-02 15:18 +0200 |
| Message-ID | <4d972274$0$4785$426a74cc@news.free.fr> |
| In reply to | #2412 |
On 02/04/2011 00:42, Ian Kelly wrote:
> You could use a look-ahead assertion with a captured group:
>
>>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>>> c.findall(text)

It works fine; lookahead assertions in action are exactly what I was
looking for. Many thanks.
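For reference, here is the look-ahead pattern from the thread run as a complete script on the sample text. It is a sketch assuming Python 3; the names `pattern`, `found` and `duplicates` are illustrative.

```python
import re

text = """
Now is better than never.
Although never is often better than *right* now."""

# Pattern from the thread: a whole word (group "dup") that is followed,
# somewhere later in the text, by the same whole word. DOTALL lets ".+"
# cross line breaks, and IGNORECASE makes the backreference match "now"
# against "Now".
pattern = re.compile(r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)',
                     re.IGNORECASE | re.DOTALL)

# findall returns the captured word for every occurrence that has a later
# repeat, in its original casing.
found = pattern.findall(text)
print(found)  # ['Now', 'is', 'better', 'than', 'never']

# A word occurring three times would be reported twice, so lower-case the
# matches and collect them into a set to list each duplicate once.
duplicates = sorted({w.lower() for w in found})
print(duplicates)  # ['better', 'is', 'never', 'now', 'than']
```

Because the look-ahead rescans the rest of the text for each candidate word, the cost grows quickly on long inputs, which is the expense mentioned earlier in the thread.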