| References | <4d963bfa$0$1584$426a34cc@news.free.fr> |
|---|---|
| From | Ian Kelly <ian.g.kelly@gmail.com> |
| Date | 2011-04-01 16:42 -0600 |
| Subject | Re: Extracting repeated words |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.108.1301697810.2990.python-list@python.org> |

On Fri, Apr 1, 2011 at 2:54 PM, candide <candide@free.invalid> wrote:
> Another question relative to regular expressions.
>
> How to extract all word duplicates in a given text by use of regular
> expression methods ? To make the question concrete, if the text is
>
> ------------------
> Now is better than never.
> Although never is often better than *right* now.
> ------------------
>
> duplicates are :
>
> ------------------------
> better is now than never
> ------------------------
>
> Some code can solve the question, for instance
>
> # ------------------
> import re
>
> regexp=r"\w+"
>
> c=re.compile(regexp, re.IGNORECASE)
>
> text="""
> Now is better than never.
> Although never is often better than *right* now."""
>
> z=[s.lower() for s in c.findall(text)]
>
> for d in set([s for s in z if z.count(s)>1]):
>     print d,
> # ------------------
>
> but I'm in search of "plain" re code.

You could use a look-ahead assertion with a captured group:

>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>> c.findall(text)

But note that this is computationally expensive.  The regex that you
posted is probably more efficient if you use a collections.Counter
object instead of z.count.

Cheers,
Ian
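
The collections.Counter suggestion could look roughly like the sketch below. It is only an illustration, not code from the thread: it assumes Python 3 (the quoted code uses the Python 2 print statement), and the function name repeated_words is made up for the example. The point is that Counter tallies every word in a single pass, whereas calling z.count(s) for each word rescans the whole list every time.

```python
import re
from collections import Counter


def repeated_words(text):
    """Return the words that occur more than once in text, lowercased and sorted.

    repeated_words is a hypothetical helper name chosen for this sketch.
    """
    # Same tokenization as the quoted code: runs of \w characters, lowercased.
    words = [w.lower() for w in re.findall(r"\w+", text)]
    # Counter builds the word -> occurrence count mapping in one pass.
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n > 1)


text = """
Now is better than never.
Although never is often better than *right* now."""

print(" ".join(repeated_words(text)))  # prints: better is never now than
```

Sorting is only there to make the output order predictable; drop it if the order does not matter.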
Extracting repeated words candide <candide@free.invalid> - 2011-04-01 22:54 +0200
Re: Extracting repeated words Ian Kelly <ian.g.kelly@gmail.com> - 2011-04-01 16:42 -0600
Re: Extracting repeated words candide <candide@free.invalid> - 2011-04-02 15:18 +0200