Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <CAPV1RAAiD3qWqrAJYc4yb80CaHG4A9N4jz4aY_CsyJr_nAUx9Q@mail.gmail.com>
References: <CAPV1RAAiD3qWqrAJYc4yb80CaHG4A9N4jz4aY_CsyJr_nAUx9Q@mail.gmail.com>
Date: Mon, 20 Jan 2014 22:10:46 +1100
Subject: Re: regex multiple patterns in order
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.5747.1390216250.18130.python-list@python.org>
Lines: 23
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:64355

On Mon, Jan 20, 2014 at 9:44 PM, km <srikrishnamohan@gmail.com> wrote:
>>>> p = re.compile('(CAA)+?(TCT)+?(TA)+?')
>>>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
> [('CAA', 'TCT', 'TA')]
>
> But I find only one instance of CAA/TCT/TA, in that order.
> How can I get 3 matches of CAA, followed by four matches of TCT, followed by
> 2 matches of TA?
> These patterns (CAA/TCT/TA) can each occur any number of times, and at least
> once, so I have to use + in the regex.

You're capturing a single instance, not the whole repeated run. The
pattern does match all three CAA units, but a repeated capturing group
keeps only one repetition (the last one it matched). Try this:

>>> p = re.compile('((?:CAA)+)((?:TCT)+)((?:TA)+)')
>>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
[('CAACAACAA', 'TCTTCTTCTTCT', 'TATATA')]

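This groups "CAA" with non-capturing parentheses (?:regex), applies the +
to that group, and wraps the whole repetition in an outer capturing group.
I've also dropped the non-greedy ? markers; with +? the final group would
stop after a single TA.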
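
If what you actually want is the number of repeats rather than the
matched text, one way is to divide the length of each captured run by
the length of its unit. This is only a rough sketch: the names UNITS
and count_repeats are mine, not anything from the re module, and it
assumes the units always appear in this fixed order.

```python
import re

# Hypothetical names for illustration; only the re calls are standard.
UNITS = ("CAA", "TCT", "TA")

# Builds '((?:CAA)+)((?:TCT)+)((?:TA)+)', the same pattern as above.
pattern = re.compile("".join("((?:%s)+)" % unit for unit in UNITS))

def count_repeats(seq):
    """Return [(unit, repeat_count), ...] for the first match, or None."""
    m = pattern.search(seq)
    if m is None:
        return None
    # Each captured run is the unit repeated, so the lengths divide evenly.
    return [(unit, len(run) // len(unit)) for unit, run in zip(UNITS, m.groups())]

print(count_repeats('CAACAACAATCTTCTTCTTCTTATATA'))
```

On your sample string that prints [('CAA', 3), ('TCT', 4), ('TA', 3)],
since the string ends in three TA units.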

ChrisA