Re: regex multiple patterns in order

Started by	Ben Finney <ben+python@benfinney.id.au>
First post	2014-01-20 22:18 +1100
Last post	2014-01-20 17:33 +0000
Articles	9 — 6 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: regex multiple patterns in order Ben Finney <ben+python@benfinney.id.au> - 2014-01-20 22:18 +1100
    Re: regex multiple patterns in order Roy Smith <roy@panix.com> - 2014-01-20 09:52 -0500
      Re: regex multiple patterns in order Neil Cerutti <neilc@norwich.edu> - 2014-01-20 16:04 +0000
      Re: regex multiple patterns in order Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-20 16:16 +0000
      Re: regex multiple patterns in order Devin Jeanpierre <jeanpierreda@gmail.com> - 2014-01-20 08:40 -0800
        Re: regex multiple patterns in order Rustom Mody <rustompmody@gmail.com> - 2014-01-20 09:06 -0800
          Re: regex multiple patterns in order Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-20 17:30 +0000
      Re: regex multiple patterns in order Neil Cerutti <neilc@norwich.edu> - 2014-01-20 17:09 +0000
      Re: regex multiple patterns in order Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-20 17:33 +0000

#64357 — Re: regex multiple patterns in order

From	Ben Finney <ben+python@benfinney.id.au>
Date	2014-01-20 22:18 +1100
Subject	Re: regex multiple patterns in order
Message-ID	<mailman.5748.1390216721.18130.python-list@python.org>

km <srikrishnamohan@gmail.com> writes:

> I am trying to find sub sequence patterns but constrained by the order
> in which they occur

There are also specific resources for understanding and testing regex
patterns, such as <URL:http://www.pythonregex.com/>.

> For example
>
> >>> p = re.compile('(CAA)+?(TCT)+?(TA)+?')
> >>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
> [('CAA', 'TCT', 'TA')]
>
> But I instead find only one instance of the CAA/TCT/TA in that order.

Yes, because the grouping operator (the parens ‘()’) in each case
contains exactly “CAA”, “TCT”, “TA”. If you want the repetitions to be
part of the group, you need the repetition operator (in your case, ‘+’)
to be part of the group.
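
A minimal sketch of that difference, using only the standard re module and
the sequence from the question (the variable names here are illustrative,
not from the original post). With the quantifier outside the parens, the
group keeps only the last repetition; with it inside, the group keeps the
whole run:

```python
import re

seq = "CAACAACAATCTTCTTCTTCTTATATA"

# Quantifier outside the group: the group is overwritten on each repetition.
outside = re.match(r"(CAA)+", seq)

# Quantifier inside the capturing group: the group spans every repetition.
inside = re.match(r"((?:CAA)+)", seq)

print(outside.group(0))  # the whole match
print(outside.group(1))  # only the last repetition of the group
print(inside.group(1))   # the full run of repetitions
```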

> How can I get 3 matches of CAA, followed by  four matches of TCT followed
> by 2 matches of TA ?

With a little experimenting I get:

    >>> p = re.compile('((?:CAA)+)?((?:TCT)+)?((?:TA)+)?')
    >>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
    [('CAACAACAA', 'TCTTCTTCTTCT', 'TATATA'), ('', '', '')]

Remember that you'll get no more than one group returned for each group
you specify in the pattern.
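
If what you actually want is the number of repetitions, one possible
approach (a sketch only; count_repeats and UNITS are hypothetical names,
and it assumes the three units always occur in this order, each at least
once) is to capture each run as a single group and divide its length by
the length of the unit:

```python
import re

# The repeat units from the question, in the order they must occur.
UNITS = ("CAA", "TCT", "TA")

# Builds '((?:CAA)+)((?:TCT)+)((?:TA)+)': one capturing group per run.
PATTERN = re.compile("".join("((?:%s)+)" % re.escape(u) for u in UNITS))


def count_repeats(seq):
    """Return the repeat count of each unit, or None if seq does not match."""
    m = PATTERN.match(seq)
    if m is None:
        return None
    return [len(run) // len(unit) for run, unit in zip(m.groups(), UNITS)]


print(count_repeats("CAACAACAATCTTCTTCTTCTTATATA"))
```

For the example string this reports 3, 4 and 3 repetitions.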

> Well these patterns (CAA/TCT/TA) can occur any number of times and
> atleast once so I have to use + in the regex.

Be aware that regex is not the solution to all parsing problems; for
many parsing problems it is an attractive but inappropriate tool. You
may need to construct a more specific parser for your needs. Even if
it's possible with regex, the resulting pattern may be so complex that
it's better to write it out more explicitly.

-- 
 \     “To punish me for my contempt of authority, Fate has made me an |
  `\                   authority myself.” —Albert Einstein, 1930-09-18 |
_o__)                                                                  |
Ben Finney

#64361

From	Roy Smith <roy@panix.com>
Date	2014-01-20 09:52 -0500
Message-ID	<roy-C6EC2C.09525920012014@news.panix.com>
In reply to	#64357

In article <mailman.5748.1390216721.18130.python-list@python.org>,
 Ben Finney <ben+python@benfinney.id.au> wrote:

> With a little experimenting I get:
> 
>     >>> p = re.compile('((?:CAA)+)?((?:TCT)+)?((?:TA)+)?')
>     >>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
>     [('CAACAACAA', 'TCTTCTTCTTCT', 'TATATA'), ('', '', '')]

Perhaps a matter of style, but I would have left off the ?: markers and 
done this:

import re

p = re.compile('((CAA)+)((TCT)+)((TA)+)')
m = p.match('CAACAACAATCTTCTTCTTCTTATATA')
print(m.groups())  # the print() form runs on Python 2 and 3

$ python r.py
('CAACAACAA', 'CAA', 'TCTTCTTCTTCT', 'TCT', 'TATATA', 'TA')

The ?: says, "match this group, but don't save it".  The advantage of 
that is you don't get unwanted groups in your match object.  The 
disadvantage is they make the pattern more difficult to read.  My 
personal opinion is I'd rather make the pattern easier to read and just 
ignore the extra matches in the output (in this case, I want items 0, 
2, and 4 of the groups() tuple, i.e. groups 1, 3, and 5).
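
A short sketch of the two styles side by side, assuming the same input
string (the names plain and quiet are just labels for this illustration).
Either way the same three runs come out; the only difference is whether
the extra inner groups have to be skipped:

```python
import re

seq = "CAACAACAATCTTCTTCTTCTTATATA"

plain = re.compile(r"((CAA)+)((TCT)+)((TA)+)")        # easier to read
quiet = re.compile(r"((?:CAA)+)((?:TCT)+)((?:TA)+)")  # no extra groups

m = plain.match(seq)

# Group numbers start at 1, so the full runs are groups 1, 3 and 5,
# which are items 0, 2 and 4 of the tuple returned by groups().
runs_plain = m.group(1, 3, 5)
runs_quiet = quiet.match(seq).groups()

print(runs_plain)
print(runs_plain == runs_quiet)
```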

I also left off the outer ?s, because I think this better represents the 
intent.  The pattern '((CAA)+)?((TCT)+)?((TA)+)?' matches, for example, 
an empty string; I suspect that's not what was intended.
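
To illustrate the empty-string point, here is a small sketch comparing the
two forms (the names optional and required are only labels for this
example). The all-optional pattern succeeds on an empty string, which is
also where the trailing ('', '', '') in the findall output comes from:

```python
import re

seq = "CAACAACAATCTTCTTCTTCTTATATA"

optional = re.compile(r"((?:CAA)+)?((?:TCT)+)?((?:TA)+)?")
required = re.compile(r"((?:CAA)+)((?:TCT)+)((?:TA)+)")

# Every part of the optional pattern may be absent, so it matches "".
print(optional.match("") is not None)
print(required.match("") is not None)

# findall() therefore also reports an empty match at the end of the string.
print(optional.findall(seq))
print(required.findall(seq))
```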

> Be aware that regex is not the solution to all parsing problems; for
> many parsing problems it is an attractive but inappropriate tool. You
> may need to construct a more specific parser for your needs. Even if
> it's possible with regex, the resulting pattern may be so complex that
> it's better to write it out more explicitly.

Oh, posh.

You are correct; regex is not the solution to all parsing problems, but 
it is a powerful tool which people should be encouraged to learn.  For 
some problems, it is indeed the correct tool, and this seems like one of 
them.  Discouraging people from learning about regexes is an educational 
anti-pattern which I see distressingly often on this newsgroup.

Several lives ago, I worked in a molecular biology lab writing programs 
to analyze DNA sequences.  Here's a common problem: "Find all the places 
where GACGTC or TTCGAA (or any of a similar set of 100 or so short 
patterns) appear".  I can't think of an easier way to represent that in 
code than a regex.

Sure, it'll be a huge regex, which may take a long time to compile, but 
one of the nice things about these sorts of problems is that the 
patterns you are looking for tend not to change very often.  For 
example, the problem I mention in the preceding paragraph is finding 
restriction sites, i.e. the locations where restriction enzymes will cut 
a strand of DNA.  There's a finite set of commercially available 
restriction enzymes, and that list doesn't change from month to month 
(at this point, maybe even from year to year).
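
As a rough sketch of that idea with the stock re module (not the custom
compiler described below; SITES, SITE_RE and find_sites are hypothetical
names, and only the two sites named above are listed), the alternation can
be built once from a list and then reused for every sequence:

```python
import re

# Only the two sites mentioned above; a real list would be much longer.
SITES = ["GACGTC", "TTCGAA"]

# Compile once and reuse. Longer sites go first so that, if one site is a
# prefix of another, the longer one is tried before the shorter one.
SITE_RE = re.compile(
    "|".join(re.escape(s) for s in sorted(SITES, key=len, reverse=True))
)


def find_sites(dna):
    """Return (position, site) pairs for each non-overlapping match."""
    return [(m.start(), m.group()) for m in SITE_RE.finditer(dna)]


print(find_sites("AAGACGTCTTTTCGAAGG"))
```

Note that finditer() only reports non-overlapping matches, so sites that
overlap one another would need a different approach.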

For more details, see 
http://bioinformatics.oxfordjournals.org/content/4/4/459.abstract

Executive summary: I wrote my own regex compiler which was optimized for 
the types of patterns this problem required.

#64362

From	Neil Cerutti <neilc@norwich.edu>
Date	2014-01-20 16:04 +0000
Message-ID	<mailman.5750.1390233911.18130.python-list@python.org>
In reply to	#64361

On 2014-01-20, Roy Smith <roy@panix.com> wrote:
> In article
> <mailman.5748.1390216721.18130.python-list@python.org>, Ben
> Finney <ben+python@benfinney.id.au> wrote:
>> Be aware that regex is not the solution to all parsing
>> problems; for many parsing problems it is an attractive but
>> inappropriate tool. You may need to construct a more specific
>> parser for your needs. Even if it's possible with regex, the
>> resulting pattern may be so complex that it's better to write
>> it out more explicitly.
>
> Oh, posh.
>
> You are correct; regex is not the solution to all parsing
> problems, but it is a powerful tool which people should be
> encouraged to learn.  For some problems, it is indeed the
> correct tool, and this seems like one of them.  Discouraging
> people from learning about regexes is an educational
> anti-pattern which I see distressingly often on this newsgroup.

I use regular expressions regularly, for example, when editing
text with gvim. But when I want to use them in Python I have to
contend with the re module. I've never become comfortable with
it.

-- 
Neil Cerutti

#64364

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-01-20 16:16 +0000
Message-ID	<mailman.5751.1390234616.18130.python-list@python.org>
In reply to	#64361

On 20/01/2014 16:04, Neil Cerutti wrote:
> On 2014-01-20, Roy Smith <roy@panix.com> wrote:
>> In article
>> <mailman.5748.1390216721.18130.python-list@python.org>, Ben
>> Finney <ben+python@benfinney.id.au> wrote:
>>> Be aware that regex is not the solution to all parsing
>>> problems; for many parsing problems it is an attractive but
>>> inappropriate tool. You may need to construct a more specific
>>> parser for your needs. Even if it's possible with regex, the
>>> resulting pattern may be so complex that it's better to write
>>> it out more explicitly.
>>
>> Oh, posh.
>>
>> You are correct; regex is not the solution to all parsing
>> problems, but it is a powerful tool which people should be
>> encouraged to learn.  For some problems, it is indeed the
>> correct tool, and this seems like one of them.  Discouraging
>> people from learning about regexes is an educational
>> anti-pattern which I see distressingly often on this newsgroup.
>
> I use regular expressions regularly, for example, when editing
> text with gvim. But when I want to use them in Python I have to
> contend with the re module. I've never become comfortable with
> it.
>

You don't have to, there's always the "new" regex module that's been on 
pypi for years.  Or are you saying that you'd like to use regex but 
other influences that are outside of your sphere of control prevent you 
from doing so?

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence

#64365

From	Devin Jeanpierre <jeanpierreda@gmail.com>
Date	2014-01-20 08:40 -0800
Message-ID	<mailman.5752.1390236075.18130.python-list@python.org>
In reply to	#64361

On Mon, Jan 20, 2014 at 8:16 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> On 20/01/2014 16:04, Neil Cerutti wrote:
>> I use regular expressions regularly, for example, when editing
>> text with gvim. But when I want to use them in Python I have to
>> contend with the re module. I've never become comfortable with
>> it.
>>
>
> You don't have to, there's always the "new" regex module that's been on pypi
> for years.  Or are you saying that you'd like to use regex but other
> influences that are outside of your sphere of control prevent you from doing
> so?

I don't see any way in which someone uncomfortable with the re module
would magically find themselves perfectly at home with the regex
module. The regex module is the re module with some extra features
(and complexity), is it not?

-- Devin

#64366

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-01-20 09:06 -0800
Message-ID	<00b9351f-588f-4aa1-9251-e33748f45532@googlegroups.com>
In reply to	#64365

On Monday, January 20, 2014 10:10:32 PM UTC+5:30, Devin Jeanpierre wrote:
> On Mon, Jan 20, 2014 at 8:16 AM, Mark Lawrence wrote:
> > On 20/01/2014 16:04, Neil Cerutti wrote:
> >> I use regular expressions regularly, for example, when editing
> >> text with gvim. But when I want to use them in Python I have to
> >> contend with the re module. I've never become comfortable with
> >> it.
> > You don't have to, there's always the "new" regex module that's been on pypi
> > for years.  Or are you saying that you'd like to use regex but other
> > influences that are outside of your sphere of control prevent you from doing
> > so?

> I don't see any way in which someone uncomfortable with the re module
> would magically find themselves perfectly at home with the regex
> module. The regex module is the re module with some extra features
> (and complexity), is it not?

I wonder whether the re/regex modules are at fault?
Or is it that, in a manual whose readability is otherwise exemplary, the re pages
are a bit painful?

e.g. reading http://docs.python.org/2/library/re.html#module-contents
the first thing one reads is compile

#64368

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-01-20 17:30 +0000
Message-ID	<mailman.5754.1390239060.18130.python-list@python.org>
In reply to	#64366

On 20/01/2014 17:06, Rustom Mody wrote:
> On Monday, January 20, 2014 10:10:32 PM UTC+5:30, Devin Jeanpierre wrote:
>> On Mon, Jan 20, 2014 at 8:16 AM, Mark Lawrence wrote:
>>> On 20/01/2014 16:04, Neil Cerutti wrote:
>>>> I use regular expressions regularly, for example, when editing
>>>> text with gvim. But when I want to use them in Python I have to
>>>> contend with the re module. I've never become comfortable with
>>>> it.
>>> You don't have to, there's always the "new" regex module that's been on pypi
>>> for years.  Or are you saying that you'd like to use regex but other
>>> influences that are outside of your sphere of control prevent you from doing
>>> so?
>
>> I don't see any way in which someone uncomfortable with the re module
>> would magically find themselves perfectly at home with the regex
>> module. The regex module is the re module with some extra features
>> (and complexity), is it not?
>
> I wonder whether the re/regex modules are at fault?
> Or is it that in a manual whose readability is otherwise exemplary the re pages
> are a bit painful
>
> eg reading http://docs.python.org/2/library/re.html#module-contents
> the first thing one reads is compile
>

http://docs.python.org/3/library/re.html gives "re — Regular expression 
operations" and 
http://docs.python.org/3/library/re.html#regular-expression-syntax gives 
"Regular Expression Syntax".  Are you saying that the module contents 
should come before both of these?

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence

#64367

From	Neil Cerutti <neilc@norwich.edu>
Date	2014-01-20 17:09 +0000
Message-ID	<mailman.5753.1390237796.18130.python-list@python.org>
In reply to	#64361

On 2014-01-20, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
> On Mon, Jan 20, 2014 at 8:16 AM, Mark Lawrence
> <breamoreboy@yahoo.co.uk> wrote:
>> On 20/01/2014 16:04, Neil Cerutti wrote:
>>> I use regular expressions regularly, for example, when
>>> editing text with gvim. But when I want to use them in Python
>>> I have to contend with the re module. I've never become
>>> comfortable with it.
>>
>> You don't have to, there's always the "new" regex module
>> that's been on pypi for years.  Or are you saying that you'd
>> like to use regex but other influences that are outside of
>> your sphere of control prevent you from doing so?
>
> I don't see any way in which someone uncomfortable with the re
> module would magically find themselves perfectly at home with
> the regex module. The regex module is the re module with some
> extra features (and complexity), is it not?

It's a negative feedback loop. I'd have to use it more often than
I do to get comfortable. There's no way a library, even a really
good one, can compete with built-in syntax support. The BDFL must
have wanted it to be this way.

-- 
Neil Cerutti

#64369

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2014-01-20 17:33 +0000
Message-ID	<mailman.5755.1390239306.18130.python-list@python.org>
In reply to	#64361

On 20/01/2014 17:09, Neil Cerutti wrote:
> On 2014-01-20, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
>> On Mon, Jan 20, 2014 at 8:16 AM, Mark Lawrence
>> <breamoreboy@yahoo.co.uk> wrote:
>>> On 20/01/2014 16:04, Neil Cerutti wrote:
>>>> I use regular expressions regularly, for example, when
>>>> editing text with gvim. But when I want to use them in Python
>>>> I have to contend with the re module. I've never become
>>>> comfortable with it.
>>>
>>> You don't have to, there's always the "new" regex module
>>> that's been on pypi for years.  Or are you saying that you'd
>>> like to use regex but other influences that are outside of
>>> your sphere of control prevent you from doing so?
>>
>> I don't see any way in which someone uncomfortable with the re
>> module would magically find themselves perfectly at home with
>> the regex module. The regex module is the re module with some
>> extra features (and complexity), is it not?
>
> It's a negative feedback loop. I'd have to use it more often than
> I do to get comfortable. There's no way a library, even a really
> good one, can compete with built-in syntax support. The BDFL must
> have wanted it to be this way.
>

Regex was originally scheduled to go into 3.3 and then 3.4, but it has not 
made it.  I assume that it will again be targeted in the 3.5 release 
schedule. Three strikes and you're out is a BDFL plan?

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence
