
How to decipher :re.split(r"(\(\([^)]+\)\))" in the example

Started by	fl <rxjwg98@gmail.com>
First post	2014-07-10 08:37 -0700
Last post	2014-07-11 08:18 -0700
Articles	13 — 11 participants


  How to decipher :re.split(r"(\(\([^)]+\)\))" in the example fl <rxjwg98@gmail.com> - 2014-07-10 08:37 -0700
    Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Peter Otten <__peter__@web.de> - 2014-07-10 18:49 +0200
    Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example MRAB <python@mrabarnett.plus.com> - 2014-07-10 18:01 +0100
    Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Joel Goldstick <joel.goldstick@gmail.com> - 2014-07-10 13:05 -0400
    Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Albert-Jan Roskam <fomcl@yahoo.com> - 2014-07-10 12:15 -0700
    Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Cameron Simpson <cs@zip.com.au> - 2014-07-11 11:29 +1000
      Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Roy Smith <roy@panix.com> - 2014-07-10 22:18 -0400
        Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Tim Chase <python.list@tim.thechases.com> - 2014-07-10 21:37 -0500
          Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Roy Smith <roy@panix.com> - 2014-07-10 23:33 -0400
            Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Chris Angelico <rosuav@gmail.com> - 2014-07-11 14:31 +1000
            Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example alister <alister.nospam.ware@ntlworld.com> - 2014-07-11 08:00 +0000
            Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Steven D'Aprano <steve@pearwood.info> - 2014-07-11 09:04 +0000
              Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Albert-Jan Roskam <fomcl@yahoo.com> - 2014-07-11 08:18 -0700

#74312 - How to decipher :re.split(r"(\(\([^)]+\)\))" in the example

From	fl <rxjwg98@gmail.com>
Date	2014-07-10 08:37 -0700
Subject	How to decipher :re.split(r"(\(\([^)]+\)\))" in the example
Message-ID	<981c1f5f-2c19-4efc-8397-796bde07f39b@googlegroups.com>

Hi,

This example is from the link:

https://wiki.python.org/moin/RegularExpression


I have thought about it quite a while without a clue yet. I notice that it uses
double quote ", in contrast to ' which I see more often until now.
It looks very complicated to me. Could you simplify it to a simple example?


Thanks,





import re
split_up = re.split(r"(\(\([^)]+\)\))",
                    "This is a ((test)) of the ((emergency broadcasting station.))")


...which produces:


["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]


#74316

From	Peter Otten <__peter__@web.de>
Date	2014-07-10 18:49 +0200
Message-ID	<mailman.11733.1405010988.18130.python-list@python.org>
In reply to	#74312

fl wrote:

> Hi,
> 
> This example is from the link:
> 
> https://wiki.python.org/moin/RegularExpression
> 
> 
> I have thought about it quite a while without a clue yet. I notice that it
> uses double quote ", in contrast to ' which I see more often until now.
> It looks very complicated to me. Could you simplified it to a simple
> example?

Just break it into its components.

"(...)" in the context of re.split() keeps the delimiters while just "..." 
does not. Example:

>>> re.split("a+", "abbaaababa")
['', 'bb', 'b', 'b', '']
>>> re.split("(a+)", "abbaaababa")
['', 'a', 'bb', 'aaa', 'b', 'a', 'b', 'a', '']

r"\(" matches the opening parenthesis. The "(" has to be escaped because
it otherwise has a special meaning (begin group) in a regex.

"[abc]" matches a, b, or c. A leading ^ inverts the set, so "[^abc]" matches 
anything but a, b, or c. Therefore "[^)]" matches anything but the closing 
parenthesis.

The complete regex then is: match two opening parens, then one or more chars 
that are not closing parens, then two closing parens, and make the complete 
group part of the resulting list.
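
To check each component on its own, here is a minimal sketch using re.findall. The sample string is the one from the original post; the short "ab)cd))ef" string and the variable names are made up for illustration only.

```python
import re

text = "This is a ((test)) of the ((emergency broadcasting station.))"

# An escaped paren is a literal character; the text contains four "(".
open_parens = re.findall(r"\(", text)

# "[^)]+" matches maximal runs of characters that are not ")".
runs = re.findall(r"[^)]+", "ab)cd))ef")

# Two "(", then one or more non-")" characters, then two ")".
matches = re.findall(r"\(\([^)]+\)\)", text)

print(open_parens)  # ['(', '(', '(', '(']
print(runs)         # ['ab', 'cd', 'ef']
print(matches)      # ['((test))', '((emergency broadcasting station.))']
```

With no capture group in the pattern, re.findall returns the whole matches, which makes it handy for trying out one piece of a regex at a time.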

PS: Note that sometimes the re.DEBUG flag may be helpful in understanding
noisy regexes:

>>> re.compile(r"(\(\([^)]+\)\))", re.DEBUG)
subpattern 1 
  literal 40 
  literal 40 
  max_repeat 1 4294967295 
    not_literal 41 
  literal 41 
  literal 41 
<_sre.SRE_Pattern object at 0x7f5740455c90>

> import re
> split_up = re.split(r"(\(\([^)]+\)\))",
>                     "This is a ((test)) of the ((emergency broadcasting station.))")
> 
> 
> ...which produces:
> 
> 
> ["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]


#74318

From	MRAB <python@mrabarnett.plus.com>
Date	2014-07-10 18:01 +0100
Message-ID	<mailman.11735.1405011682.18130.python-list@python.org>
In reply to	#74312

On 2014-07-10 16:37, fl wrote:
> Hi,
>
> This example is from the link:
>
> https://wiki.python.org/moin/RegularExpression
>
>
> I have thought about it quite a while without a clue yet. I notice that it uses
> double quote ", in contrast to ' which I see more often until now.
> It looks very complicated to me. Could you simplified it to a simple example?
>
>
> Thanks,
>
>
>
>
>
> import re
> split_up = re.split(r"(\(\([^)]+\)\))",
>                      "This is a ((test)) of the ((emergency broadcasting station.))")
>
>
> ...which produces:
>
>
> ["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]
>
No it doesn't; you've omitted the final string.

The regex means:

(        Start of capture group.
\(       Literal "(".
\(       Literal "(".
[^)]+    One or more repeats of any character except a literal ")".
\)       Literal ")".
\)       Literal ")".
)        End of capture group.

.split returns a list of the parts of the string between the matches, 
and if, as in this example, there are capture groups, then those too:

[
'This is a ',                             # The part before the first
                                           # match.
'((test))',                               # The first match (group 1).
' of the ',                               # The part between the first
                                           # and second matches.
'((emergency broadcasting station.))',    # The second match.
''                                        # The part after the second
                                           # match.
]
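
A short sketch to confirm the five-element result, including the trailing empty string. The names `parts` and `non_empty` are illustrative, and filtering out empty strings is just one possible way to deal with the final '' if it is unwanted.

```python
import re

text = "This is a ((test)) of the ((emergency broadcasting station.))"

# The string ends with a match, so the part after it is an empty string.
parts = re.split(r"(\(\([^)]+\)\))", text)
print(parts)
# ['This is a ', '((test))', ' of the ', '((emergency broadcasting station.))', '']

# Optionally drop empty strings from the result.
non_empty = [p for p in parts if p]
print(len(parts), len(non_empty))  # 5 4
```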


#74320

From	Joel Goldstick <joel.goldstick@gmail.com>
Date	2014-07-10 13:05 -0400
Message-ID	<mailman.11736.1405011952.18130.python-list@python.org>
In reply to	#74312

On Thu, Jul 10, 2014 at 11:37 AM, fl <rxjwg98@gmail.com> wrote:
> Hi,
>
> This example is from the link:
>
> https://wiki.python.org/moin/RegularExpression
>
>
> I have thought about it quite a while without a clue yet. I notice that it uses
> double quote ", in contrast to ' which I see more often until now.

Double quotes or single quotes -- doesn't matter.

> It looks very complicated to me. Could you simplified it to a simple example?
>
You might read up first here: https://docs.python.org/2/library/re.html

If you are just new to learning python, regular expressions are not a
good place to start.  But if you insist, the page you are looking at
is more of a cheat sheet .

Try the python docs, and tutorial first.  Or google.


>
> Thanks,
>
>
>
>
>
> import re
> split_up = re.split(r"(\(\([^)]+\)\))",
>                     "This is a ((test)) of the ((emergency broadcasting station.))")
>
>
The outer parens are for grouping.  I'm not good at regexes but it
looks like it wants two open parens followed by one or more
characters that are anything but a close paren, followed by two close
parens.  So whenever it finds that pattern it splits off what is on
either side of it.


> ...which produces:
>
>
> ["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]



-- 
Joel Goldstick
http://joelgoldstick.com


#74330

From	Albert-Jan Roskam <fomcl@yahoo.com>
Date	2014-07-10 12:15 -0700
Message-ID	<mailman.11743.1405020017.18130.python-list@python.org>
In reply to	#74312


----- Original Message -----

> From: Joel Goldstick <joel.goldstick@gmail.com>
> To: fl <rxjwg98@gmail.com>
> Cc: "python-list@python.org" <python-list@python.org>
> Sent: Thursday, July 10, 2014 7:05 PM
> Subject: Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example
> 
> On Thu, Jul 10, 2014 at 11:37 AM, fl <rxjwg98@gmail.com> wrote:
>>  Hi,
>> 
>>  This example is from the link:
>> 
>>  https://wiki.python.org/moin/RegularExpression
>> 
>> 
>>  I have thought about it quite a while without a clue yet. I notice that it uses
>>  double quote ", in contrast to ' which I see more often until now.
> 
> Double quotes or single quotes -- doesn't matter.
> 
>>  It looks very complicated to me. Could you simplify it to a simple example?
>> 
> You might read up first here: https://docs.python.org/2/library/re.html
> 
> If you are just new to learning python, regular expressions are not a
> good place to start.  But if you insist, the page you are looking at
> is more of a cheat sheet .

The free sample chapter from Mark Summerfield's book is about regular expressions:
http://www.informit.com/content/images/9780321680563/samplepages/0321680561_Sample.pdf
That whole book is superb, and the regex chapter is no exception.


#74337

From	Cameron Simpson <cs@zip.com.au>
Date	2014-07-11 11:29 +1000
Message-ID	<mailman.11746.1405042179.18130.python-list@python.org>
In reply to	#74312

On 10Jul2014 08:37, fl <rxjwg98@gmail.com> wrote:
>This example is from the link:
>
>https://wiki.python.org/moin/RegularExpression
>
>I have thought about it quite a while without a clue yet.
>I notice that it uses
>double quote ", in contrast to ' which I see more often until now.

With raw strings (r', r") this doesn't matter. I tend to use r' myself.

You want raw strings with regular expressions because otherwise their heavy use 
of sloshes "\" overlap with Python's use of sloshes, making everything harder.
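
A small sketch of the difference, assuming nothing beyond the standard re module. The "cat" strings and variable names are made up for illustration.

```python
import re

# Quote style makes no difference; both literals are the same string.
assert r"(\(\()" == r'(\(\()'

# A raw string keeps the backslash, so r"\(" is the two characters \ and (.
assert r"\(" == "\\("

# Without the r prefix Python handles some escapes itself:
# "\b" is a single backspace character, r"\b" is backslash plus b.
assert len("\b") == 1 and len(r"\b") == 2

# So only the raw version reaches the regex engine as a word boundary.
raw_hits = re.findall(r"\bcat\b", "cat concat cat")
plain_hits = re.findall("\bcat\b", "cat concat cat")
print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # []
```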

>It looks very complicated to me. Could you simplified it to a simple example?
>
>import re
>split_up = re.split(r"(\(\([^)]+\)\))",
>                    "This is a ((test)) of the ((emergency broadcasting station.))")
>
>...which produces:
>
>["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]

Rip off the python punctuation and get the regexp itself:

   (\(\([^)]+\)\))

then start from the inside out:

   [^)]  Any character except a closing bracket.
   +     One or more of the preceding.

Therefore:

   [^)]+ One or more characters which are not closing brackets.
         Also phrased: at least one character which is not a closing bracket.

Outside this are \( and \): these are literal opening and closing bracket 
characters. So:

   \(\([^)]+\)\)
         Two opening brackets, then at least one character which is not a 
         closing bracket, then two closing brackets.

The outermost ( and ) are regexp grouping brackets, not text. For matching
alone you don't need them; they mark out the regexp between them for later
reference or for use with a repeating modifier like ?, * or +. With re.split,
though, they do matter: text matched by a capture group is included in the
result list, which is why the ((...)) pieces appear in the output.

Given the above inside-to-out explanation, does that explain the re.split
result for you?
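
One thing worth checking by hand is what the outermost grouping brackets change in re.split specifically. A minimal sketch, using the sample string from the original post; the variable names and the (?:...) non-capturing variant are additions for comparison, not part of the wiki example.

```python
import re

text = "This is a ((test)) of the ((emergency broadcasting station.))"

# Capture group: the matched delimiters are kept in the result.
with_group = re.split(r"(\(\([^)]+\)\))", text)

# No group, or a non-capturing group: the delimiters are dropped.
without_group = re.split(r"\(\([^)]+\)\)", text)
non_capturing = re.split(r"(?:\(\([^)]+\)\))", text)

print(with_group)
# ['This is a ', '((test))', ' of the ', '((emergency broadcasting station.))', '']
print(without_group)
# ['This is a ', ' of the ', '']
print(non_capturing == without_group)  # True
```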

Cheers,
Cameron Simpson <cs@zip.com.au>

I thought the DoD was a bunch of licensed squids. The last thing you
need is a bunch of unregulated, amateur squids running loose.
         - David Wood <davewood@teleport.com>


#74338

From	Roy Smith <roy@panix.com>
Date	2014-07-10 22:18 -0400
Message-ID	<roy-CC0052.22182710072014@news.panix.com>
In reply to	#74337

In article <mailman.11746.1405042179.18130.python-list@python.org>,
 Cameron Simpson <cs@zip.com.au> wrote:

> Outside this are \( and \): these are literal opening and closing bracket 
> characters. So:
> 
>    \(\([^)]+\)\)
>          Two opening brackets, then at least one character which is not a 
>          closing bracket, then two closing brackets.

This is a perfectly OK way to write this, but personally I find my eyes 
start to glaze over whenever I see things like \(\(, so I would probably 
have written it as \({2}.  I find that a little easier to read.  So, for 
the whole thing up to this point:

     \({2}[^)]+\){2}

although, even better would be to use the utterly awesome re.VERBOSE
flag, and write it as:

     \({2} [^)]+ \){2}
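
As a quick sanity check that these spellings are interchangeable, here is a sketch that wraps each one in the capture group from the original example and compares the re.split results. The variable names are illustrative.

```python
import re

text = "This is a ((test)) of the ((emergency broadcasting station.))"

original = re.compile(r"(\(\([^)]+\)\))")
counted = re.compile(r"(\({2}[^)]+\){2})")
# In verbose mode, whitespace outside character classes is ignored.
spaced = re.compile(r"( \({2} [^)]+ \){2} )", re.VERBOSE)

results = [p.split(text) for p in (original, counted, spaced)]
print(results[0] == results[1] == results[2])  # True
```

Note that the spaces must not go between an item and its repeat count: "\( {2}" would not mean the same thing as "\({2}".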


#74339

From	Tim Chase <python.list@tim.thechases.com>
Date	2014-07-10 21:37 -0500
Message-ID	<mailman.11747.1405046292.18130.python-list@python.org>
In reply to	#74338

On 2014-07-10 22:18, Roy Smith wrote:
> > Outside this are \( and \): these are literal opening and closing
> > bracket characters. So:
> > 
> >    \(\([^)]+\)\)
>
> although, even better would be to use the utterly awesome re.VERBOSE
> flag, and write it as:
> 
>      \({2} [^)]+ \){2}

Or heck, use a multi-line verbose expression and comment it for
clarity:

  r = re.compile(r"""
    (            # begin a capture group
     \({2}       # two literal "(" characters
     [^)]+       # one or more non-close-paren characters
     \){2}       # two literal ")" characters
    )            # close the capture group
    """, re.VERBOSE)

-tkc


#74340

From	Roy Smith <roy@panix.com>
Date	2014-07-10 23:33 -0400
Message-ID	<roy-AAB0F1.23332710072014@news.panix.com>
In reply to	#74339

In article <mailman.11747.1405046292.18130.python-list@python.org>,
 Tim Chase <python.list@tim.thechases.com> wrote:

> On 2014-07-10 22:18, Roy Smith wrote:
> > > Outside this are \( and \): these are literal opening and closing
> > > bracket characters. So:
> > > 
> > >    \(\([^)]+\)\)
> >
> > although, even better would be to use the utterly awesome re.VERBOSE
> > flag, and write it as:
> > 
> >      \({2} [^)]+ \){2}
> 
> Or heck, use a multi-line verbose expression and comment it for
> clarity:
> 
>   r = re.compile(r"""
>     (            # begin a capture group
>      \({2}       # two literal "(" characters
>      [^)]+       # one or more non-close-paren characters
>      \){2}       # two literal ")" characters
>     )            # close the capture group
>     """, re.VERBOSE)
> 
> -tkc

Ugh.  That reminds me of the classic commenting anti-pattern:

l = []                  # create an empty list
for i in range(10):     # iterate over the first 10 integers
    l.append(i)         # append each one to the list


#74341

From	Chris Angelico <rosuav@gmail.com>
Date	2014-07-11 14:31 +1000
Message-ID	<mailman.11748.1405053093.18130.python-list@python.org>
In reply to	#74340

On Fri, Jul 11, 2014 at 1:33 PM, Roy Smith <roy@panix.com> wrote:
>> Or heck, use a multi-line verbose expression and comment it for
>> clarity:
>>
>>   r = re.compile(r"""
>>     (            # begin a capture group
>>      \({2}       # two literal "(" characters
>>      [^)]+       # one or more non-close-paren characters
>>      \){2}       # two literal ")" characters
>>     )            # close the capture group
>>     """, re.VERBOSE)
>>
>> -tkc
>
> Ugh.  That reminds me of the classic commenting anti-pattern:
>
> l = []                  # create an empty list
> for i in range(10):     # iterate over the first 10 integers
>     l.append(i)         # append each one to the list

Small difference between the two. Python is designed to be a readable
language... regexps combine all the power and unreadability of machine
code with the portability of machine code.

ChrisA
exaggerating... but only a little


#74344

From	alister <alister.nospam.ware@ntlworld.com>
Date	2014-07-11 08:00 +0000
Message-ID	<NQMvv.113411$G33.97625@fx32.am4>
In reply to	#74340

On Thu, 10 Jul 2014 23:33:27 -0400, Roy Smith wrote:

> In article <mailman.11747.1405046292.18130.python-list@python.org>,
>  Tim Chase <python.list@tim.thechases.com> wrote:
> 
>> On 2014-07-10 22:18, Roy Smith wrote:
>> > > Outside this are \( and \): these are literal opening and closing
>> > > bracket characters. So:
>> > > 
>> > >    \(\([^)]+\)\)
>> >
>> > although, even better would be to use the utterly awesome re.VERBOSE
>> > flag, and write it as:
>> > 
>> >      \({2} [^)]+ \){2}
>> 
>> Or heck, use a multi-line verbose expression and comment it for
>> clarity:
>> 
>>   r = re.compile(r"""
>>     (            # begin a capture group
>>      \({2}       # two literal "(" characters
>>      [^)]+       # one or more non-close-paren characters
>>      \){2}       # two literal ")" characters
>>     )            # close the capture group
>>     """, re.VERBOSE)
>> 
>> -tkc
> 
> Ugh.  That reminds me of the classic commenting anti-pattern:
> 
> l = []                  # create an empty list
> for i in range(10):     # iterate over the first 10 integers
>     l.append(i)         # append each one to the list


To some extent yes, but when it comes to regexes, stating "the bleedin'
obvious" can be useful because, as this whole thread shows, it is not
always "bleedin' obvious", especially after a night's sleep.


-- 
"The identical is equal to itself, since it is different."
		-- Franco Spisani


#74352

From	Steven D'Aprano <steve@pearwood.info>
Date	2014-07-11 09:04 +0000
Message-ID	<53bfa881$0$2746$c3e8da3$76491128@news.astraweb.com>
In reply to	#74340

On Thu, 10 Jul 2014 23:33:27 -0400, Roy Smith wrote:

> In article <mailman.11747.1405046292.18130.python-list@python.org>,
>  Tim Chase <python.list@tim.thechases.com> wrote:
> 
>> On 2014-07-10 22:18, Roy Smith wrote:
>> > > Outside this are \( and \): these are literal opening and closing
>> > > bracket characters. So:
>> > > 
>> > >    \(\([^)]+\)\)
>> >
>> > although, even better would be to use the utterly awesome re.VERBOSE
>> > flag, and write it as:
>> > 
>> >      \({2} [^)]+ \){2}
>> 
>> Or heck, use a multi-line verbose expression and comment it for
>> clarity:
>> 
>>   r = re.compile(r"""
>>     (            # begin a capture group
>>      \({2}       # two literal "(" characters
>>      [^)]+       # one or more non-close-paren characters
>>      \){2}       # two literal ")" characters
>>     )            # close the capture group
>>     """, re.VERBOSE)
>> 
>> -tkc
> 
> Ugh.  That reminds me of the classic commenting anti-pattern:

The sort of dead-simple commenting shown below is not just harmless but 
can be *critically important* for beginners, who otherwise may not know 
what "l = []" means.

> l = []                  # create an empty list 
> for i in range(10):     # iterate over the first 10 integers
>     l.append(i)         # append each one to the list

The difference is, most people get beyond that level of competence in a 
matter of a few weeks or months, whereas regexes are a different story. 

(1) It's possible to have spent a decade programming in Python without 
ever developing more than a basic understanding of regexes. Regular 
expressions are a specialist mini-language for a specialist task, and one 
might go months or even *years* between needing to use them.

(2) We're *Python* programmers, not *Regex* programmers, so regular 
expressions are as much a foreign language to us as Perl or Lisp or C 
might be. (And if you personally read any of those languages, 
congratulations. How about APL, J, REBOL, Smalltalk, Forth, or PL/I?)

(3) The syntax for regexes is painfully terse and violates a number of
important rules of good design. Larry Wall has listed no fewer than 19
problems with regex syntax/culture:

http://perl6.org/archive/doc/design/apo/A05.html

So all things considered, for the average Python programmer who has a
basic understanding of regexes but has to keep turning to the manual to
find out how to do even simple things, comments explaining what the regex
does are an excellent idea.
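
For what it's worth, here is a sketch of how a commented verbose pattern like the one quoted above might be used in practice. The pattern is Tim's; the name `marked`, the sample text from the original post, and the slicing trick are illustrative additions.

```python
import re

marked = re.compile(r"""
    (            # begin a capture group
     \({2}       # two literal "(" characters
     [^)]+       # one or more non-close-paren characters
     \){2}       # two literal ")" characters
    )            # close the capture group
    """, re.VERBOSE)

text = "This is a ((test)) of the ((emergency broadcasting station.))"
parts = marked.split(text)

# With exactly one capture group, odd indexes hold the captured
# delimiters and even indexes hold the text between them.
delimiters = parts[1::2]
plain = parts[0::2]

print(delimiters)  # ['((test))', '((emergency broadcasting station.))']
print(plain)       # ['This is a ', ' of the ', '']
```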

-- 
Steven


#74359

From	Albert-Jan Roskam <fomcl@yahoo.com>
Date	2014-07-11 08:18 -0700
Message-ID	<mailman.11761.1405092218.18130.python-list@python.org>
In reply to	#74352


----- Original Message -----

> From: Steven D'Aprano <steve@pearwood.info>
> To: python-list@python.org
> Cc: 
> Sent: Friday, July 11, 2014 11:04 AM
> Subject: Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example
> 
> On Thu, 10 Jul 2014 23:33:27 -0400, Roy Smith wrote:
> 
>>  In article <mailman.11747.1405046292.18130.python-list@python.org>,
>>   Tim Chase <python.list@tim.thechases.com> wrote:
>> 
>>>  On 2014-07-10 22:18, Roy Smith wrote:
>>>  > > Outside this are \( and \): these are literal opening and closing
>>>  > > bracket characters. So:
>>>  > > 
>>>  > >    \(\([^)]+\)\)
>>>  >
>>>  > although, even better would be to use the utterly awesome re.VERBOSE
>>>  > flag, and write it as:
>>>  > 
>>>  >      \({2} [^)]+ \){2}
>>> 
>>>  Or heck, use a multi-line verbose expression and comment it for
>>>  clarity:
>>> 
>>>    r = re.compile(r"""
>>>      (            # begin a capture group
>>>       \({2}       # two literal "(" characters
>>>       [^)]+       # one or more non-close-paren characters
>>>       \){2}       # two literal ")" characters
>>>      )            # close the capture group
>>>      """, re.VERBOSE)
>>> 
>>>  -tkc
>> 
>>  Ugh.  That reminds me of the classic commenting anti-pattern:
> 
> The sort of dead-simple commenting shown below is not just harmless but 
> can be *critically important* for beginners, who otherwise may not know 
> what "l = []" means.
> 
>>  l = []                  # create an empty list 
>>  for i in range(10):     # iterate over the first 10 integers
>>      l.append(i)         # append each one to the list
> 

Anything is better than this hideous type of commenting: (?#...), e.g.
>>> re.match("(19|20)[0-9]{2}(?#year)-[0-9]{2}(?#month)", "2010-12")
Same thing for the 'limsux' modifiers, although *maybe* they can be useful.
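
A quick sketch showing that the (?#...) comments are simply ignored by the regex engine, plus one inline modifier for comparison. The uncommented pattern and the "python" strings are made up for illustration.

```python
import re

commented = r"(19|20)[0-9]{2}(?#year)-[0-9]{2}(?#month)"
uncommented = r"(19|20)[0-9]{2}-[0-9]{2}"

m1 = re.match(commented, "2010-12")
m2 = re.match(uncommented, "2010-12")
print(m1.group(0), m2.group(0))  # 2010-12 2010-12
print(m1.group(1))               # 20

# An inline modifier such as (?i) at the start of the pattern has the
# same effect as passing the corresponding flag, here re.IGNORECASE.
inline = re.match(r"(?i)python", "PYTHON")
flagged = re.match(r"python", "PYTHON", re.IGNORECASE)
print(inline is not None and flagged is not None)  # True
```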

