| Started by | fl <rxjwg98@gmail.com> |
|---|---|
| First post | 2014-07-10 08:37 -0700 |
| Last post | 2014-07-11 08:18 -0700 |
| Articles | 13 — 11 participants |
How to decipher :re.split(r"(\(\([^)]+\)\))" in the example fl <rxjwg98@gmail.com> - 2014-07-10 08:37 -0700
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Peter Otten <__peter__@web.de> - 2014-07-10 18:49 +0200
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example MRAB <python@mrabarnett.plus.com> - 2014-07-10 18:01 +0100
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Joel Goldstick <joel.goldstick@gmail.com> - 2014-07-10 13:05 -0400
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Albert-Jan Roskam <fomcl@yahoo.com> - 2014-07-10 12:15 -0700
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Cameron Simpson <cs@zip.com.au> - 2014-07-11 11:29 +1000
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Roy Smith <roy@panix.com> - 2014-07-10 22:18 -0400
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Tim Chase <python.list@tim.thechases.com> - 2014-07-10 21:37 -0500
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Roy Smith <roy@panix.com> - 2014-07-10 23:33 -0400
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Chris Angelico <rosuav@gmail.com> - 2014-07-11 14:31 +1000
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example alister <alister.nospam.ware@ntlworld.com> - 2014-07-11 08:00 +0000
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Steven D'Aprano <steve@pearwood.info> - 2014-07-11 09:04 +0000
Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example Albert-Jan Roskam <fomcl@yahoo.com> - 2014-07-11 08:18 -0700
| From | fl <rxjwg98@gmail.com> |
|---|---|
| Date | 2014-07-10 08:37 -0700 |
| Subject | How to decipher :re.split(r"(\(\([^)]+\)\))" in the example |
| Message-ID | <981c1f5f-2c19-4efc-8397-796bde07f39b@googlegroups.com> |
Hi,
This example is from the link:
https://wiki.python.org/moin/RegularExpression
I have thought about it quite a while without a clue yet. I notice that it uses
double quote ", in contrast to ' which I see more often until now.
It looks very complicated to me. Could you simplify it into a simple example?
Thanks,
import re
split_up = re.split(r"(\(\([^)]+\)\))",
"This is a ((test)) of the ((emergency broadcasting station.))")
...which produces:
["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-07-10 18:49 +0200 |
| Message-ID | <mailman.11733.1405010988.18130.python-list@python.org> |
| In reply to | #74312 |
fl wrote:
> Hi,
>
> This example is from the link:
>
> https://wiki.python.org/moin/RegularExpression
>
>
> I have thought about it quite a while without a clue yet. I notice that it
> uses double quote ", in contrast to ' which I see more often until now.
> It looks very complicated to me. Could you simplified it to a simple
> example?
Just break it into its components.
"(...)" in the context of re.split() keeps the delimiters while just "..."
does not. Example:
>>> re.split("a+", "abbaaababa")
['', 'bb', 'b', 'b', '']
>>> re.split("(a+)", "abbaaababa")
['', 'a', 'bb', 'aaa', 'b', 'a', 'b', 'a', '']
r"\(" matches the opening parenthesis. The "(" has to be escaped because
it otherwise has a special meaning (begin group) in a regex.
"[abc]" matches a, b, or c. A leading ^ inverts the set, so "[^abc]" matches
anything but a, b, or c. Therefore "[^)]" matches anything but the closing
parenthesis.
The complete regex then is: match two opening parens, then one or more chars
that are not closing parens, then two closing parens, and make the complete
group part of the resulting list.
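As a rough sketch of the description above, the snippet below runs the same pattern with and without the outer capture group. The input string and the variable names are made up for illustration and do not come from the thread.

```python
import re

text = "x((y))z"  # made-up input, short enough to check by eye

# No capture group: the "((y))" delimiter is dropped.
without_group = re.split(r"\(\([^)]+\)\)", text)

# Outer capture group: the delimiter is kept as its own list item.
with_group = re.split(r"(\(\([^)]+\)\))", text)

print(without_group)  # ['x', 'z']
print(with_group)     # ['x', '((y))', 'z']
```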
PS: Note that sometimes the re.DEBUG flag may be helpful in understanding
noisy regexes:
>>> re.compile(r"(\(\([^)]+\)\))", re.DEBUG)
subpattern 1
  literal 40
  literal 40
  max_repeat 1 4294967295
    not_literal 41
  literal 41
  literal 41
<_sre.SRE_Pattern object at 0x7f5740455c90>
> import re
> split_up = re.split(r"(\(\([^)]+\)\))",
> "This is a ((test)) of the ((emergency broadcasting station.))")
>
>
> ...which produces:
>
>
> ["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2014-07-10 18:01 +0100 |
| Message-ID | <mailman.11735.1405011682.18130.python-list@python.org> |
| In reply to | #74312 |
On 2014-07-10 16:37, fl wrote:
> Hi,
>
> This example is from the link:
>
> https://wiki.python.org/moin/RegularExpression
>
>
> I have thought about it quite a while without a clue yet. I notice that it uses
> double quote ", in contrast to ' which I see more often until now.
> It looks very complicated to me. Could you simplified it to a simple example?
>
>
> Thanks,
>
>
>
>
>
> import re
> split_up = re.split(r"(\(\([^)]+\)\))",
> "This is a ((test)) of the ((emergency broadcasting station.))")
>
>
> ...which produces:
>
>
> ["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]
>
No it doesn't; you've omitted the final string.
The regex means:
(        Start of capture group.
\(       Literal "(".
\(       Literal "(".
[^)]+    One or more repeats of any character except a literal ")".
\)       Literal ")".
\)       Literal ")".
)        End of capture group.
.split returns a list of the parts of the string between the matches,
and if, as in this example, there are capture groups, then those too:
[
'This is a ', # The part before the first
# match.
'((test))', # The first match (group 1).
' of the ', # The part between the first
# and second matches.
'((emergency broadcasting station.))', # The second match.
'' # The part after the second
# match.
]
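The point about the final empty string can be checked directly. The sketch below reruns the example from the original post; the filtering step at the end is only one possible way to drop empty pieces and is an assumption about what a caller might want, not something the thread prescribes.

```python
import re

parts = re.split(r"(\(\([^)]+\)\))",
                 "This is a ((test)) of the ((emergency broadcasting station.))")

# re.split reports the text after the last match even when it is empty.
print(len(parts))   # 5
print(repr(parts[-1]))  # ''

# Optional: drop empty pieces if they are not wanted.
non_empty = [p for p in parts if p]
print(non_empty)
```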
| From | Joel Goldstick <joel.goldstick@gmail.com> |
|---|---|
| Date | 2014-07-10 13:05 -0400 |
| Message-ID | <mailman.11736.1405011952.18130.python-list@python.org> |
| In reply to | #74312 |
On Thu, Jul 10, 2014 at 11:37 AM, fl <rxjwg98@gmail.com> wrote:
> Hi,
>
> This example is from the link:
>
> https://wiki.python.org/moin/RegularExpression
>
> I have thought about it quite a while without a clue yet. I notice that it uses
> double quote ", in contrast to ' which I see more often until now.
Double quotes or single quotes -- doesn't matter.
> It looks very complicated to me. Could you simplified it to a simple example?
>
You might read up first here: https://docs.python.org/2/library/re.html
If you are just new to learning python, regular expressions are not a
good place to start. But if you insist, the page you are looking at
is more of a cheat sheet . Try the python docs, and tutorial first.
Or google.
>
> Thanks,
>
> import re
> split_up = re.split(r"(\(\([^)]+\)\))",
> "This is a ((test)) of the ((emergency broadcasting station.))")
>
The outer parens are for grouping. I'm not good at regexes but it
looks like it wants two open parens followed by any number of
characters that are anything but a close paren, followed by two close
parens. So whenever it finds that pattern it splits off what is on
either side of it.
> ...which produces:
>
> ["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]
> --
> https://mail.python.org/mailman/listinfo/python-list
--
Joel Goldstick
http://joelgoldstick.com
| From | Albert-Jan Roskam <fomcl@yahoo.com> |
|---|---|
| Date | 2014-07-10 12:15 -0700 |
| Message-ID | <mailman.11743.1405020017.18130.python-list@python.org> |
| In reply to | #74312 |
----- Original Message -----
> From: Joel Goldstick <joel.goldstick@gmail.com>
> To: fl <rxjwg98@gmail.com>
> Cc: "python-list@python.org" <python-list@python.org>
> Sent: Thursday, July 10, 2014 7:05 PM
> Subject: Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example
>
> On Thu, Jul 10, 2014 at 11:37 AM, fl <rxjwg98@gmail.com> wrote:
>> Hi,
>>
>> This example is from the link:
>>
>> https://wiki.python.org/moin/RegularExpression
>>
>> I have thought about it quite a while without a clue yet. I notice that it uses
>> double quote ", in contrast to ' which I see more often until now.
>
> Double quotes or single quotes -- doesn't matter.
>
>> It looks very complicated to me. Could you simplified it to a simple example?
>>
> You might read up first here: https://docs.python.org/2/library/re.html
>
> If you are just new to learning python, regular expressions are not a
> good place to start. But if you insist, the page you are looking at
> is more of a cheat sheet .
The free sample chapter from Mark Summerfield's book is about regular expressions:
http://www.informit.com/content/images/9780321680563/samplepages/0321680561_Sample.pdf
That whole book is superb, and the regex chapter is no exception.
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2014-07-11 11:29 +1000 |
| Message-ID | <mailman.11746.1405042179.18130.python-list@python.org> |
| In reply to | #74312 |
On 10Jul2014 08:37, fl <rxjwg98@gmail.com> wrote:
>This example is from the link:
>
>https://wiki.python.org/moin/RegularExpression
>
>I have thought about it quite a while without a clue yet.
>I notice that it uses
>double quote ", in contrast to ' which I see more often until now.
With raw strings (r', r") this doesn't matter. I tend to use r' myself.
You want raw strings with regular expressions because otherwise their heavy use
of sloshes (backslashes, "\") overlaps with Python's own use of sloshes, making
everything harder.
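A small sketch of the raw-string point, using only string comparisons and one word-boundary search; the sample text in the last line is made up for illustration.

```python
import re

# The quote character does not change the string.
same_quotes = r"\(" == r'\('

# A raw string keeps the backslash; a normal string needs it doubled.
same_escape = r"\(" == "\\("

# Classic pitfall: "\b" is one backspace character, r"\b" is two characters
# that the regex engine reads as a word boundary.
lengths = (len("\b"), len(r"\b"))

words = re.findall(r"\bcat\b", "cat concat cat.")

print(same_quotes, same_escape, lengths, words)
```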
>It looks very complicated to me. Could you simplified it to a simple example?
>
>import re
>split_up = re.split(r"(\(\([^)]+\)\))",
> "This is a ((test)) of the ((emergency broadcasting station.))")
>
>...which produces:
>
>["This is a ", "((test))", " of the ", "((emergency broadcasting station.))" ]
Rip off the python punctuation and get the regexp itself:
(\(\([^)]+\)\))
then start from the inside out:
[^)]     Any character except a closing bracket.
+        One or more of the preceding.
Therefore:
[^)]+    One or more characters which are not closing brackets.
         Also phrased: at least one character which is not a closing bracket.
Outside this are \( and \): these are literal opening and closing bracket
characters. So:
\(\([^)]+\)\)
Two opening brackets, then at least one character which is not a
closing bracket, then two closing brackets.
The outermost ( and ) are regexp grouping brackets, not text. For plain matching
you don't need them, but they mark out the regexp between them for later reference
or for use with a repeating modifier like ?, * or +. In re.split they do matter:
because the delimiter pattern sits inside a capture group, the matched delimiters
are kept in the result list instead of being dropped.
Given the above inside-to-out explanation, does that explain the re.split
result for you?
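The inside-out reading can be tried step by step. This is only a sketch: the names inner, bracketed and grouped are invented here, and re.findall and re.search are used simply to show what each stage matches on the example text.

```python
import re

text = "This is a ((test)) of the ((emergency broadcasting station.))"

inner = r"[^)]+"                        # one or more non-")" characters
bracketed = r"\(\(" + inner + r"\)\)"   # wrapped in literal "((" and "))"
grouped = "(" + bracketed + ")"         # outer capture group

# On its own, the inner piece just grabs every run without a ")".
runs = re.findall(inner, text)

# With the literal brackets added, only the "((...))" parts match.
marked = re.findall(bracketed, text)

# The outer group captures exactly what the whole pattern matched.
m = re.search(grouped, text)

print(runs)
print(marked)
print(m.group(0) == m.group(1))
```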
Cheers,
Cameron Simpson <cs@zip.com.au>
I thought the DoD was a bunch of licensed squids. The last thing you
need is a bunch of unregulated, amateur squids running loose.
- David Wood <davewood@teleport.com>
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-07-10 22:18 -0400 |
| Message-ID | <roy-CC0052.22182710072014@news.panix.com> |
| In reply to | #74337 |
In article <mailman.11746.1405042179.18130.python-list@python.org>,
Cameron Simpson <cs@zip.com.au> wrote:
> Outside this are \( and \): these are literal opening and closing bracket
> characters. So:
>
> \(\([^)]+\)\)
> Two opening brackets, then at least one character which is not a
> closing bracket, then two closing brackets.
This is a perfectly OK way to write this, but personally I find my eyes
start to glaze over whenever I see things like \(\(, so I would probably
have written it as \({2}. I find that a little easier to read. So, for
the whole thing up to this point:
\({2}[^)]+\){2}
although, even better would be to use the utterly awesome re.VERBOSE
flag, and write it as:
\({2} [^)]+ \){2}
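A quick check that the three spellings behave the same, at least on the example text from the original post. The variable names are illustrative only.

```python
import re

text = "This is a ((test)) of the ((emergency broadcasting station.))"

plain = re.compile(r"\(\([^)]+\)\)")
counted = re.compile(r"\({2}[^)]+\){2}")
# With re.VERBOSE, whitespace outside character classes is ignored.
spaced = re.compile(r"\({2} [^)]+ \){2}", re.VERBOSE)

results = [p.findall(text) for p in (plain, counted, spaced)]
print(results[0] == results[1] == results[2])
```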
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-07-10 21:37 -0500 |
| Message-ID | <mailman.11747.1405046292.18130.python-list@python.org> |
| In reply to | #74338 |
On 2014-07-10 22:18, Roy Smith wrote:
> > Outside this are \( and \): these are literal opening and closing
> > bracket characters. So:
> >
> > \(\([^)]+\)\)
>
> although, even better would be to use the utterly awesome re.VERBOSE
> flag, and write it as:
>
> \({2} [^)]+ \){2}
Or heck, use a multi-line verbose expression and comment it for
clarity:
r = re.compile(r"""
    (           # begin a capture group
     \({2}      # two literal "(" characters
     [^)]+      # one or more non-close-paren characters
     \){2}      # two literal ")" characters
    )           # close the capture group
    """, re.VERBOSE)
-tkc
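For completeness, a sketch of how a commented pattern like this could be used with split. The name delimiter and the short input string are assumptions made for the example, not part of the post above.

```python
import re

delimiter = re.compile(r"""
    (           # begin a capture group
      \({2}     # two literal "(" characters
      [^)]+     # one or more non-close-paren characters
      \){2}     # two literal ")" characters
    )           # close the capture group
    """, re.VERBOSE)

# Made-up input with text on both sides of the delimiter.
parts = delimiter.split("a ((b)) c")
print(parts)  # ['a ', '((b))', ' c']
```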
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-07-10 23:33 -0400 |
| Message-ID | <roy-AAB0F1.23332710072014@news.panix.com> |
| In reply to | #74339 |
In article <mailman.11747.1405046292.18130.python-list@python.org>,
Tim Chase <python.list@tim.thechases.com> wrote:
> On 2014-07-10 22:18, Roy Smith wrote:
> > > Outside this are \( and \): these are literal opening and closing
> > > bracket characters. So:
> > >
> > > \(\([^)]+\)\)
> >
> > although, even better would be to use the utterly awesome re.VERBOSE
> > flag, and write it as:
> >
> > \({2} [^)]+ \){2}
>
> Or heck, use a multi-line verbose expression and comment it for
> clarity:
>
> r = re.compile(r"""
> ( # begin a capture group
> \({2} # two literal "(" characters
> [^)]+ # one or more non-close-paren characters
> \){2} # two literal ")" characters
> ) # close the capture group
> """, re.VERBOSE)
>
> -tkc
Ugh. That reminds me of the classic commenting anti-pattern:
l = []                 # create an empty list
for i in range(10):    # iterate over the first 10 integers
    l.append(i)        # append each one to the list
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-07-11 14:31 +1000 |
| Message-ID | <mailman.11748.1405053093.18130.python-list@python.org> |
| In reply to | #74340 |
On Fri, Jul 11, 2014 at 1:33 PM, Roy Smith <roy@panix.com> wrote:
>> Or heck, use a multi-line verbose expression and comment it for
>> clarity:
>>
>> r = re.compile(r"""
>> ( # begin a capture group
>> \({2} # two literal "(" characters
>> [^)]+ # one or more non-close-paren characters
>> \){2} # two literal ")" characters
>> ) # close the capture group
>> """, re.VERBOSE)
>>
>> -tkc
>
> Ugh. That reminds me of the classic commenting anti-pattern:
>
> l = []                 # create an empty list
> for i in range(10):    # iterate over the first 10 integers
>     l.append(i)        # append each one to the list
Small difference between the two. Python is designed to be a readable
language... regexps combine all the power and unreadability of machine
code with the portability of machine code.
ChrisA
exaggerating... but only a little
| From | alister <alister.nospam.ware@ntlworld.com> |
|---|---|
| Date | 2014-07-11 08:00 +0000 |
| Message-ID | <NQMvv.113411$G33.97625@fx32.am4> |
| In reply to | #74340 |
On Thu, 10 Jul 2014 23:33:27 -0400, Roy Smith wrote:
> In article <mailman.11747.1405046292.18130.python-list@python.org>,
> Tim Chase <python.list@tim.thechases.com> wrote:
>
>> On 2014-07-10 22:18, Roy Smith wrote:
>> > > Outside this are \( and \): these are literal opening and closing
>> > > bracket characters. So:
>> > >
>> > > \(\([^)]+\)\)
>> >
>> > although, even better would be to use the utterly awesome re.VERBOSE
>> > flag, and write it as:
>> >
>> > \({2} [^)]+ \){2}
>>
>> Or heck, use a multi-line verbose expression and comment it for
>> clarity:
>>
>> r = re.compile(r"""
>>     (           # begin a capture group
>>      \({2}      # two literal "(" characters
>>      [^)]+      # one or more non-close-paren characters
>>      \){2}      # two literal ")" characters
>>     )           # close the capture group
>>     """, re.VERBOSE)
>>
>> -tkc
>
> Ugh. That reminds me of the classic commenting anti-pattern:
>
> l = []                 # create an empty list
> for i in range(10):    # iterate over the first 10 integers
>     l.append(i)        # append each one to the list
To some extent yes, but when it comes to regexes, stating "the bleedin'
obvious" can be useful because, as this whole thread shows, it is not
always "bleedin' obvious", especially after a night's sleep.
--
"The identical is equal to itself, since it is different."
-- Franco Spisani
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2014-07-11 09:04 +0000 |
| Message-ID | <53bfa881$0$2746$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #74340 |
On Thu, 10 Jul 2014 23:33:27 -0400, Roy Smith wrote:
> In article <mailman.11747.1405046292.18130.python-list@python.org>,
> Tim Chase <python.list@tim.thechases.com> wrote:
>
>> On 2014-07-10 22:18, Roy Smith wrote:
>> > > Outside this are \( and \): these are literal opening and closing
>> > > bracket characters. So:
>> > >
>> > > \(\([^)]+\)\)
>> >
>> > although, even better would be to use the utterly awesome re.VERBOSE
>> > flag, and write it as:
>> >
>> > \({2} [^)]+ \){2}
>>
>> Or heck, use a multi-line verbose expression and comment it for
>> clarity:
>>
>> r = re.compile(r"""
>>     (           # begin a capture group
>>      \({2}      # two literal "(" characters
>>      [^)]+      # one or more non-close-paren characters
>>      \){2}      # two literal ")" characters
>>     )           # close the capture group
>>     """, re.VERBOSE)
>>
>> -tkc
>
> Ugh. That reminds me of the classic commenting anti-pattern:
The sort of dead-simple commenting shown below is not just harmless but
can be *critically important* for beginners, who otherwise may not know
what "l = []" means.
> l = []                 # create an empty list
> for i in range(10):    # iterate over the first 10 integers
>     l.append(i)        # append each one to the list
The difference is, most people get beyond that level of competence in a
matter of a few weeks or months, whereas regexes are a different story.
(1) It's possible to have spent a decade programming in Python without
ever developing more than a basic understanding of regexes. Regular
expressions are a specialist mini-language for a specialist task, and one
might go months or even *years* between needing to use them.
(2) We're *Python* programmers, not *Regex* programmers, so regular
expressions are as much a foreign language to us as Perl or Lisp or C
might be. (And if you personally read any of those languages,
congratulations. How about APL, J, REBOL, Smalltalk, Forth, or PL/I?)
(3) The syntax for regexes is painfully terse and violates a number of
important rules of good design. Larry Wall has listed no fewer than 19
problems with regex syntax/culture:
http://perl6.org/archive/doc/design/apo/A05.html
So all things considered, for the average Python programmer who has a
basic understanding of regexes but has to keep turning to the manual to
find out how to do even simple things, comments explaining what the regex
does is an excellent idea.
--
Steven
| From | Albert-Jan Roskam <fomcl@yahoo.com> |
|---|---|
| Date | 2014-07-11 08:18 -0700 |
| Message-ID | <mailman.11761.1405092218.18130.python-list@python.org> |
| In reply to | #74352 |
----- Original Message -----
> From: Steven D'Aprano <steve@pearwood.info>
> To: python-list@python.org
> Cc:
> Sent: Friday, July 11, 2014 11:04 AM
> Subject: Re: How to decipher :re.split(r"(\(\([^)]+\)\))" in the example
>
> On Thu, 10 Jul 2014 23:33:27 -0400, Roy Smith wrote:
>
>> In article <mailman.11747.1405046292.18130.python-list@python.org>,
>> Tim Chase <python.list@tim.thechases.com> wrote:
>>
>>> On 2014-07-10 22:18, Roy Smith wrote:
>>> > > Outside this are \( and \): these are literal opening and closing
>>> > > bracket characters. So:
>>> > >
>>> > > \(\([^)]+\)\)
>>> >
>>> > although, even better would be to use the utterly awesome re.VERBOSE
>>> > flag, and write it as:
>>> >
>>> > \({2} [^)]+ \){2}
>>>
>>> Or heck, use a multi-line verbose expression and comment it for
>>> clarity:
>>>
>>> r = re.compile(r"""
>>>     (           # begin a capture group
>>>      \({2}      # two literal "(" characters
>>>      [^)]+      # one or more non-close-paren characters
>>>      \){2}      # two literal ")" characters
>>>     )           # close the capture group
>>>     """, re.VERBOSE)
>>>
>>> -tkc
>>
>> Ugh. That reminds me of the classic commenting anti-pattern:
>
> The sort of dead-simple commenting shown below is not just harmless but
> can be *critically important* for beginners, who otherwise may not know
> what "l = []" means.
>
>> l = []                 # create an empty list
>> for i in range(10):    # iterate over the first 10 integers
>>     l.append(i)        # append each one to the list
>
Anything is better than this hideous type of commenting: (?#...), e.g.
>>> re.match("(19|20)[0-9]{2}(?#year)-[0-9]{2}(?#month)", "2010-12")
Same thing for the 'limsux' modifiers, although *maybe* they can be useful.
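To illustrate both constructs: the sketch below assumes 'limsux' refers to the inline flag letters written as (?i), (?x) and so on. It shows that a (?#...) comment does not change what is matched, and that an inline flag at the start of a pattern acts like the corresponding re flag. The "python" test strings are made up.

```python
import re

commented = r"(19|20)[0-9]{2}(?#year)-[0-9]{2}(?#month)"
bare = r"(19|20)[0-9]{2}-[0-9]{2}"

# The (?#...) comments are ignored by the regex engine.
m_commented = re.match(commented, "2010-12")
m_bare = re.match(bare, "2010-12")
print(m_commented.group(0), m_bare.group(0))

# An inline flag at the start of the pattern, here (?i) for case-insensitive.
inline = re.match(r"(?i)python", "PYTHON")
flagged = re.match(r"python", "PYTHON", re.IGNORECASE)
print(inline is not None, flagged is not None)
```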