Regex Question

Started by	Frank Koshti <frank.koshti@gmail.com>
First post	2012-08-17 21:41 -0700
Last post	2012-08-18 13:30 -0400
Articles	14 — 10 participants

  Regex Question Frank Koshti <frank.koshti@gmail.com> - 2012-08-17 21:41 -0700
    Re: Regex Question Chris Angelico <rosuav@gmail.com> - 2012-08-18 15:42 +1000
    Re: Regex Question Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-18 11:50 +0100
    Re: Regex Question Roy Smith <roy@panix.com> - 2012-08-18 09:08 -0400
      Re: Regex Question Frank Koshti <frank.koshti@gmail.com> - 2012-08-18 07:21 -0700
    Re: Regex Question Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-18 14:22 +0000
      Re: Regex Question Frank Koshti <frank.koshti@gmail.com> - 2012-08-18 07:53 -0700
        Re: Regex Question Peter Otten <__peter__@web.de> - 2012-08-18 17:48 +0200
          Re: Regex Question Frank Koshti <frank.koshti@gmail.com> - 2012-08-18 08:56 -0700
        Re: Regex Question Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-08-18 17:50 +0200
        Re: Regex Question Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2012-08-18 19:22 +0300
          Re: Regex Question Frank Koshti <frank.koshti@gmail.com> - 2012-08-18 13:18 -0700
      Re: Regex Question python@bdurham.com - 2012-08-18 12:36 -0400
    Re: Regex Question Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-18 13:30 -0400

#27282 — Regex Question

From	Frank Koshti <frank.koshti@gmail.com>
Date	2012-08-17 21:41 -0700
Subject	Regex Question
Message-ID	<385e732e-1c02-4dd0-ab12-b92890bbed66@o3g2000yqp.googlegroups.com>

Hi,

I'm new to regular expressions. I want to be able to match for tokens
with all their properties in the following examples. I would
appreciate some direction on how to proceed.


<h1>@foo1</h1>
<p>@foo2()</p>
<p>@foo3(anything could go here)</p>


Thanks-
Frank

#27284

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-18 15:42 +1000
Message-ID	<mailman.3442.1345268549.4697.python-list@python.org>
In reply to	#27282

On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti <frank.koshti@gmail.com> wrote:
> Hi,
>
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
>
>
> <h1>@foo1</h1>
> <p>@foo2()</p>
> <p>@foo3(anything could go here)</p>

You can find regular expression primers all over the internet - fire
up your favorite search engine and type those three words in. But it
may be that what you want here is a more flexible parser; have you
looked at BeautifulSoup (so rich and green)?

ChrisA

#27290

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-18 11:50 +0100
Message-ID	<mailman.3445.1345286931.4697.python-list@python.org>
In reply to	#27282

On 18/08/2012 06:42, Chris Angelico wrote:
> On Sat, Aug 18, 2012 at 2:41 PM, Frank Koshti <frank.koshti@gmail.com> wrote:
>> Hi,
>>
>> I'm new to regular expressions. I want to be able to match for tokens
>> with all their properties in the following examples. I would
>> appreciate some direction on how to proceed.
>>
>>
>> <h1>@foo1</h1>
>> <p>@foo2()</p>
>> <p>@foo3(anything could go here)</p>
>
> You can find regular expression primers all over the internet - fire
> up your favorite search engine and type those three words in. But it
> may be that what you want here is a more flexible parser; have you
> looked at BeautifulSoup (so rich and green)?
>
> ChrisA
>

Totally agree with the sentiment.  There's a comparison of Python 
parsers here: http://nedbatchelder.com/text/python-parsers.html

-- 
Cheers.

Mark Lawrence.

#27292

From	Roy Smith <roy@panix.com>
Date	2012-08-18 09:08 -0400
Message-ID	<roy-AF4B44.09081118082012@news.panix.com>
In reply to	#27282

In article 
<385e732e-1c02-4dd0-ab12-b92890bbed66@o3g2000yqp.googlegroups.com>,
 Frank Koshti <frank.koshti@gmail.com> wrote:

> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
> 
> 
> <h1>@foo1</h1>
> <p>@foo2()</p>
> <p>@foo3(anything could go here)</p>

Don't try to parse HTML with regexes.  Use a real HTML parser, such as 
lxml (http://lxml.de/).

#27293

From	Frank Koshti <frank.koshti@gmail.com>
Date	2012-08-18 07:21 -0700
Message-ID	<c334103a-42cf-40c9-8ece-2e433d1ad38a@t18g2000yqi.googlegroups.com>
In reply to	#27292

I think the point was missed. I don't want to use an XML parser. The
point is to pick up those tokens, and yes I've done my share of RTFM.
This is what I've come up with:

'\$\w*\(?.*?\)'

Which doesn't work well on the above example, which is partly why I
reached out to the group. Can anyone help me with the regex?

Thanks,
Frank

#27294

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-18 14:22 +0000
Message-ID	<502fa524$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27282

On Fri, 17 Aug 2012 21:41:07 -0700, Frank Koshti wrote:

> Hi,
> 
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would appreciate
> some direction on how to proceed.

Others have already given you excellent advice to NOT use regular 
expressions to parse HTML files, but to use a proper HTML parser instead.

However, since I remember how hard it was to get started with regexes, 
I'm going to ignore that advice and show you how to abuse regexes to 
search for text, and pretend that they aren't HTML tags.

Here's your string you want to search for:

> <h1>@foo1</h1>

You want to find a piece of text that starts with "<h1>@", followed by 
any alphanumeric characters, followed by "</h1>".

We start by compiling a regex:

import re
pattern = r"<h1>@\w+</h1>"
regex = re.compile(pattern, re.I)

First we import the re module. Then we define a pattern string. Note that 
I use a "raw string" instead of a regular string -- this is not 
compulsory, but it is very common.

The difference between a raw string and a regular string is how they 
handle backslashes. In Python, some (but not all!) backslashes are 
special. For example, the regular string "\n" is not two characters, 
backslash-n, but a single character, Newline. The Python string parser 
converts backslash combinations into special characters, e.g.:

\n => newline
\t => tab
\0 => ASCII Null character
\\ => a single backslash
etc.

We often call these "backslash escapes".

Regular expressions use a lot of backslashes, and so it is useful to 
disable the interpretation of backslash escapes when writing regex 
patterns. We do that with a "raw string" -- if you prefix the string with 
the letter r, the string is raw and backslash-escapes are ignored:

# ordinary "cooked" string:
"abc\n" => a b c newline

# raw string
r"abc\n" => a b c backslash n
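
A quick way to see the difference for yourself is to compare lengths. 
This is only a small sketch; the variable names are made up for the 
illustration:

```python
# Compare a regular ("cooked") string with a raw string.
cooked = "abc\n"    # a, b, c, newline
raw = r"abc\n"      # a, b, c, backslash, n

print(len(cooked))  # 4
print(len(raw))     # 5

# In a regular string the backslash itself has to be doubled,
# so these two spellings give the same pattern text.
print(r"\w+" == "\\w+")  # True
```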

Here is our pattern again:

pattern = r"<h1>@\w+</h1>"

which is thirteen characters:

less-than h 1 greater-than at-sign backslash w plus-sign less-than slash 
h 1 greater-than

Most of the characters shown just match themselves. For example, the @ 
sign will only match another @ sign. But some have special meaning to the 
regex:

\w doesn't match "backslash w", but any alphanumeric character or an underscore;

+ doesn't match a plus sign, but tells the regex to match the previous 
symbol one or more times. Since it immediately follows \w, this means 
"match at least one alphanumeric character".

Now we feed that string into re.compile to create a pre-compiled 
regex. (This step is optional: any function which takes a compiled regex 
will also accept a string pattern. But pre-compiling regexes which you 
are going to use repeatedly is a good idea.)
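
Both points (what \w+ picks up, and that compiling is optional) can be 
checked in a few lines. The sample strings here are invented for the 
illustration and are not from the original question:

```python
import re

sample = "foo_1 bar-baz!"   # made-up input

# Module-level function with a plain string pattern.
# \w+ takes runs of word characters: letters, digits, and the underscore.
words = re.findall(r"\w+", sample)
print(words)                # ['foo_1', 'bar', 'baz']

# The same search through a pre-compiled regex object.
compiled = re.compile(r"\w+")
print(compiled.findall(sample) == words)   # True

# "@" with no word character after it does not satisfy @\w+.
tokens = re.findall(r"@\w+", "@ @a @abc")
print(tokens)               # ['@a', '@abc']
```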

regex = re.compile(pattern, re.I)

The second argument to re.compile is a flag, re.I which is a special 
value that tells the regular expression to ignore case, so "h" will match 
both "h" and "H".
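
To see what the flag changes, try the same pattern with and without it. 
A small sketch, using a hypothetical input with upper-case tags:

```python
import re

pattern = r"<h1>@\w+</h1>"
shouty = "<H1>@Victory</H1>"   # hypothetical input with upper-case tags

# Without the flag, "h" in the pattern only matches a lower-case "h".
strict = re.compile(pattern).search(shouty)
print(strict)                  # None

# With re.I (short for re.IGNORECASE) the same pattern matches.
relaxed = re.compile(pattern, re.I).search(shouty)
print(relaxed.group(0))        # <H1>@Victory</H1>
```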

Now on to use the regex. Here's a bunch of text to search:

text = """Now is the time for all good men blah blah blah <h1>spam</h1>
and more text here blah blah blah
and some more <h1>@victory</h1> blah blah blah"""

And we search it this way:

mo = re.search(regex, text)

"mo" stands for "Match Object", which is returned if the regular 
expression finds something that matches your pattern. If nothing matches, 
then None is returned instead.

if mo is not None:
    print(mo.group(0))

=> prints <h1>@victory</h1>

So far so good. But we can do better. In this case, we don't really care 
about the tags <h1>, we only care about the "victory" part. Here's how to 
use grouping to extract substrings from the regex:

pattern = r"<h1>@(\w+)</h1>"  # notice the round brackets ()
regex = re.compile(pattern, re.I)
mo = re.search(regex, text)
if mo is not None:
    print(mo.group(0))
    print(mo.group(1))

This prints:

<h1>@victory</h1>
victory
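
Note that re.search stops at the first match. If the text may contain 
several tokens, findall or finditer is probably closer to what you want. 
A sketch with made-up sample text:

```python
import re

regex = re.compile(r"<h1>@(\w+)</h1>", re.I)
sample = "<h1>@spam</h1> filler <H1>@eggs</H1> <h1>plain</h1>"  # made-up text

# With exactly one group in the pattern, findall returns the group contents.
names = regex.findall(sample)
print(names)   # ['spam', 'eggs']

# finditer yields match objects, so group(0) and group(1) are both available.
pairs = [(mo.group(0), mo.group(1)) for mo in regex.finditer(sample)]
print(pairs)   # [('<h1>@spam</h1>', 'spam'), ('<H1>@eggs</H1>', 'eggs')]
```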

Hope this helps.

-- 
Steven

#27295

From	Frank Koshti <frank.koshti@gmail.com>
Date	2012-08-18 07:53 -0700
Message-ID	<79aaa167-296a-4a0c-8a06-c4e67cf53597@j19g2000yqi.googlegroups.com>
In reply to	#27294

Hey Steven,

Thank you for the detailed (and well-written) tutorial on this very
issue. I actually learned a few things! Though, I still have
unresolved questions.

The reason I don't want to use an XML parser is because the tokens are
not always placed in HTML, and even in HTML, they may appear in
strange places, such as <h1 $foo(x=3)>Hello</h1>. My specific issue is
I need to match, process and replace $foo(x=3), knowing that (x=3) is
optional, and the token might appear simply as $foo.

To do this, I decided to use:

re.compile('\$\w*\(?.*?\)').findall(mystring)

the issue with this is it doesn't match $foo by itself, and requires
there to be () at the end.
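
To make the symptom concrete, here is a small check of that pattern. 
The sample strings are made up for the illustration. Because the 
closing \) is mandatory while \(? is optional, the lazy .*? simply 
stretches until it reaches the next ")" somewhere later in the text:

```python
import re

# The pattern from the post, written as a raw string.
pattern = re.compile(r'\$\w*\(?.*?\)')

# A bare token is not found at all: there is no ")" to finish the match.
alone = pattern.findall("$foo")
print(alone)     # []

# Worse, a bare token followed later by another token's ")" is swallowed whole.
runaway = pattern.findall("$foo and $bar(x=3)")
print(runaway)   # ['$foo and $bar(x=3)']

# The case with parentheses right after the name does work.
attr = pattern.findall("<h1 $foo(x=3)>Hello</h1>")
print(attr)      # ['$foo(x=3)']
```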

Thanks,
Frank

#27302

From	Peter Otten <__peter__@web.de>
Date	2012-08-18 17:48 +0200
Message-ID	<mailman.3455.1345304932.4697.python-list@python.org>
In reply to	#27295

Frank Koshti wrote:

> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
> 
> To do this, I decided to use:
> 
> re.compile('\$\w*\(?.*?\)').findall(mystring)
> 
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

>>> s = """
... <h1>$foo1</h1>
... <p>$foo2()</p>
... <p>$foo3(anything could go here)</p>
... """
>>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
['$foo1', '$foo2()', '$foo3(anything could go here)']
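
For readers new to regexes, the same idea can be spelled out piece by 
piece with re.VERBOSE. This is a sketch and not the exact pattern 
above: it is written as a raw string, the outer capturing group is 
dropped (findall then returns whole matches), and the last sample line 
is borrowed from the earlier post:

```python
import re

token = re.compile(r"""
    \$          # a literal dollar sign
    \w+         # the token name: one or more word characters
    (?:         # non-capturing group for the optional argument list
        \(      #   a literal opening parenthesis
        .*?     #   as few characters as possible, newlines excluded
        \)      #   up to the first closing parenthesis
    )?          # the whole argument list may be absent
""", re.VERBOSE)

s = """
<h1>$foo1</h1>
<p>$foo2()</p>
<p>$foo3(anything could go here)</p>
<h1 $foo(x=3)>Hello</h1>
"""

found = token.findall(s)
print(found)
# ['$foo1', '$foo2()', '$foo3(anything could go here)', '$foo(x=3)']
```

The key change from the earlier attempt is that the parentheses and 
their contents are grouped together and made optional as one unit.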

#27305

From	Frank Koshti <frank.koshti@gmail.com>
Date	2012-08-18 08:56 -0700
Message-ID	<c7326342-f9fc-426f-bb8a-38b9cea79ebc@y1g2000yqc.googlegroups.com>
In reply to	#27302

On Aug 18, 11:48 am, Peter Otten <__pete...@web.de> wrote:
> Frank Koshti wrote:
> > I need to match, process and replace $foo(x=3), knowing that (x=3) is
> > optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
>
> >>> s = """
> ... <h1>$foo1</h1>
> ... <p>$foo2()</p>
> ... <p>$foo3(anything could go here)</p>
> ... """
> >>> re.compile("(\$\w+(?:\(.*?\))?)").findall(s)
> ['$foo1', '$foo2()', '$foo3(anything could go here)']

PERFECT-

#27303

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2012-08-18 17:50 +0200
Message-ID	<mailman.3456.1345305020.4697.python-list@python.org>
In reply to	#27295

2012/8/18 Frank Koshti <frank.koshti@gmail.com>:
> Hey Steven,
>
> Thank you for the detailed (and well-written) tutorial on this very
> issue. I actually learned a few things! Though, I still have
> unresolved questions.
>
> The reason I don't want to use an XML parser is because the tokens are
> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as <h1 $foo(x=3)>Hello</h1>. My specific issue is
> I need to match, process and replace $foo(x=3), knowing that (x=3) is
> optional, and the token might appear simply as $foo.
>
> To do this, I decided to use:
>
> re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.
>
> Thanks,
> Frank
> --
> http://mail.python.org/mailman/listinfo/python-list

Hi,
Although I don't quite get the pattern you are using (with respect to
the specified task), you most likely need raw string syntax for the
pattern, e.g.: r"...", instead of "...", or you have to double all
backslashes (which should be escaped), i.e. \\w etc.

I am likely misunderstanding the specification, as the following:
>>> re.sub(r"\$foo\(x=3\)", "bar", "<h1 $foo(x=3)>Hello</h1>")
'<h1 bar>Hello</h1>'
>>>
is probably not the desired output.

To do some kind of "processing" on the matched text, you can also pass a
replacement function to re.sub instead of a replacement string.
See
http://docs.python.org/library/re.html#re.sub
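
A minimal sketch of that approach follows. The TOKEN pattern, the 
expand function and the bracketed output format are all assumptions 
made up for this example; only re.sub accepting a callable comes from 
the documentation linked above:

```python
import re

# Hypothetical token pattern: $name, optionally followed by (arguments).
TOKEN = re.compile(r"\$(\w+)(?:\((.*?)\))?")

def expand(match):
    """Build the replacement text for one matched token."""
    name = match.group(1)
    args = match.group(2)      # None when there are no parentheses
    if args is None:
        return "[{}]".format(name)
    return "[{}: {}]".format(name, args)

# re.sub calls expand once per match and inserts whatever it returns.
result = TOKEN.sub(expand, "<h1 $foo(x=3)>Hello $bar</h1>")
print(result)   # <h1 [foo: x=3]>Hello [bar]</h1>
```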

hth,
  vbr

#27308

From	Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date	2012-08-18 19:22 +0300
Message-ID	<qota9xsqieq.fsf@ruuvi.it.helsinki.fi>
In reply to	#27295

Frank Koshti writes:

> not always placed in HTML, and even in HTML, they may appear in
> strange places, such as <h1 $foo(x=3)>Hello</h1>. My specific issue
> is I need to match, process and replace $foo(x=3), knowing that
> (x=3) is optional, and the token might appear simply as $foo.
> 
> To do this, I decided to use:
> 
> re.compile('\$\w*\(?.*?\)').findall(mystring)
> 
> the issue with this is it doesn't match $foo by itself, and requires
> there to be () at the end.

Adding a ? after the meant-to-be-optional expression would let the
regex engine know what you want. You can also separate the mandatory
and the optional part in the regex to receive pairs as matches. The
test program below prints this:

>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc</htm
('$foo', '')
('$foo', '(bar=3)')
('$foo', '($)')
('$foo', '')
('$bar', '(v=0)')

Here is the program:

import re

def grab(text):
    p = re.compile(r'([$]\w+)([(][^()]+[)])?')
    return re.findall(p, text)

def test(html):
    print(html)
    for hit in grab(html):
        print(hit)

if __name__ == '__main__':
    test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc</htm')
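
One detail worth noticing in the output: $foo() comes back with an 
empty second element, because [^()]+ asks for at least one character 
between the parentheses. If empty parentheses should be captured too, 
a variant with * in place of + might do. This is a sketch of that 
assumption, not part of the program above:

```python
import re

# Pattern from the program above: needs at least one character inside ().
p_plus = re.compile(r'([$]\w+)([(][^()]+[)])?')

# Hypothetical variant: also accepts empty parentheses.
p_star = re.compile(r'([$]\w+)([(][^()]*[)])?')

text = '$foo() $foo(bar=3) $foo'   # made-up input

with_plus = p_plus.findall(text)
with_star = p_star.findall(text)
print(with_plus)  # [('$foo', ''), ('$foo', '(bar=3)'), ('$foo', '')]
print(with_star)  # [('$foo', '()'), ('$foo', '(bar=3)'), ('$foo', '')]
```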

#27330

From	Frank Koshti <frank.koshti@gmail.com>
Date	2012-08-18 13:18 -0700
Message-ID	<a3d60340-be24-4e6e-ac02-b77631f6a115@z6g2000vbc.googlegroups.com>
In reply to	#27308

On Aug 18, 12:22 pm, Jussi Piitulainen <jpiit...@ling.helsinki.fi>
wrote:
> Frank Koshti writes:
> > not always placed in HTML, and even in HTML, they may appear in
> > strange places, such as <h1 $foo(x=3)>Hello</h1>. My specific issue
> > is I need to match, process and replace $foo(x=3), knowing that
> > (x=3) is optional, and the token might appear simply as $foo.
>
> > To do this, I decided to use:
>
> > re.compile('\$\w*\(?.*?\)').findall(mystring)
>
> > the issue with this is it doesn't match $foo by itself, and requires
> > there to be () at the end.
>
> Adding a ? after the meant-to-be-optional expression would let the
> regex engine know what you want. You can also separate the mandatory
> and the optional part in the regex to receive pairs as matches. The
> test program below prints this:
>
> >$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc</htm
>
> ('$foo', '')
> ('$foo', '(bar=3)')
> ('$foo', '($)')
> ('$foo', '')
> ('$bar', '(v=0)')
>
> Here is the program:
>
> import re
>
> def grab(text):
>     p = re.compile(r'([$]\w+)([(][^()]+[)])?')
>     return re.findall(p, text)
>
> def test(html):
>     print(html)
>     for hit in grab(html):
>         print(hit)
>
> if __name__ == '__main__':
>     test('>$foo()$foo(bar=3)$$$foo($)$foo($bar(v=0))etc</htm')

You read my mind. I didn't even know that's possible. Thank you-

#27309

From	python@bdurham.com
Date	2012-08-18 12:36 -0400
Message-ID	<mailman.3458.1345307814.4697.python-list@python.org>
In reply to	#27294

Steven,

Well done!!!

Regards,
Malcolm

#27317

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2012-08-18 13:30 -0400
Message-ID	<mailman.3464.1345311306.4697.python-list@python.org>
In reply to	#27282

On Fri, 17 Aug 2012 21:41:07 -0700 (PDT), Frank Koshti
<frank.koshti@gmail.com> declaimed the following in
gmane.comp.python.general:

> Hi,
> 
> I'm new to regular expressions. I want to be able to match for tokens
> with all their properties in the following examples. I would
> appreciate some direction on how to proceed.
> 
> 
> <h1>@foo1</h1>
> <p>@foo2()</p>
> <p>@foo3(anything could go here)</p>
>
	That looks like HTML... DON'T USE regular expressions -- use an HTML
parser (or for well-formed XHTML, an XML parser -- element-tree or
such), and walk the resultant structure...
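
For well-formed input, the parser route could look roughly like this. 
It is a minimal sketch using the standard library's 
xml.etree.ElementTree; the wrapping div element is an assumption added 
so that the three sample lines form a single document, and only each 
element's own text is inspected (attributes and trailing text are 
ignored):

```python
import xml.etree.ElementTree as ET

# The sample lines from the question, wrapped in a hypothetical <div>
# so that the parser sees one root element.
doc = """<div>
<h1>@foo1</h1>
<p>@foo2()</p>
<p>@foo3(anything could go here)</p>
</div>"""

root = ET.fromstring(doc)

# Walk every element and keep those whose text starts with "@".
found = [
    (element.tag, element.text)
    for element in root.iter()
    if element.text and element.text.startswith("@")
]
print(found)
# [('h1', '@foo1'), ('p', '@foo2()'), ('p', '@foo3(anything could go here)')]
```
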
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
