
Weird problem matching with REs

Started by	Andrew Berg <bahamutzero8825@gmail.com>
First post	2011-05-29 06:45 -0500
Last post	2011-05-29 21:06 +0200
Articles	15 — 7 participants


  Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 06:45 -0500
    Re: Weird problem matching with REs Ben Finney <ben+python@benfinney.id.au> - 2011-05-29 23:00 +1000
      Re: Weird problem matching with REs Ben Finney <ben+python@benfinney.id.au> - 2011-05-29 23:03 +1000
      Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 08:29 -0500
    Re: Weird problem matching with REs Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-29 13:09 +0000
      Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 08:41 -0500
        Re: Weird problem matching with REs Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-29 14:18 +0000
          Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 09:35 -0500
            Re: Weird problem matching with REs John S <jstrickler@gmail.com> - 2011-05-29 08:48 -0700
              Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 11:16 -0500
                Re: Weird problem matching with REs John S <jstrickler@gmail.com> - 2011-05-29 09:45 -0700
              Re: Weird problem matching with REs Chris Angelico <rosuav@gmail.com> - 2011-05-30 03:57 +1000
        Re: Weird problem matching with REs Roy Smith <roy@panix.com> - 2011-05-29 11:19 -0400
          Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 10:31 -0500
            Re: Weird problem matching with REs Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-05-29 21:06 +0200

#6529 — Weird problem matching with REs

From	Andrew Berg <bahamutzero8825@gmail.com>
Date	2011-05-29 06:45 -0500
Subject	Weird problem matching with REs
Message-ID	<mailman.2220.1306669538.9059.python-list@python.org>

I have an RE that should work (it even works in Kodos [1], but not in my
code), but it keeps failing to match characters after a newline.

I'm writing a little program that scans the webpage of an arbitrary
application and gets the newest version advertised on the page.


test3.py:
> # -*- coding: utf-8 -*-
>
> import configparser
> import re
> import urllib.request
> import os
> import sys
> import logging
> import collections
>
>
> class CouldNotFindVersion(Exception):
>     def __init__(self, app_name, reason, exc_value):
>         self.value = ('The latest version of ' + app_name +
>                       ' could not be determined because ' + reason)
>         self.cause = exc_value
>     def __str__(self):
>         return repr(self.value)
>
> class AppUpdateItem():
>     def __init__(self, config_file_name, config_file_section):
>         self.section = config_file_section
>         self.name = self.section['Name']
>         self.url = self.section['URL']
>         self.filename = self.section['Filename']
>         self.file_re = re.compile(self.section['FileURLRegex'])
>         self.ver_re = re.compile(self.section['VersionRegex'])
>         self.prev_ver = self.section['CurrentVersion']
>         try:
>             self.page = str(urllib.request.urlopen(self.url).read(),
>                             encoding='utf-8')
>             # here is where it fails
>             self.file_URL = self.file_re.findall(self.page)[0]
>             self.last_ver = self.ver_re.findall(self.file_URL)[0]
>         except urllib.error.URLError:
>             self.error = str(sys.exc_info()[1])
>             logging.info('[' + self.name + ']' + ' Could not load URL: '
>                          + self.url + ' : ' + self.error)
>             self.success = False
>             raise CouldNotFindVersion(self.name, self.error,
>                                       sys.exc_info()[0])
>         except IndexError:
>             logging.warning('Regex did not return a match.')
>     def update_ini(self):
>         self.section['CurrentVersion'] = self.last_ver
>         with open(config_file_name, 'w') as configfile:
>             config.write(configfile)
>     def rollback_ini(self):
>         self.section['CurrentVersion'] = self.prev_ver
>         with open(config_file_name, 'w') as configfile:
>             config.write(configfile)
>     def download_file(self):
>         self.__filename = self.section['Filename']
>         with open(self.__filename, 'wb') as file:
>             self.__file_req = urllib.request.urlopen(self.file_URL).read()
>             file.write(self.__file_req)
>
>
> if __name__ == '__main__':
>     config = configparser.ConfigParser()
>     config_file = 'checklist.ini'
>     config.read(config_file)
>     queue = collections.deque()
>     for section in config.sections():
>         try:
>             queue.append(AppUpdateItem(config_file, config[section]))
>         except CouldNotFindVersion as exc:
>             logging.warning(exc.value)
>     for elem in queue:
>         if elem.last_ver != elem.prev_ver:
>             elem.update_ini()
>             try:
>                 elem.download_file()
>             except IOError:
>                 logging.warning('[' + elem.name + '] Download failed.')
>             except:
>                 elem.rollback_ini()
>         print(elem.name + ' succeeded.')

checklist.ini:
> [x264_64]
> name = x264 (64-bit)
> filename = x264.exe
> url = http://x264.nl/x264_main.php
> fileurlregex = http://x264.nl/x264/64bit/8bit_depth/revision\n{0,3}[0-9]{4}\n{0,3}/x264\n{0,3}.exe
> versionregex = [0-9]{4}
> currentversion = 1995

The part it's supposed to match in http://x264.nl/x264_main.php:
> <a href="http://x264.nl/x264/64bit/8bit_depth/revision
> 1995
> /x264
>
> .exe <view-source-tab:http://x264.nl/x264/64bit/8bit_depth/revision%0A1995%0A/x264%0A%0A.exe>" 
I was able to make a regex that matches in my code, but it shouldn't:
http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
I have to add a dot before each "\n". There is no character not
accounted for before those newlines, but I don't get a match without the
dots. I also need both those ".\n{1,3}" sequences before the ".exe". I'm
really confused.

Using Python 3.2 on Windows, in case it matters.


[1] http://kodos.sourceforge.net/ (using the compiled Win32 version
since it doesn't work with Python 3)


#6532

From	Ben Finney <ben+python@benfinney.id.au>
Date	2011-05-29 23:00 +1000
Message-ID	<8739jxacgt.fsf@benfinney.id.au>
In reply to	#6529

Andrew Berg <bahamutzero8825@gmail.com> writes:

> I was able to make a regex that matches in my code, but it shouldn't:
> http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
> I have to add a dot before each "\n". There is no character not
> accounted for before those newlines, but I don't get a match without the
> dots. I also need both those ".\n{1,3}" sequences before the ".exe". I'm
> really confused.
>
> Using Python 3.2 on Windows, in case it matters.

You are aware that most text-emitting processes on Windows, and Internet
text protocols like the HTTP standard, use the two-character “CR LF”
sequence (U+000C U+000A) for terminating lines?

    <URL:http://en.wikipedia.org/wiki/Newline>

-- 
 \          “What I have to do is see, at any rate, that I do not lend |
  `\      myself to the wrong which I condemn.” —Henry Thoreau, _Civil |
_o__)                                                    Disobedience_ |
Ben Finney


#6533

From	Ben Finney <ben+python@benfinney.id.au>
Date	2011-05-29 23:03 +1000
Message-ID	<87y61p8xqq.fsf@benfinney.id.au>
In reply to	#6532

Ben Finney <ben+python@benfinney.id.au> writes:

> the two-character “CR LF” sequence (U+000C U+000A)
>     <URL:http://en.wikipedia.org/wiki/Newline>

As detailed in that Wikipedia article, the characters are of course
U+000D U+000A.

-- 
 \      “You say “Carmina”, and I say “Burana”, You say “Fortuna”, and |
  `\    I say “cantata”, Carmina, Burana, Fortuna, cantata, Let's Carl |
_o__)                                the whole thing Orff.” —anonymous |
Ben Finney


#6535

From	Andrew Berg <bahamutzero8825@gmail.com>
Date	2011-05-29 08:29 -0500
Message-ID	<mailman.2221.1306675766.9059.python-list@python.org>
In reply to	#6532

On 2011.05.29 08:00 AM, Ben Finney wrote:
> You are aware that most text-emitting processes on Windows, and Internet
> text protocols like the HTTP standard, use the two-character “CR LF”
> sequence (U+000C U+000A) for terminating lines?
Yes, but I was not having trouble with just '\n' before, and the pattern
did match in Kodos, so I figured Python was doing its newline magic like
it does with the write() method for file objects.
http://x264.nl/x264/64bit/8bit_depth/revision[\r\n]{1,3}[0-9]{4}[\r\n]{1,3}/x264[\r\n]{1,3}.exe
does indeed match. One thing that confuses me, though (and one reason I
dismissed the possibility of it being a newline issue): isn't '.'
supposed to not match '\r'?
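
A minimal sketch of the newline issue, assuming the server sends CRLF line endings. Decoding bytes with str() does not translate newlines the way text-mode file reading does, so every line break in the decoded page is still '\r\n'. The sample bytes and the variable names are made up for illustration, and the upper bound is widened to 4 because this sample has two CRLF pairs before ".exe".

```python
import re

# Hypothetical sample of the page bytes, assuming CRLF line endings.
raw = (b'<a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
       b'1995\r\n/x264\r\n\r\n.exe">')

# Decoding does no newline translation: the '\r' characters survive.
page = str(raw, encoding='utf-8')

base = r'http://x264\.nl/x264/64bit/8bit_depth/revision'
lf_only = re.compile(base + r'\n{0,3}[0-9]{4}\n{0,3}/x264\n{0,3}\.exe')
cr_or_lf = re.compile(
    base + r'[\r\n]{1,4}[0-9]{4}[\r\n]{1,4}/x264[\r\n]{1,4}\.exe')

lf_only_match = lf_only.search(page)      # None: '\r' is unaccounted for
cr_or_lf_match = cr_or_lf.search(page)    # matches: the class covers both

# Alternative: normalize line endings first, then the LF-only pattern works.
normalized_match = lf_only.search(page.replace('\r\n', '\n'))
```

Normalizing the text once after download keeps the patterns in the config file simpler than spelling out both line terminators in each one.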


#6534

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-05-29 13:09 +0000
Message-ID	<4de2459b$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#6529

On Sun, 29 May 2011 06:45:30 -0500, Andrew Berg wrote:

> I have an RE that should work (it even works in Kodos [1], but not in my
> code), but it keeps failing to match characters after a newline.

Not all regexes are the same. Different regex engines accept different 
symbols, and sometimes behave differently, or have different default 
behavior. That your regex works in Kodos but not Python might mean you're 
writing a Kodos regex instead of a Python regex.

> I'm writing a little program that scans the webpage of an arbitrary
> application and gets the newest version advertised on the page.

Firstly, most of the code you show is irrelevant to the problem. Please 
simplify it to the shortest, most simple example you can give. That would 
be a simplified piece of text (not the entire web page!), the regex, and 
the failed attempt to use it. The rest of your code is just noise for the 
purposes of solving this problem.

Secondly, you probably should use a proper HTML parser, rather than a 
regex. Resist the temptation to use regexes to rip out bits of text from 
HTML, it almost always goes wrong eventually.
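
As a rough illustration of the parser approach, here is a sketch using the standard library's html.parser on a small, well-formed snippet. The LinkCollector class and the sample markup are assumptions made for this example; real-world broken markup may need a more tolerant parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Drop all whitespace, including embedded CR/LF.
                    self.links.append(''.join(value.split()))


# Made-up snippet imitating a link with line breaks inside the URL.
sample = ('<p><a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
          '1995\r\n/x264\r\n\r\n.exe">64-bit build</a></p>')

collector = LinkCollector()
collector.feed(sample)
collector.close()
links = collector.links
```

With the whitespace stripped from the attribute value, a simple pattern such as [0-9]{4} can then be applied to the cleaned URL instead of to the whole page.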

> I was able to make a regex that matches in my code, but it shouldn't:
> http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe

What makes you think it shouldn't match?

By the way, you probably should escape the dots, otherwise it will match 
strings containing any arbitrary character, rather than *just* dots:

http://x264Znl ...blah blah blah
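
A short sketch of the difference, using made-up test strings. re.escape() is one way to get the escaped form of a literal without writing the backslashes by hand.

```python
import re

unescaped = re.compile('http://x264.nl/')
escaped = re.compile(r'http://x264\.nl/')

# An unescaped dot accepts any character except '\n' in that position.
loose_hit = unescaped.search('http://x264Znl/')     # matches
strict_hit = escaped.search('http://x264Znl/')      # None
exact_hit = escaped.search('http://x264.nl/')       # matches

# re.escape() builds the escaped form of a literal string.
auto = re.compile(re.escape('http://x264.nl/'))
auto_hit = auto.search('http://x264Znl/')           # None
```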

-- 
Steven


#6536

From	Andrew Berg <bahamutzero8825@gmail.com>
Date	2011-05-29 08:41 -0500
Message-ID	<mailman.2222.1306676482.9059.python-list@python.org>
In reply to	#6534

On 2011.05.29 08:09 AM, Steven D'Aprano wrote:
> On Sun, 29 May 2011 06:45:30 -0500, Andrew Berg wrote:
>
> > I have an RE that should work (it even works in Kodos [1], but not in my
> > code), but it keeps failing to match characters after a newline.
>
> Not all regexes are the same. Different regex engines accept different 
> symbols, and sometimes behave differently, or have different default 
> behavior. That your regex works in Kodos but not Python might mean you're 
> writing a Kodos regex instead of a Python regex.
Kodos is written in Python and uses Python's regex engine. In fact, it
is specifically intended to debug Python regexes.
> Firstly, most of the code you show is irrelevant to the problem. Please 
> simplify it to the shortest, most simple example you can give. That would 
> be a simplified piece of text (not the entire web page!), the regex, and 
> the failed attempt to use it. The rest of your code is just noise for the 
> purposes of solving this problem.
I wasn't sure how much would be relevant since it could've been a
problem with other code. I do apologize for not putting more effort into
trimming it down, though.
> Secondly, you probably should use a proper HTML parser, rather than a 
> regex. Resist the temptation to use regexes to rip out bits of text from 
> HTML, it almost always goes wrong eventually.
I find this a much simpler approach, especially since I'm dealing with
broken HTML. I guess I don't see how the effort put into learning a
parser and adding the extra code to use it pays off in this particular
endeavor.
> > I was able to make a regex that matches in my code, but it shouldn't:
> > http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
>
> What makes you think it shouldn't match?
AFAIK, dots aren't supposed to match carriage returns or any other
whitespace characters.
> By the way, you probably should escape the dots, otherwise it will match 
> strings containing any arbitrary character, rather than *just* dots:
You're right; I overlooked the dots in the URL.


#6541

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-05-29 14:18 +0000
Message-ID	<4de255a8$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#6536

On Sun, 29 May 2011 08:41:16 -0500, Andrew Berg wrote:

> On 2011.05.29 08:09 AM, Steven D'Aprano wrote:
[...]
> Kodos is written in Python and uses Python's regex engine. In fact, it
> is specifically intended to debug Python regexes.

Fair enough.

>> Secondly, you probably should use a proper HTML parser, rather than a
>> regex. Resist the temptation to use regexes to rip out bits of text
>> from HTML, it almost always goes wrong eventually.
>
> I find this a much simpler approach, especially since I'm dealing with
> broken HTML. I guess I don't see how the effort put into learning a
> parser and adding the extra code to use it pays off in this particular
> endeavor.

The temptation to take short-cuts leads to the Dark Side :)

Perhaps you're right, in this instance. But if you need to deal with 
broken HTML, try BeautifulSoup.


>> What makes you think it shouldn't match?
> 
> AFAIK, dots aren't supposed to match carriage returns or any other
> whitespace characters.

They won't match *newlines* \n unless you pass the DOTALL flag, but they 
do match whitespace:

>>> re.search('abc.efg', '----abc efg----').group()
'abc efg'
>>> re.search('abc.efg', '----abc\refg----').group()
'abc\refg'
>>> re.search('abc.efg', '----abc\nefg----') is None
True
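
For completeness, a small sketch of the DOTALL behaviour mentioned above, plus the CRLF case that started this thread. The test strings are invented for the example.

```python
import re

text = '----abc\nefg----'

default_hit = re.search('abc.efg', text)             # None: '.' skips '\n'
dotall_hit = re.search('abc.efg', text, re.DOTALL)   # matches
inline_hit = re.search('(?s)abc.efg', text)          # same, as an inline flag

# With CRLF text, '.' consumes the '\r' and '\n' matches the line feed,
# which is why a pattern of the form '.\n' matches each line break.
crlf_hit = re.search(r'abc.\nefg', '----abc\r\nefg----')
```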


-- 
Steven


#6542

From	Andrew Berg <bahamutzero8825@gmail.com>
Date	2011-05-29 09:35 -0500
Message-ID	<mailman.2223.1306679725.9059.python-list@python.org>
In reply to	#6541

On 2011.05.29 09:18 AM, Steven D'Aprano wrote:
> >> What makes you think it shouldn't match?
> > 
> > AFAIK, dots aren't supposed to match carriage returns or any other
> > whitespace characters.
>
> They won't match *newlines* \n unless you pass the DOTALL flag, but they 
> do match whitespace:
>
> >>> re.search('abc.efg', '----abc efg----').group()
> 'abc efg'
> >>> re.search('abc.efg', '----abc\refg----').group()
> 'abc\refg'
> >>> re.search('abc.efg', '----abc\nefg----') is None
> True
I got things mixed up there (was thinking whitespace instead of
newlines), but I thought dots aren't supposed to match '\r' (carriage
return). Why is '\r' not considered a newline character?


#6548

From	John S <jstrickler@gmail.com>
Date	2011-05-29 08:48 -0700
Message-ID	<1b8d81c1-ab87-4059-ad55-9f4a39331e7d@u26g2000vby.googlegroups.com>
In reply to	#6542

On May 29, 10:35 am, Andrew Berg <bahamutzero8...@gmail.com> wrote:
> On 2011.05.29 09:18 AM, Steven D'Aprano wrote:
> > >> What makes you think it shouldn't match?
>
> > > AFAIK, dots aren't supposed to match carriage returns or any other
> > > whitespace characters.
>
> I got things mixed up there (was thinking whitespace instead of
> newlines), but I thought dots aren't supposed to match '\r' (carriage
> return). Why is '\r' not considered a newline character?

Dots don't match end-of-line-for-your-current-OS is how I think of
it.

While I usually nod my head at Steven D'Aprano's comments, in
this case I have to say that if you just want to grab something from a
chunk of HTML, full-blown HTML parsers are overkill. True, malformed
HTML can throw you off, but they can also throw a parser off.

I could not make your regex work on my Linux box with Python 2.6.

In your case, and because x264 might change their HTML, I suggest the
following code, which works great on my system. YMMV. I changed your
newline matches to use \s and put some capturing parentheses around
the revision number, so you could grab it.

>>> import urllib2
>>> import re
>>>
>>> content = urllib2.urlopen("http://x264.nl/x264_main.php").read()
>>>
>>> rx_x264version= re.compile(r"http://x264\.nl/x264/64bit/8bit_depth/revision\s*(\d{4})\s*/x264\s*\.exe")
>>>
>>> m = rx_x264version.search(content)
>>> if m:
...     print m.group(1)
...
1995
>>>

\s is your friend -- it matches any whitespace character, including
space, tab, newline, and carriage return. \s* says match 0 or more
whitespace characters, which is what's needed here in case the web site
decides to *not* put whitespace in the middle of a URL...

As Steven said, when you want to match a dot, it needs to be escaped,
although it will work by accident much of the time. Also, be sure to
use a raw string when composing REs, so you don't run into backslash
issues.
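
For Python 3 readers, roughly the same idea as a sketch that runs against a made-up sample string instead of the live page (whose content may differ). The file_url name is an assumption added for this example.

```python
import re

# Made-up sample; the real page content may differ.
content = ('<a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
           '1995\r\n/x264\r\n\r\n.exe">')

rx_x264version = re.compile(
    r'http://x264\.nl/x264/64bit/8bit_depth/revision'
    r'\s*(\d{4})\s*/x264\s*\.exe')

m = rx_x264version.search(content)

# group(1) is the text captured by the first pair of parentheses.
version = m.group(1) if m else None

# group(0) is the whole match; stripping whitespace gives a usable URL.
file_url = ''.join(m.group(0).split()) if m else None

print(version)
```

One pattern with a capturing group can replace the separate file and version regexes, since the version is picked out of the same match.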

HTH,
John Strickler


#6549

From	Andrew Berg <bahamutzero8825@gmail.com>
Date	2011-05-29 11:16 -0500
Message-ID	<mailman.2226.1306685804.9059.python-list@python.org>
In reply to	#6548

On 2011.05.29 10:48 AM, John S wrote:
> Dots don't match end-of-line-for-your-current-OS is how I think of
> it.
IMO, the docs should say the dot matches any character except a line
feed ('\n'), since that is more accurate.
> True, malformed
> HTML can throw you off, but they can also throw a parser off.
That was part of my point. html.parser.HTMLParser from the standard
library will definitely not work on x264.nl's broken HTML, and fixing it
requires lxml (I'm working with Python 3; I've looked into
BeautifulSoup, and it does not work with Python 3 at all). Admittedly,
fixing x264.nl's HTML only requires one or two lines of code, but really
nasty HTML might require quite a bit of work.
> In your case, and because x264 might change their HTML, I suggest the
> following code, which works great on my system. YMMV. I changed your
> newline matches to use \s and put some capturing parentheses around
> the revision number, so you could grab it.
I've been meaning to learn how to use parenthesis groups.
> Also, be sure to
> use a raw string when composing REs, so you don't run into backslash
> issues.
How would I do that when grabbing strings from a config file (via the
configparser module)? Or rather, if I have a predefined variable
containing a string, how do I change it into a raw string?


#6550

From	John S <jstrickler@gmail.com>
Date	2011-05-29 09:45 -0700
Message-ID	<4a265c16-52fb-4483-8bc2-a853c6d18220@dr5g2000vbb.googlegroups.com>
In reply to	#6549

On May 29, 12:16 pm, Andrew Berg <bahamutzero8...@gmail.com> wrote:
>
> I've been meaning to learn how to use parenthesis groups.
> > Also, be sure to
> > use a raw string when composing REs, so you don't run into backslash
> > issues.
>
> How would I do that when grabbing strings from a config file (via the
> configparser module)? Or rather, if I have a predefined variable
> containing a string, how do I change it into a raw string?
When reading the RE from a file it's not an issue. Only literal
strings can be raw. If the data is in a file, the data will not be
parsed by the Python interpreter. This was just a general warning to
anyone working with REs. It didn't apply in this case.

--john strickler


#6556

From	Chris Angelico <rosuav@gmail.com>
Date	2011-05-30 03:57 +1000
Message-ID	<mailman.2230.1306691862.9059.python-list@python.org>
In reply to	#6548

On Mon, May 30, 2011 at 2:16 AM, Andrew Berg <bahamutzero8825@gmail.com> wrote:
>> Also, be sure to
>> use a raw string when composing REs, so you don't run into backslash
>> issues.
> How would I do that when grabbing strings from a config file (via the
> configparser module)? Or rather, if I have a predefined variable
> containing a string, how do I change it into a raw string?
>

"Raw string" is slightly inaccurate. The Python "raw string literal"
syntax is just another form of string literal:

'apostrophe-delimited string'
"quote-delimited string"
"""triple-quote string
which may
go over
multiple lines"""
'''triple-apostrophe string'''
r'raw apostrophe string'
r"raw quote string"

They're all equivalent once you have the string object. The only
difference is how they appear in your source code. If you read
something from a config file, you get a string object directly, and
you delimit it with something else (end of line, or XML closing tag,
or whatever), so you don't have to worry about string quotes.
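
A small sketch of that point. The section name, the key, and the pattern are hypothetical, and interpolation is switched off here as a precaution so that a '%' in a pattern would not be treated specially by configparser.

```python
import configparser
import re

# In source code, the r prefix only changes how the literal is parsed.
raw_is_two_chars = (r'\n' == '\\n')    # backslash followed by 'n'
plain_is_one_char = (len('\n') == 1)   # a single line feed

# Text read from a config file never goes through literal parsing,
# so the backslashes arrive untouched and re interprets them itself.
ini_text = r"""
[demo]
versionregex = revision\s*(\d{4})
"""
config = configparser.ConfigParser(interpolation=None)
config.read_string(ini_text)
pattern = config['demo']['versionregex']

found = re.search(pattern, 'revision\r\n1995')
```

So a "\n" written in the .ini file reaches re.compile() as a backslash and an "n", and the regex engine then treats it as a line feed.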

Chris Angelico


#6545

From	Roy Smith <roy@panix.com>
Date	2011-05-29 11:19 -0400
Message-ID	<roy-7C21B7.11191129052011@news.panix.com>
In reply to	#6536

In article <mailman.2222.1306676482.9059.python-list@python.org>,
 Andrew Berg <bahamutzero8825@gmail.com> wrote:

> Kodos is written in Python and uses Python's regex engine. In fact, it
> is specifically intended to debug Python regexes.

Named after the governor of Tarsus IV?


#6546

From	Andrew Berg <bahamutzero8825@gmail.com>
Date	2011-05-29 10:31 -0500
Message-ID	<mailman.2224.1306683109.9059.python-list@python.org>
In reply to	#6545

On 2011.05.29 10:19 AM, Roy Smith wrote:
> Named after the governor of Tarsus IV?
Judging by the graphic at http://kodos.sourceforge.net/help/kodos.html ,
it's named after the Simpsons character.


#6564

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2011-05-29 21:06 +0200
Message-ID	<4248834.rdbgypaU67@PointedEars.de>
In reply to	#6546

Andrew Berg wrote:

> On 2011.05.29 10:19 AM, Roy Smith wrote:
>> Named after the governor of Tarsus IV?
> Judging by the graphic at http://kodos.sourceforge.net/help/kodos.html ,
> it's named after the Simpsons character.

<OT>

I don't think that's a coincidence; both are from other planets and both are 
rather evil[tm].  Kodos the Executioner, arguably human, became a dictator 
who had thousands killed (by his own account, not to let the rest die of 
hunger); Kodos the slimy extra-terrestrial is a conqueror (and he likes to 
zap humans as well ;-))

[BTW, Tarsus IV, a planet where thousands (would) have died of hunger and 
have died in executions was probably yet another hidden Star Trek euphemism.  
I have found out that Tarsus is, among other things, the name of a 
collection of bones in the human foot next to the heel.  Bones as a 
reference to death aside, see also Achilles for the heel.  But I'm only 
speculating here.]

</OT>

-- 
\\//, PointedEars (F'up2 trek)

Bitte keine Kopien per E-Mail. / Please do not Cc: me.

