| Started by | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| First post | 2011-05-29 06:45 -0500 |
| Last post | 2011-05-29 21:06 +0200 |
| Articles | 15 — 7 participants |
Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 06:45 -0500
Re: Weird problem matching with REs Ben Finney <ben+python@benfinney.id.au> - 2011-05-29 23:00 +1000
Re: Weird problem matching with REs Ben Finney <ben+python@benfinney.id.au> - 2011-05-29 23:03 +1000
Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 08:29 -0500
Re: Weird problem matching with REs Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-29 13:09 +0000
Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 08:41 -0500
Re: Weird problem matching with REs Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-29 14:18 +0000
Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 09:35 -0500
Re: Weird problem matching with REs John S <jstrickler@gmail.com> - 2011-05-29 08:48 -0700
Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 11:16 -0500
Re: Weird problem matching with REs John S <jstrickler@gmail.com> - 2011-05-29 09:45 -0700
Re: Weird problem matching with REs Chris Angelico <rosuav@gmail.com> - 2011-05-30 03:57 +1000
Re: Weird problem matching with REs Roy Smith <roy@panix.com> - 2011-05-29 11:19 -0400
Re: Weird problem matching with REs Andrew Berg <bahamutzero8825@gmail.com> - 2011-05-29 10:31 -0500
Re: Weird problem matching with REs Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-05-29 21:06 +0200
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2011-05-29 06:45 -0500 |
| Subject | Weird problem matching with REs |
| Message-ID | <mailman.2220.1306669538.9059.python-list@python.org> |
I have an RE that should work (it even works in Kodos [1], but not in my
code), but it keeps failing to match characters after a newline.
I'm writing a little program that scans the webpage of an arbitrary
application and gets the newest version advertised on the page.
test3.py:
> # -*- coding: utf-8 -*-
>
> import configparser
> import re
> import urllib.request
> import os
> import sys
> import logging
> import collections
>
>
> class CouldNotFindVersion(Exception):
>     def __init__(self, app_name, reason, exc_value):
>         self.value = 'The latest version of ' + app_name + ' could not be determined because ' + reason
>         self.cause = exc_value
>     def __str__(self):
>         return repr(self.value)
>
> class AppUpdateItem():
>     def __init__(self, config_file_name, config_file_section):
>         self.section = config_file_section
>         self.name = self.section['Name']
>         self.url = self.section['URL']
>         self.filename = self.section['Filename']
>         self.file_re = re.compile(self.section['FileURLRegex'])
>         self.ver_re = re.compile(self.section['VersionRegex'])
>         self.prev_ver = self.section['CurrentVersion']
>         try:
>             self.page = str(urllib.request.urlopen(self.url).read(), encoding='utf-8')
>             self.file_URL = self.file_re.findall(self.page)[0]  # here is where it fails
>             self.last_ver = self.ver_re.findall(self.file_URL)[0]
>         except urllib.error.URLError:
>             self.error = str(sys.exc_info()[1])
>             logging.info('[' + self.name + ']' + ' Could not load URL: ' + self.url + ' : ' + self.error)
>             self.success = False
>             raise CouldNotFindVersion(self.name, self.error, sys.exc_info()[0])
>         except IndexError:
>             logging.warning('Regex did not return a match.')
>     def update_ini(self):
>         self.section['CurrentVersion'] = self.last_ver
>         with open(config_file_name, 'w') as configfile:
>             config.write(configfile)
>     def rollback_ini(self):
>         self.section['CurrentVersion'] = self.prev_ver
>         with open(config_file_name, 'w') as configfile:
>             config.write(configfile)
>     def download_file(self):
>         self.__filename = self.section['Filename']
>         with open(self.__filename, 'wb') as file:
>             self.__file_req = urllib.request.urlopen(self.file_URL).read()
>             file.write(self.__file_req)
>
>
> if __name__ == '__main__':
>     config = configparser.ConfigParser()
>     config_file = 'checklist.ini'
>     config.read(config_file)
>     queue = collections.deque()
>     for section in config.sections():
>         try:
>             queue.append(AppUpdateItem(config_file, config[section]))
>         except CouldNotFindVersion as exc:
>             logging.warning(exc.value)
>     for elem in queue:
>         if elem.last_ver != elem.prev_ver:
>             elem.update_ini()
>             try:
>                 elem.download_file()
>             except IOError:
>                 logging.warning('[' + elem.name + '] Download failed.')
>             except:
>                 elem.rollback_ini()
>             print(elem.name + ' succeeded.')
checklist.ini:
> [x264_64]
> name = x264 (64-bit)
> filename = x264.exe
> url = http://x264.nl/x264_main.php
> fileurlregex = http://x264.nl/x264/64bit/8bit_depth/revision\n{0,3}[0-9]{4}\n{0,3}/x264\n{0,3}.exe
> versionregex = [0-9]{4}
> currentversion = 1995
The part it's supposed to match in http://x264.nl/x264_main.php:
> <a href="http://x264.nl/x264/64bit/8bit_depth/revision
> 1995
> /x264
>
> .exe"
I was able to make a regex that matches in my code, but it shouldn't:
http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
I have to add a dot before each "\n". There is no character not
accounted for before those newlines, but I don't get a match without the
dots. I also need both those ".\n{1,3}" sequences before the ".exe". I'm
really confused.
Using Python 3.2 on Windows, in case it matters.
[1] http://kodos.sourceforge.net/ (using the compiled Win32 version
since it doesn't work with Python 3)
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2011-05-29 23:00 +1000 |
| Message-ID | <8739jxacgt.fsf@benfinney.id.au> |
| In reply to | #6529 |
Andrew Berg <bahamutzero8825@gmail.com> writes:
> I was able to make a regex that matches in my code, but it shouldn't:
> http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
> I have to add a dot before each "\n". There is no character not
> accounted for before those newlines, but I don't get a match without the
> dots. I also need both those ".\n{1,3}" sequences before the ".exe". I'm
> really confused.
>
> Using Python 3.2 on Windows, in case it matters.
You are aware that most text-emitting processes on Windows, and Internet
text protocols like the HTTP standard, use the two-character “CR LF”
sequence (U+000C U+000A) for terminating lines?
<URL:http://en.wikipedia.org/wiki/Newline>
--
\ “What I have to do is see, at any rate, that I do not lend |
`\ myself to the wrong which I condemn.” —Henry Thoreau, _Civil |
_o__) Disobedience_ |
Ben Finney
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2011-05-29 23:03 +1000 |
| Message-ID | <87y61p8xqq.fsf@benfinney.id.au> |
| In reply to | #6532 |
Ben Finney <ben+python@benfinney.id.au> writes:
> the two-character “CR LF” sequence (U+000C U+000A)
> <URL:http://en.wikipedia.org/wiki/Newline>
As detailed in that Wikipedia article, the characters are of course
U+000D U+000A.
--
\ “You say “Carmina”, and I say “Burana”, You say “Fortuna”, and |
`\ I say “cantata”, Carmina, Burana, Fortuna, cantata, Let's Carl |
_o__) the whole thing Orff.” —anonymous |
Ben Finney
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2011-05-29 08:29 -0500 |
| Message-ID | <mailman.2221.1306675766.9059.python-list@python.org> |
| In reply to | #6532 |
On 2011.05.29 08:00 AM, Ben Finney wrote:
> You are aware that most text-emitting processes on Windows, and Internet
> text protocols like the HTTP standard, use the two-character “CR LF”
> sequence (U+000C U+000A) for terminating lines?
Yes, but I was not having trouble with just '\n' before, and the pattern
did match in Kodos, so I figured Python was doing its newline magic like
it does with the write() method for file objects.
http://x264.nl/x264/64bit/8bit_depth/revision[\r\n]{1,3}[0-9]{4}[\r\n]{1,3}/x264[\r\n]{1,3}.exe
does indeed match. One thing that confuses me, though (and one reason I
dismissed the possibility of it being a newline issue): isn't '.'
supposed to not match '\r'?
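A minimal sketch of why the dotted pattern matched while the '\n'-only one did not, assuming the page really uses CR LF line endings as suggested above. The `page` string is a hand-built stand-in for the fetched HTML, not the real response, and `[\r\n]*` is used instead of a bounded count so that the blank line before `.exe` is covered.

```python
import re

# Hypothetical stand-in for the relevant part of the page, with CR LF line endings.
page = ('<a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
        '1995\r\n'
        '/x264\r\n'
        '\r\n'
        '.exe"')

prefix = r'http://x264\.nl/x264/64bit/8bit_depth/revision'

# The pattern from checklist.ini: only line feeds allowed between the parts.
lf_only = re.compile(prefix + r'\n{0,3}[0-9]{4}\n{0,3}/x264\n{0,3}\.exe')

# The "shouldn't work" pattern: a dot in front of every run of line feeds.
dotted = re.compile(prefix + r'.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe')

# Any mix of carriage returns and line feeds between the parts.
cr_or_lf = re.compile(prefix + r'[\r\n]*[0-9]{4}[\r\n]*/x264[\r\n]*\.exe')

lf_only_matched = lf_only.search(page) is not None
dotted_matched = dotted.search(page) is not None
cr_or_lf_matched = cr_or_lf.search(page) is not None
print(lf_only_matched, dotted_matched, cr_or_lf_matched)
```

Under these assumptions only the '\n'-only pattern fails. In the dotted pattern, each unescaped dot in front of a line feed consumes a '\r', which is why it appeared to be needed.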
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-05-29 13:09 +0000 |
| Message-ID | <4de2459b$0$29996$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #6529 |
On Sun, 29 May 2011 06:45:30 -0500, Andrew Berg wrote:
> I have an RE that should work (it even works in Kodos [1], but not in my
> code), but it keeps failing to match characters after a newline.
Not all regexes are the same. Different regex engines accept different
symbols, and sometimes behave differently, or have different default
behavior. That your regex works in Kodos but not Python might mean you're
writing a Kodos regex instead of a Python regex.
> I'm writing a little program that scans the webpage of an arbitrary
> application and gets the newest version advertised on the page.
Firstly, most of the code you show is irrelevant to the problem. Please
simplify it to the shortest, most simple example you can give. That would
be a simplified piece of text (not the entire web page!), the regex, and
the failed attempt to use it. The rest of your code is just noise for the
purposes of solving this problem.
Secondly, you probably should use a proper HTML parser, rather than a
regex. Resist the temptation to use regexes to rip out bits of text from
HTML, it almost always goes wrong eventually.
> I was able to make a regex that matches in my code, but it shouldn't:
> http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
What makes you think it shouldn't match?
By the way, you probably should escape the dots, otherwise it will match
strings containing any arbitrary character, rather than *just* dots:
http://x264Znl ...blah blah blah
--
Steven
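A rough sketch of the parser route using only the standard library's `html.parser`, assuming a current Python 3 where `HTMLParser` tolerates sloppy markup. The `HrefCollector` class and the closing `>download</a>` in `sample` are made up for illustration; only the `href` value is taken from the thread.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, with embedded whitespace removed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Drop the CR, LF, spaces and tabs that the page embeds in the URL.
                self.hrefs.append(''.join(value.split()))


# Hypothetical sample: the href from the original post, closed off by hand.
sample = ('<a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
          '1995\r\n/x264\r\n\r\n.exe">download</a>')

collector = HrefCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

Once the whitespace is stripped from the attribute value, a simple version regex such as `[0-9]{4}` can be applied to a clean URL, so line endings no longer matter.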
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2011-05-29 08:41 -0500 |
| Message-ID | <mailman.2222.1306676482.9059.python-list@python.org> |
| In reply to | #6534 |
On 2011.05.29 08:09 AM, Steven D'Aprano wrote:
> On Sun, 29 May 2011 06:45:30 -0500, Andrew Berg wrote:
>
> > I have an RE that should work (it even works in Kodos [1], but not in my
> > code), but it keeps failing to match characters after a newline.
>
> Not all regexes are the same. Different regex engines accept different
> symbols, and sometimes behave differently, or have different default
> behavior. That your regex works in Kodos but not Python might mean you're
> writing a Kodos regex instead of a Python regex.
Kodos is written in Python and uses Python's regex engine. In fact, it
is specifically intended to debug Python regexes.
> Firstly, most of the code you show is irrelevant to the problem. Please
> simplify it to the shortest, most simple example you can give. That would
> be a simplified piece of text (not the entire web page!), the regex, and
> the failed attempt to use it. The rest of your code is just noise for the
> purposes of solving this problem.
I wasn't sure how much would be relevant since it could've been a
problem with other code. I do apologize for not putting more effort into
trimming it down, though.
> Secondly, you probably should use a proper HTML parser, rather than a
> regex. Resist the temptation to use regexes to rip out bits of text from
> HTML, it almost always goes wrong eventually.
I find this a much simpler approach, especially since I'm dealing with
broken HTML. I guess I don't see how the effort put into learning a
parser and adding the extra code to use it pays off in this particular
endeavor.
> > I was able to make a regex that matches in my code, but it shouldn't:
> > http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
>
> What makes you think it shouldn't match?
AFAIK, dots aren't supposed to match carriage returns or any other
whitespace characters.
> By the way, you probably should escape the dots, otherwise it will match
> strings containing any arbitrary character, rather than *just* dots:
You're right; I overlooked the dots in the URL.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-05-29 14:18 +0000 |
| Message-ID | <4de255a8$0$29996$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #6536 |
On Sun, 29 May 2011 08:41:16 -0500, Andrew Berg wrote:
> On 2011.05.29 08:09 AM, Steven D'Aprano wrote:
[...]
> Kodos is written in Python and uses Python's regex engine. In fact, it
> is specifically intended to debug Python regexes.
Fair enough.
>> Secondly, you probably should use a proper HTML parser, rather than a
>> regex. Resist the temptation to use regexes to rip out bits of text
>> from HTML, it almost always goes wrong eventually.
>
> I find this a much simpler approach, especially since I'm dealing with
> broken HTML. I guess I don't see how the effort put into learning a
> parser and adding the extra code to use it pays off in this particular
> endeavor.
The temptation to take short-cuts leads to the Dark Side :)
Perhaps you're right, in this instance. But if you need to deal with
broken HTML, try BeautifulSoup.
>> What makes you think it shouldn't match?
>
> AFAIK, dots aren't supposed to match carriage returns or any other
> whitespace characters.
They won't match *newlines* \n unless you pass the DOTALL flag, but they
do match whitespace:
>>> re.search('abc.efg', '----abc efg----').group()
'abc efg'
>>> re.search('abc.efg', '----abc\refg----').group()
'abc\refg'
>>> re.search('abc.efg', '----abc\nefg----') is None
True
--
Steven
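A small sketch extending the session above with the DOTALL flag. In Python's `re` module, the "newline" that a dot refuses to match by default is the line feed '\n' only, so '\r' is matched like any other character. The variable names are illustrative.

```python
import re

text_cr = 'abc\refg'   # carriage return between the halves
text_lf = 'abc\nefg'   # line feed between the halves

# Default behaviour: '.' matches '\r' but not '\n'.
default_cr = re.search('abc.efg', text_cr) is not None
default_lf = re.search('abc.efg', text_lf) is not None

# With re.DOTALL, '.' matches '\n' as well.
dotall_lf = re.search('abc.efg', text_lf, re.DOTALL) is not None

print(default_cr, default_lf, dotall_lf)
```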
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2011-05-29 09:35 -0500 |
| Message-ID | <mailman.2223.1306679725.9059.python-list@python.org> |
| In reply to | #6541 |
On 2011.05.29 09:18 AM, Steven D'Aprano wrote:
> >> What makes you think it shouldn't match?
> >
> > AFAIK, dots aren't supposed to match carriage returns or any other
> > whitespace characters.
>
> They won't match *newlines* \n unless you pass the DOTALL flag, but they
> do match whitespace:
>
> >>> re.search('abc.efg', '----abc efg----').group()
> 'abc efg'
> >>> re.search('abc.efg', '----abc\refg----').group()
> 'abc\refg'
> >>> re.search('abc.efg', '----abc\nefg----') is None
> True
I got things mixed up there (was thinking whitespace instead of
newlines), but I thought dots aren't supposed to match '\r' (carriage
return). Why is '\r' not considered a newline character?
| From | John S <jstrickler@gmail.com> |
|---|---|
| Date | 2011-05-29 08:48 -0700 |
| Message-ID | <1b8d81c1-ab87-4059-ad55-9f4a39331e7d@u26g2000vby.googlegroups.com> |
| In reply to | #6542 |
On May 29, 10:35 am, Andrew Berg <bahamutzero8...@gmail.com> wrote:
> On 2011.05.29 09:18 AM, Steven D'Aprano wrote:
> > >> What makes you think it shouldn't match?
>
> > > AFAIK, dots aren't supposed to match carriage returns or any other
> > > whitespace characters.
>
> I got things mixed up there (was thinking whitespace instead of
> newlines), but I thought dots aren't supposed to match '\r' (carriage
> return). Why is '\r' not considered a newline character?
Dots don't match end-of-line-for-your-current-OS is how I think of
it.
While I almost always nod my head at Steven D'Aprano's comments, in
this case I have to say that if you just want to grab something from a
chunk of HTML, full-blown HTML parsers are overkill. True, malformed
HTML can throw you off, but they can also throw a parser off.
I could not make your regex work on my Linux box with Python 2.6.
In your case, and because x264 might change their HTML, I suggest the
following code, which works great on my system. YMMV. I changed your
newline matches to use \s and put some capturing parentheses around
the revision number, so you could grab it.
>>> import urllib2
>>> import re
>>>
>>> content = urllib2.urlopen("http://x264.nl/x264_main.php").read()
>>>
>>> rx_x264version= re.compile(r"http://x264\.nl/x264/64bit/8bit_depth/revision\s*(\d{4})\s*/x264\s*\.exe")
>>>
>>> m = rx_x264version.search(content)
>>> if m:
... print m.group(1)
...
1995
>>>
\s is your friend -- matches space, tab, newline, or carriage return.
\s* says match 0 or more whitespace characters, which is what's needed
here in case the web site decides to *not* put whitespace in the middle
of a URL...
As Steven said, when you want to match a dot, it needs to be escaped,
although it will work by accident much of the time. Also, be sure to
use a raw string when composing REs, so you don't run into backslash
issues.
HTH,
John Strickler
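A Python 3 sketch of the same idea that runs without network access, assuming the two hand-written URL strings below stand in for what the page might contain (with and without embedded CR LF). It also shows one way to turn the matched text into a usable download URL.

```python
import re

rx_x264version = re.compile(
    r'http://x264\.nl/x264/64bit/8bit_depth/revision\s*(\d{4})\s*/x264\s*\.exe')

# Hypothetical page fragments: one with embedded line breaks, one without.
with_breaks = 'http://x264.nl/x264/64bit/8bit_depth/revision\r\n1995\r\n/x264\r\n\r\n.exe'
without_breaks = 'http://x264.nl/x264/64bit/8bit_depth/revision1995/x264.exe'

# group(1) is the text captured by the parentheses around \d{4}.
versions = [rx_x264version.search(s).group(1) for s in (with_breaks, without_breaks)]

# group(0) is the whole match; removing the whitespace gives a clean URL.
clean_url = re.sub(r'\s+', '', rx_x264version.search(with_breaks).group(0))

print(versions, clean_url)
```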
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2011-05-29 11:16 -0500 |
| Message-ID | <mailman.2226.1306685804.9059.python-list@python.org> |
| In reply to | #6548 |
On 2011.05.29 10:48 AM, John S wrote:
> Dots don't match end-of-line-for-your-current-OS is how I think of
> it.
IMO, the docs should say the dot matches any character except a line
feed ('\n'), since that is more accurate.
> True, malformed
> HTML can throw you off, but they can also throw a parser off.
That was part of my point. html.parser.HTMLParser from the standard
library will definitely not work on x264.nl's broken HTML, and fixing it
requires lxml (I'm working with Python 3; I've looked into
BeautifulSoup, and it does not work with Python 3 at all). Admittedly,
fixing x264.nl's HTML only requires one or two lines of code, but really
nasty HTML might require quite a bit of work.
> In your case, and because x264 might change their HTML, I suggest the
> following code, which works great on my system. YMMV. I changed your
> newline matches to use \s and put some capturing parentheses around
> the revision number, so you could grab it.
I've been meaning to learn how to use parenthesis groups.
> Also, be sure to
> use a raw string when composing REs, so you don't run into backslash
> issues.
How would I do that when grabbing strings from a config file (via the
configparser module)? Or rather, if I have a predefined variable
containing a string, how do I change it into a raw string?
| From | John S <jstrickler@gmail.com> |
|---|---|
| Date | 2011-05-29 09:45 -0700 |
| Message-ID | <4a265c16-52fb-4483-8bc2-a853c6d18220@dr5g2000vbb.googlegroups.com> |
| In reply to | #6549 |
On May 29, 12:16 pm, Andrew Berg <bahamutzero8...@gmail.com> wrote:
>
> I've been meaning to learn how to use parenthesis groups.
> > Also, be sure to
> > use a raw string when composing REs, so you don't run into backslash
> > issues.
>
> How would I do that when grabbing strings from a config file (via the
> configparser module)? Or rather, if I have a predefined variable
> containing a string, how do I change it into a raw string?
When reading the RE from a file it's not an issue. Only literal strings
can be raw. If the data is in a file, the data will not be parsed by the
Python interpreter. This was just a general warning to anyone working
with REs. It didn't apply in this case.
--john strickler
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-05-30 03:57 +1000 |
| Message-ID | <mailman.2230.1306691862.9059.python-list@python.org> |
| In reply to | #6548 |
On Mon, May 30, 2011 at 2:16 AM, Andrew Berg <bahamutzero8825@gmail.com> wrote:
>> Also, be sure to
>> use a raw string when composing REs, so you don't run into backslash
>> issues.
> How would I do that when grabbing strings from a config file (via the
> configparser module)? Or rather, if I have a predefined variable
> containing a string, how do I change it into a raw string?
>
"Raw string" is slightly inaccurate. The Python "raw string literal"
syntax is just another form of string literal:
'apostrophe-delimited string'
"quote-delimited string"
"""triple-quote string
which may go over multiple lines"""
'''triple-apostrophe string'''
r'raw apostrophe string'
r"raw quote string"
They're all equivalent once you have the string object. The only
difference is how they appear in your source code. If you read something
from a config file, you get a string object directly, and you delimit it
with something else (end of line, or XML closing tag, or whatever), so
you don't have to worry about string quotes.
Chris Angelico
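A short sketch of the point about raw string literals, assuming a made-up one-key config held in `ini_text` (the section and key names mirror checklist.ini, but the text is built in memory here rather than read from disk). A value read by configparser already contains its backslashes verbatim, so nothing needs to be converted before it is passed to `re`.

```python
import configparser
import re

# A raw literal and an escaped literal spelling the same characters are equal.
literals_equal = r'\d{4}' == '\\d{4}'

# r'\n' is two characters (backslash, n); '\n' is a single line feed.
raw_len = len(r'\n')
cooked_len = len('\n')

# Hypothetical config text; on disk the value would read: versionregex = \d{4}
ini_text = '[x264_64]\nversionregex = \\d{4}\n'
config = configparser.ConfigParser()
config.read_string(ini_text)

# The value arrives with its backslash intact and can be compiled directly.
pattern = config['x264_64']['versionregex']
found = re.findall(pattern, 'revision1995')

print(literals_equal, raw_len, cooked_len, pattern, found)
```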
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2011-05-29 11:19 -0400 |
| Message-ID | <roy-7C21B7.11191129052011@news.panix.com> |
| In reply to | #6536 |
In article <mailman.2222.1306676482.9059.python-list@python.org>,
Andrew Berg <bahamutzero8825@gmail.com> wrote:
> Kodos is written in Python and uses Python's regex engine. In fact, it
> is specifically intended to debug Python regexes.
Named after the governor of Tarsus IV?
| From | Andrew Berg <bahamutzero8825@gmail.com> |
|---|---|
| Date | 2011-05-29 10:31 -0500 |
| Message-ID | <mailman.2224.1306683109.9059.python-list@python.org> |
| In reply to | #6545 |
On 2011.05.29 10:19 AM, Roy Smith wrote:
> Named after the governor of Tarsus IV?
Judging by the graphic at http://kodos.sourceforge.net/help/kodos.html ,
it's named after the Simpsons character.
| From | Thomas 'PointedEars' Lahn <PointedEars@web.de> |
|---|---|
| Date | 2011-05-29 21:06 +0200 |
| Message-ID | <4248834.rdbgypaU67@PointedEars.de> |
| In reply to | #6546 |
Andrew Berg wrote:
> On 2011.05.29 10:19 AM, Roy Smith wrote:
>> Named after the governor of Tarsus IV?
> Judging by the graphic at http://kodos.sourceforge.net/help/kodos.html ,
> it's named after the Simpsons character.
<OT>
I don't think that's a coincidence; both are from other planets and both
are rather evil[tm]. Kodos the Executioner, arguably human, became a
dictator who had thousands killed (by his own account, not to let the
rest die of hunger); Kodos the slimy extra-terrestrial is a conqueror
(and he likes to zap humans as well ;-))
[BTW, Tarsus IV, a planet where thousands (would) have died of hunger
and have died in executions was probably yet another hidden Star Trek
euphemism. I have found out that Tarsus is, among other things, the name
of a collection of bones in the human foot next to the heel. Bones as a
reference to death aside, see also Achilles for the heel. But I'm only
speculating here.]
</OT>
--
\\//, PointedEars (F'up2 trek)
Bitte keine Kopien per E-Mail. / Please do not Cc: me.