Groups > comp.lang.python > #76389 > unrolled thread
| Started by | Philipp Kraus <philipp.kraus@flashpixx.de> |
|---|---|
| First post | 2014-08-16 02:27 +0200 |
| Last post | 2014-08-23 23:13 +0200 |
| Articles | 8 — 4 participants |
string encoding regex problem Philipp Kraus <philipp.kraus@flashpixx.de> - 2014-08-16 02:27 +0200
Re: string encoding regex problem Roy Smith <roy@panix.com> - 2014-08-15 20:48 -0400
Re: string encoding regex problem Philipp Kraus <philipp.kraus@flashpixx.de> - 2014-08-16 04:08 +0200
Re: string encoding regex problem Roy Smith <roy@panix.com> - 2014-08-15 22:14 -0400
Re: string encoding regex problem Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-16 14:35 +1000
Re: string encoding regex problem Peter Otten <__peter__@web.de> - 2014-08-16 11:01 +0200
Re: string encoding regex problem Philipp Kraus <philipp.kraus@flashpixx.de> - 2014-08-23 22:46 +0200
Re: string encoding regex problem Peter Otten <__peter__@web.de> - 2014-08-23 23:13 +0200
| From | Philipp Kraus <philipp.kraus@flashpixx.de> |
|---|---|
| Date | 2014-08-16 02:27 +0200 |
| Subject | string encoding regex problem |
| Message-ID | <lsm8ic$j90$1@online.de> |
Hello,
I have defined a function with:
def URLReader(url) :
    try :
        f = urllib2.urlopen(url)
        data = f.read()
        f.close()
    except Exception, e :
        raise MyError.StopError(e)
    return data
which gets the HTML source code from a URL. I use this to extract part of
an HTML document without any HTML parsing, so I call (I would like to
get the download link of the Boost library):
found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
                   Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/") )
if found == None :
    raise MyError.StopError("Boost Download URL not found")
But found is always None, so I never get a match. I cannot find the
error in my code.
Thanks for your help
Phil
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-08-15 20:48 -0400 |
| Message-ID | <roy-7B58DA.20484615082014@news.panix.com> |
| In reply to | #76389 |
In article <lsm8ic$j90$1@online.de>,
Philipp Kraus <philipp.kraus@flashpixx.de> wrote:
> found = re.search( "<a
> href=\"/projects/boost/files/latest/download\?source=files\"
> title=\"/boost/(.*)",
> Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
> )
> if found == None :
> raise MyError.StopError("Boost Download URL not found")
>
> But found is always None, so I cannot get the correct match. I didn't
> find the error in my code.
I would start by breaking this down into pieces. Something like:
> data = Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
> print data
> found = re.search( "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
>                    data)
> if found == None :
>     raise MyError.StopError("Boost Download URL not found")
Now at least you get to look at what URLReader() returned. Did it
return what you expected? If not, then there might be something wrong
in your URLReader() function. If it is what you expected, then I would
start looking at the pattern to see if it's correct. Either way, you've
managed to halve the size of the problem.
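A minimal sketch of that divide-and-check approach, written for Python 3 rather than the Python 2 used in this thread, and run against a hard-coded sample instead of a live download. SAMPLE_HTML and show_context() are hypothetical names introduced only for illustration, and the real page content may differ:

```python
import re

# Hypothetical stand-in for the downloaded page; the real HTML may differ.
SAMPLE_HTML = (
    '<html><body>\n'
    '<a href="/projects/boost/files/latest/download?source=files" '
    'title="/boost-docs/1.56.0/boost_1_56_pdf.7z: released on 2014-08-14 16:35:00 UTC">\n'
    '</body></html>\n'
)

# The pattern from the original post, written as a raw string.
PATTERN = r'<a href="/projects/boost/files/latest/download\?source=files" title="/boost/(.*)'


def show_context(data, needle, width=40):
    """Return the text around the first occurrence of needle, or None if absent."""
    pos = data.find(needle)
    if pos == -1:
        return None
    return data[max(0, pos - width):pos + len(needle) + width]


# Step 1: is the fixed part of the link present in the data at all?
context = show_context(SAMPLE_HTML, "/projects/boost/files/latest/download")

# Step 2: does the full pattern match the same data?
found = re.search(PATTERN, SAMPLE_HTML)

print(context is not None, found is None)
```

If step 1 succeeds and step 2 fails, as it does with this sample, the download and the link are fine and the remaining suspect is the part of the pattern after the link, here the title prefix.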
| From | Philipp Kraus <philipp.kraus@flashpixx.de> |
|---|---|
| Date | 2014-08-16 04:08 +0200 |
| Message-ID | <lsmeej$49n$1@online.de> |
| In reply to | #76390 |
On 2014-08-16 00:48:46 +0000, Roy Smith said:
> In article <lsm8ic$j90$1@online.de>,
> Philipp Kraus <philipp.kraus@flashpixx.de> wrote:
>
>> found = re.search( "<a
>> href=\"/projects/boost/files/latest/download\?source=files\"
>> title=\"/boost/(.*)",
>> Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
>> )
>> if found == None :
>> raise MyError.StopError("Boost Download URL not found")
>>
>> But found is always None, so I cannot get the correct match. I didn't
>> find the error in my code.
>
> I would start by breaking this down into pieces. Something like:
>
>> data = Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
>> print data
>> found = re.search( "<a
>> href=\"/projects/boost/files/latest/download\?source=files\"
>> title=\"/boost/(.*)",
>> data)
>> if found == None :
>> raise MyError.StopError("Boost Download URL not found")
>
> Now at least you get to look at what URLReader() returned. Did it
> return what you expected? If not, then there might be something wrong
> in your URLReader() function.
I have checked the result of the function (sorry, I forgot to mention
this in my first post). URLReader returns the HTML code of the URL, so
this seems to work correctly.
> If it is what you expected, then I would
> start looking at the pattern to see if it's correct. Either way, you've
> managed to halve the size of the problem.
The code worked correctly until last week, and I did not change the
pattern. My question is: can it be a problem with string encoding? Did I
mask (escape) the question mark and quotes correctly?
Phil
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-08-15 22:14 -0400 |
| Message-ID | <roy-0CDBF3.22141315082014@news.panix.com> |
| In reply to | #76391 |
In article <lsmeej$49n$1@online.de>,
Philipp Kraus <philipp.kraus@flashpixx.de> wrote:
> The code works till last week correctly, I don't change the pattern.
OK, so what did you change? Can you go back to last week's code and
compare it to what you have now to see what changed?
> My question is, can it be a problem with string encoding? Did I mask
> the question mark and quotes correctly?
The best thing to do with regular expressions is to use raw strings,
i.e. r'this is a string'. The nice thing about that is backslashes are
not special. It makes it about 1000% easier to write complicated
regular expressions. Simple ones are only 500% easier.
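To illustrate the raw-string advice, here is a small Python 3 sketch that writes part of the pattern from the original post twice. The variable names and the sample text are made up for the example. In an ordinary literal the backslash is doubled here, because newer Python versions warn about unrecognized escapes such as \? in non-raw strings:

```python
import re

# Ordinary string literal: the quotes and the backslash are escaped for Python.
plain = "href=\"/projects/boost/files/latest/download\\?source=files\""

# Raw string literal in single quotes: the same regex with nothing doubled.
raw = r'href="/projects/boost/files/latest/download\?source=files"'

# Both literals produce the identical pattern text.
print(plain == raw)

# Hypothetical sample text to search.
text = '<a href="/projects/boost/files/latest/download?source=files" title="x">'
print(re.search(raw, text) is not None)
```

Using single quotes around the raw string also removes the need to escape the double quotes that appear in the HTML.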
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-08-16 14:35 +1000 |
| Message-ID | <53eedf8f$0$29984$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #76391 |
Philipp Kraus wrote:
> The code works till last week correctly, I don't change the pattern. My
> question is, can it be
> a problem with string encoding? Did I mask the question mark and quotes
> correctly?
If you didn't change the code, how could the *exact same code* not mask the
question mark last week, but this week suddenly start masking it, despite
not changing?
There are three things that can cause a change in behaviour:
- the re module has changed;
- the pattern has changed;
- the text you are searching has changed.
Have you removed the re module and replaced it with a different one? Did you
update Python to a new version?
Have you changed the regex search pattern?
Has the text you are searching changed? Websites upgrade their HTML quite
frequently. Perhaps the Boost website has changed enough to break your
regex.
--
Steven
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-08-16 11:01 +0200 |
| Message-ID | <mailman.13048.1408179734.18130.python-list@python.org> |
| In reply to | #76391 |
Philipp Kraus wrote:
> The code works till last week correctly, I don't change the pattern.
Websites' contents and structure change sometimes.
> My question is, can it be a problem with string encoding?
Your regex is all-ascii. So an encoding problem is very unlikely.
> found = re.search( "<a
> href=\"/projects/boost/files/latest/download\?source=files\"
> title=\"/boost/(.*)",
> data)
> Did I mask the question mark and quotes
> correctly?
Yes.
A quick check...
>>> data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()
>>> re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)
['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z: released on 2014-08-14 16:35:00 UTC">']
...reveals that the matching link has "/boost-docs/" in its title, so the
site contents probably did change.
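One way to make such a change visible is to capture the title instead of hard-coding its "/boost/" prefix. The following Python 3 sketch assumes the link looks like the findall() output above. The html sample and the group name "path" are assumptions for illustration, and the live page may be structured differently:

```python
import re

# Sample line modelled on the findall() output above; the real page may differ.
html = ('<a href="/projects/boost/files/latest/download?source=files" '
        'title="/boost-docs/1.56.0/boost_1_56_pdf.7z: released on 2014-08-14 16:35:00 UTC">')

# Capture whatever the title holds up to the first colon,
# instead of requiring it to start with "/boost/".
pattern = re.compile(
    r'<a\s+href="/projects/boost/files/latest/download\?source=files"\s+'
    r'title="(?P<path>[^":]+)'
)

match = pattern.search(html)
path = match.group("path") if match else None
print(path)
```

The captured path can then be checked explicitly, for example with path.startswith("/boost/"), so that a change on the site produces a clear message rather than a silent None.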
| From | Philipp Kraus <philipp.kraus@flashpixx.de> |
|---|---|
| Date | 2014-08-23 22:46 +0200 |
| Message-ID | <ltaujl$3kj$1@online.de> |
| In reply to | #76401 |
Hi,
On 2014-08-16 09:01:57 +0000, Peter Otten said:
> Philipp Kraus wrote:
>
>> The code works till last week correctly, I don't change the pattern.
>
> Websites' contents and structure change sometimes.
>
>> My question is, can it be a problem with string encoding?
>
> Your regex is all-ascii. So an encoding problem is very unlikely.
>
>> found = re.search( "<a
>> href=\"/projects/boost/files/latest/download\?source=files\"
>> title=\"/boost/(.*)",
>> data)
>
>> Did I mask the question mark and quotes
>> correctly?
>
> Yes.
>
> A quick check...
>
> >>> data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()
> >>> re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)
> ['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z: released on 2014-08-14 16:35:00 UTC">']
>
> ...reveals that the matching link has "/boost-docs/" in its title, so the
> site contents probably did change.
I have created a short script:
---------
#!/usr/bin/env python
import re, urllib2
def URLReader(url) :
    f = urllib2.urlopen(url)
    data = f.read()
    f.close()
    return data
print re.match( "\<small\ \>.*\<\/small\>",
                URLReader("http://sourceforge.net/projects/boost/") )
---------
Within the data the string "<small>boost_1_56_0.tar.gz</small>" should
be matched, but I always get None from re.match, and re.search also
returns None.
I have tested the regex at http://regex101.com/ with the HTML code,
and there the regex matches.
Can you please help me fix the problem? I don't understand why the
match returns None.
Thanks
Phil
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2014-08-23 23:13 +0200 |
| Message-ID | <mailman.13356.1408828401.18130.python-list@python.org> |
| In reply to | #76898 |
Philipp Kraus wrote:
> I have create a short script:
>
> ---------
> #!/usr/bin/env python
>
> import re, urllib2
>
>
> def URLReader(url) :
>     f = urllib2.urlopen(url)
>     data = f.read()
>     f.close()
>     return data
>
>
> print re.match( "\<small\ \>.*\<\/small\>",
> URLReader("http://sourceforge.net/projects/boost/") )
> ---------
>
> Within the data the string "<small>boost_1_56_0.tar.gz</small>" should
> be machted, but I get always a None result on the re.match, re.search
> returns also a None.
>>> help(re.match)
Help on function match in module re:
match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.
As the string doesn't start with text that matches your regex, re.match()
is clearly the wrong function, but re.search() works for me:
>>> import re, urllib2
>>>
>>>
>>> def URLReader(url) :
...     f = urllib2.urlopen(url)
...     data = f.read()
...     f.close()
...     return data
...
>>> data = URLReader("http://sourceforge.net/projects/boost/")
>>> re.search("\<small\ \>.*\<\/small\>", data)
<_sre.SRE_Match object at 0x7f282dd58718>
>>> _.group()
'<small >boost_1_56_pdf.7z</small>'
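The difference between the two calls can be reproduced without a download. This Python 3 sketch uses a made-up page fragment (the data string is an assumption, kept only close enough to the output above to include the space in "<small >"):

```python
import re

# Hypothetical page fragment; note the space in "<small >".
data = '<html><body><p>Download</p><small >boost_1_56_pdf.7z</small></body></html>'
pattern = r'<small >.*?</small>'

at_start = re.match(pattern, data)    # only tries to match at position 0
anywhere = re.search(pattern, data)   # scans the string for the first match

print(at_start)           # None
print(anywhere.group())   # <small >boost_1_56_pdf.7z</small>
```

The pattern here uses a non-greedy .*? so that it stops at the first closing tag, which is a small deviation from the pattern in the post.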
> I have tested the regex under http://regex101.com/ with the HTML code
> and on the page the regex is matched.
>
> Can you help me please to fix the problem, I don't understand that the
> match returns None