string encoding regex problem

Started by	Philipp Kraus <philipp.kraus@flashpixx.de>
First post	2014-08-16 02:27 +0200
Last post	2014-08-23 23:13 +0200
Articles	8 — 4 participants

  string encoding regex problem Philipp Kraus <philipp.kraus@flashpixx.de> - 2014-08-16 02:27 +0200
    Re: string encoding regex problem Roy Smith <roy@panix.com> - 2014-08-15 20:48 -0400
      Re: string encoding regex problem Philipp Kraus <philipp.kraus@flashpixx.de> - 2014-08-16 04:08 +0200
        Re: string encoding regex problem Roy Smith <roy@panix.com> - 2014-08-15 22:14 -0400
        Re: string encoding regex problem Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-16 14:35 +1000
        Re: string encoding regex problem Peter Otten <__peter__@web.de> - 2014-08-16 11:01 +0200
          Re: string encoding regex problem Philipp Kraus <philipp.kraus@flashpixx.de> - 2014-08-23 22:46 +0200
            Re: string encoding regex problem Peter Otten <__peter__@web.de> - 2014-08-23 23:13 +0200

#76389 — string encoding regex problem

From	Philipp Kraus <philipp.kraus@flashpixx.de>
Date	2014-08-16 02:27 +0200
Subject	string encoding regex problem
Message-ID	<lsm8ic$j90$1@online.de>

Hello,

I have defined a function with:

def URLReader(url) :
    try :
        f = urllib2.urlopen(url)
        data = f.read()
        f.close()
    except Exception, e :
        raise MyError.StopError(e)
    return data

which gets the HTML source code from a URL. I use this to get part of
an HTML document without any HTML parsing, so I call the following (I
would like to get the download link of the Boost library):

found = re.search(
    "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
    Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
)
if found == None :
    raise MyError.StopError("Boost Download URL not found")

But found is always None, so I never get a match. I can't find the
error in my code.

Thanks for your help

Phil

#76390

From	Roy Smith <roy@panix.com>
Date	2014-08-15 20:48 -0400
Message-ID	<roy-7B58DA.20484615082014@news.panix.com>
In reply to	#76389

In article <lsm8ic$j90$1@online.de>,
 Philipp Kraus <philipp.kraus@flashpixx.de> wrote:

> found = re.search( "<a 
> href=\"/projects/boost/files/latest/download\?source=files\" 
> title=\"/boost/(.*)", 
> Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/") 
> )
> if found == None :
> 	raise MyError.StopError("Boost Download URL not found")
> 
> But found is always None, so I cannot get the correct match. I didn't 
> find the error in my code.

I would start by breaking this down into pieces.  Something like:

> data = Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
> print data
> found = re.search(
>     "<a href=\"/projects/boost/files/latest/download\?source=files\" title=\"/boost/(.*)",
>     data)
> if found == None :
>     raise MyError.StopError("Boost Download URL not found")

Now at least you get to look at what URLReader() returned.  Did it 
return what you expected?  If not, then there might be something wrong 
in your URLReader() function.  If it is what you expected, then I would 
start looking at the pattern to see if it's correct.  Either way, you've 
managed to halve the size of the problem.
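
One way to follow this advice is to keep the download and the pattern match in separate functions, so the match can be tried against a saved copy of the page. A minimal sketch, assuming Python 3; `find_boost_title` and the two sample strings are hypothetical and not taken from the live page:

```python
import re

# Same pattern as in the original post, written as a raw string.
PATTERN = re.compile(
    r'<a href="/projects/boost/files/latest/download\?source=files" title="/boost/(.*)'
)

def find_boost_title(html):
    """Return the text captured after title="/boost/, or None if nothing matches."""
    found = PATTERN.search(html)
    return found.group(1) if found is not None else None

# Hypothetical snippets standing in for what URLReader() might return.
expected_page = '<a href="/projects/boost/files/latest/download?source=files" title="/boost/1.55.0/x.tar.gz">'
changed_page = '<a href="/projects/boost/files/latest/download?source=files" title="/other/x.tar.gz">'

print(find_boost_title(expected_page))  # the capture runs to the end of the line
print(find_boost_title(changed_page))   # None
```

With the matching step isolated like this, a None result on a saved page points at the pattern or the page contents rather than at the download code.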

#76391

From	Philipp Kraus <philipp.kraus@flashpixx.de>
Date	2014-08-16 04:08 +0200
Message-ID	<lsmeej$49n$1@online.de>
In reply to	#76390

On 2014-08-16 00:48:46 +0000, Roy Smith said:

> In article <lsm8ic$j90$1@online.de>,
>  Philipp Kraus <philipp.kraus@flashpixx.de> wrote:
> 
>> found = re.search( "<a
>> href=\"/projects/boost/files/latest/download\?source=files\"
>> title=\"/boost/(.*)",
>> Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
>> )
>> if found == None :
>> 	raise MyError.StopError("Boost Download URL not found")
>> 
>> But found is always None, so I cannot get the correct match. I didn't
>> find the error in my code.
> 
> I would start by breaking this down into pieces.  Something like:
> 
>> data = Utilities.URLReader("http://sourceforge.net/projects/boost/files/boost/")
>> print data
>> found = re.search( "<a
>> href=\"/projects/boost/files/latest/download\?source=files\"
>> title=\"/boost/(.*)",
>> data)
>> if found == None :
>> raise MyError.StopError("Boost Download URL not found")
> 
> Now at least you get to look at what URLReader() returned.  Did it
> return what you expected?  If not, then there might be something wrong
> in your URLReader() function.

I have checked the result of URLReader (sorry, I forgot to mention this
in my first post). It returns the HTML code of the URL, so this seems
to work correctly.

>  If it is what you expected, then I would
> start looking at the pattern to see if it's correct.  Either way, you've
> managed to halve the size of the problem.

The code worked correctly until last week, and I didn't change the
pattern. My question is, can it be a problem with string encoding? Did
I mask the question mark and quotes correctly?

Phil

#76392

From	Roy Smith <roy@panix.com>
Date	2014-08-15 22:14 -0400
Message-ID	<roy-0CDBF3.22141315082014@news.panix.com>
In reply to	#76391

In article <lsmeej$49n$1@online.de>,
 Philipp Kraus <philipp.kraus@flashpixx.de> wrote:

> The code worked correctly until last week, and I didn't change the pattern.

OK, so what did you change?  Can you go back to last week's code and 
compare it to what you have now to see what changed?

> My question is, can it be a problem with string encoding? Did I mask 
> the question mark and quotes correctly?

The best thing to do with regular expressions is to use raw strings, 
i.e. r'this is a string'.  The nice thing about that is backslashes are 
not special.  It makes it about 1000% easier to write complicated 
regular expressions.  Simple ones are only 500% easier.
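
To illustrate the raw-string point, here is a minimal sketch, assuming Python 3. The `html` line and its version number are made up for illustration. The first pattern doubles the backslash before `?` to be explicit, because recent Python 3 releases warn about unrecognised escapes such as `\?` in ordinary strings:

```python
import re

# Hypothetical HTML fragment, modelled on the link discussed in this thread.
html = '<a href="/projects/boost/files/latest/download?source=files" title="/boost/1.55.0/boost_1_55_0.tar.gz">'

# Ordinary string: every quote needs \" and the regex escape needs \\?
escaped = "<a href=\"/projects/boost/files/latest/download\\?source=files\" title=\"/boost/(.*)\""

# Raw string in single quotes: no quote escaping, and \? reaches the regex engine unchanged.
raw = r'<a href="/projects/boost/files/latest/download\?source=files" title="/boost/(.*)"'

# Both spellings produce the same pattern text.
print(escaped == raw)

match = re.search(raw, html)
print(match.group(1))
```

Both literals describe the same pattern; the raw form is simply easier to compare by eye with the HTML it is meant to match.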

#76394

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-08-16 14:35 +1000
Message-ID	<53eedf8f$0$29984$c3e8da3$5496439d@news.astraweb.com>
In reply to	#76391

Philipp Kraus wrote:

> The code worked correctly until last week, and I didn't change the pattern. My
> question is, can it be
> a problem with string encoding? Did I mask the question mark and quotes
> correctly?

If you didn't change the code, how could the *exact same code* not mask the
question mark last week, but this week suddenly start masking it, despite
not changing?

There are three things that can cause a change in behaviour:

- the re module has changed;

- the pattern has changed;

- the text you are searching has changed.

Have you removed the re module and replaced it with a different one? Did you
update Python to a new version?

Have you changed the regex search pattern?

Has the text you are searching changed? Websites upgrade their HTML quite
frequently. Perhaps the Boost website has changed enough to break your
regex.

-- 
Steven

#76401

From	Peter Otten <__peter__@web.de>
Date	2014-08-16 11:01 +0200
Message-ID	<mailman.13048.1408179734.18130.python-list@python.org>
In reply to	#76391

Philipp Kraus wrote:

> The code worked correctly until last week, and I didn't change the pattern.

Websites' contents and structure change sometimes.

> My question is, can it be a problem with string encoding? 

Your regex is all-ascii. So an encoding problem is very unlikely.

> found = re.search( "<a 
> href=\"/projects/boost/files/latest/download\?source=files\" 
> title=\"/boost/(.*)",
> data)

> Did I mask the question mark and quotes
> correctly?

Yes.

A quick check...

>>> data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()
>>> re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)
['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z:  released on 2014-08-14 16:35:00 UTC">']

...reveals that the matching link has "/boost-docs/" in its title, so the
 site contents probably did change.
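
The mismatch can be reproduced offline. The following is a sketch, assuming Python 3; the `link` string is the output shown above with the leading `<a href="` restored by hand (an assumption about the surrounding markup), and the `relaxed` pattern is only one possible way to loosen the match, not a recommended fix:

```python
import re

# Link text from the check above; the '<a href="' prefix is an assumption.
link = ('<a href="/projects/boost/files/latest/download?source=files" '
        'title="/boost-docs/1.56.0/boost_1_56_pdf.7z:  released on 2014-08-14 16:35:00 UTC">')

# Pattern from the original post, as a raw string.
original = r'<a href="/projects/boost/files/latest/download\?source=files" title="/boost/(.*)'

# Hypothetical looser pattern: capture the first path component instead of fixing it to "boost".
relaxed = r'<a href="/projects/boost/files/latest/download\?source=files" title="/([^/"]+)/([^"]*)"'

# The title starts with /boost-docs/, not /boost/, so the original pattern finds nothing.
print(re.search(original, link))

found = re.search(relaxed, link)
print(found.group(1))
```

The original pattern returns None here, while the looser one reports `boost-docs` as the first path component, which matches the observation that the title changed.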

#76898

From	Philipp Kraus <philipp.kraus@flashpixx.de>
Date	2014-08-23 22:46 +0200
Message-ID	<ltaujl$3kj$1@online.de>
In reply to	#76401

Hi,

On 2014-08-16 09:01:57 +0000, Peter Otten said:

> Philipp Kraus wrote:
> 
>> The code works till last week correctly, I don't change the pattern. 
> 
> Websites' contents and structure change sometimes.
> 
>> My question is, can it be a problem with string encoding? 
> 
> Your regex is all-ascii. So an encoding problem is very unlikely.
> 
>> found = re.search( "<a 
>> href=\"/projects/boost/files/latest/download\?source=files\" 
>> title=\"/boost/(.*)",
>> data)
> 
>> Did I mask the question mark and quotes
>> correctly?
> 
> Yes.
> 
> A quick check...
> 
> >>> data = urllib.urlopen("http://sourceforge.net/projects/boost/files/boost/").read()
> >>> re.compile("/projects/boost/files/latest/download\?source=files.*?>").findall(data)
> ['/projects/boost/files/latest/download?source=files" title="/boost-docs/1.56.0/boost_1_56_pdf.7z:  released on 2014-08-14 16:35:00 UTC">']
> 
> ...reveals that the matching link has "/boost-docs/" in its title, so the
>  site contents probably did change. 

I have created a short script:

---------
#!/usr/bin/env python

import re, urllib2


def URLReader(url) :
    f = urllib2.urlopen(url)
    data = f.read()
    f.close()
    return data


print re.match( "\<small\ \>.*\<\/small\>", 
URLReader("http://sourceforge.net/projects/boost/") )
---------

Within the data, the string "<small>boost_1_56_0.tar.gz</small>" should
be matched, but re.match always gives me None, and re.search also
returns None.
I have tested the regex at http://regex101.com/ with the HTML code,
and on that page the regex matches.

Can you please help me fix the problem? I don't understand why the
match returns None.

Thanks

Phil

#76901

From	Peter Otten <__peter__@web.de>
Date	2014-08-23 23:13 +0200
Message-ID	<mailman.13356.1408828401.18130.python-list@python.org>
In reply to	#76898

Philipp Kraus wrote:

> I have created a short script:
> 
> ---------
> #!/usr/bin/env python
> 
> import re, urllib2
> 
> 
> def URLReader(url) :
>     f = urllib2.urlopen(url)
>     data = f.read()
>     f.close()
>     return data
> 
> 
> print re.match( "\<small\ \>.*\<\/small\>",
> URLReader("http://sourceforge.net/projects/boost/") )
> ---------
> 
> Within the data, the string "<small>boost_1_56_0.tar.gz</small>" should
> be matched, but re.match always gives me None, and re.search also
> returns None.

>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

As the string doesn't start with a match for your regex, re.match() is
clearly the wrong function, but re.search() works for me:

>>> import re, urllib2
>>> 
>>> 
>>> def URLReader(url) :
...     f = urllib2.urlopen(url)
...     data = f.read()
...     f.close()
...     return data
... 
>>> data = URLReader("http://sourceforge.net/projects/boost/")
>>> re.search("\<small\ \>.*\<\/small\>", data)
<_sre.SRE_Match object at 0x7f282dd58718>
>>> _.group()
'<small >boost_1_56_pdf.7z</small>'
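
The difference between the two functions can also be seen without fetching the page. A minimal sketch, assuming Python 3; the `data` string is a made-up stand-in for the page, and `tolerant` is a hypothetical variant that also accepts `<small>` without the space that the original pattern requires:

```python
import re

# Hypothetical page: the <small > element is not at the very start of the text.
data = '<html><body><small >boost_1_56_pdf.7z</small></body></html>'
pattern = r'\<small\ \>.*\<\/small\>'

# re.match() only tries position 0, where the text is '<html>', so it fails.
print(re.match(pattern, data))

# re.search() scans forward until the pattern matches somewhere.
print(re.search(pattern, data).group())

# The escaped space in the pattern is significant: '<small>' without it does not match.
print(re.search(pattern, '<small>boost_1_56_0.tar.gz</small>'))

# Hypothetical variant that allows optional whitespace before '>'.
tolerant = r'<small\s*>(.*?)</small>'
print(re.search(tolerant, '<small>boost_1_56_0.tar.gz</small>').group(1))
```

Note that the pattern only matches `<small >` with a space, which is what the session above returned; a page containing `<small>` without the space would need something like the tolerant variant.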


> I have tested the regex under http://regex101.com/ with the HTML code
> and on the page the regex is matched.
> 
> Can you help me please to fix the problem, I don't understand that the
> match returns None
