
Suitable Python code to scrape specific details from web pages.

Started by	Simon Evans <musicalhacksaw@yahoo.co.uk>
First post	2014-08-12 13:00 -0700
Last post	2014-08-13 14:53 +0000
Articles	9 — 7 participants


  Suitable Python code to scrape specific details from  web pages. Simon Evans <musicalhacksaw@yahoo.co.uk> - 2014-08-12 13:00 -0700
    Re: Suitable Python code to scrape specific details from  web pages. Rob Gaddi <rgaddi@technologyhighland.invalid> - 2014-08-12 13:11 -0700
    Re: Suitable Python code to scrape specific details from  web pages. Roy Smith <roy@panix.com> - 2014-08-12 17:28 -0400
      Re: Suitable Python code to scrape specific details from  web pages. alex23 <wuwei23@gmail.com> - 2014-08-18 15:04 +1000
    Re: Suitable Python code to scrape specific details from  web pages. Simon Evans <musicalhacksaw@yahoo.co.uk> - 2014-08-12 15:44 -0700
      Re: Suitable Python code to scrape specific details from  web pages. Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-13 10:04 +1000
        Re: Suitable Python code to scrape specific details from  web pages. Roy Smith <roy@panix.com> - 2014-08-12 20:30 -0400
      Re: Suitable Python code to scrape specific details from  web pages. Peter Pearson <ppearson@nowhere.invalid> - 2014-08-13 00:50 +0000
    Re: Suitable Python code to scrape specific details from  web pages. Denis McMahon <denismfmcmahon@gmail.com> - 2014-08-13 14:53 +0000

#76147 — Suitable Python code to scrape specific details from web pages.

From	Simon Evans <musicalhacksaw@yahoo.co.uk>
Date	2014-08-12 13:00 -0700
Subject	Suitable Python code to scrape specific details from web pages.
Message-ID	<a8f10c4f-d4a0-48ed-ae92-2a43e9a094c3@googlegroups.com>

Dear Programmers,
I have been looking at Chris Reeves's YouTube 'Web Scraping Tutorials'. I have tried a few of his Python programs at the Python 2.7 command prompt, but altered them so that, instead of fetching data from, say, the Dow Jones index, they fetch the details I would like to collect from the 'Racing Post' on a daily basis. However, in the examples below the result is not the information I am seeking: instead of returning the given odds on a horse, the code only returns [], which isn't much use.
I would be glad if you could tell me where I am going wrong. 
Yours faithfully
Simon Evans.
--------------------------------------------------------------------------------
>>> import urllib
>>> import re
>>> htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048r_date=2014-05-08#raceTabs=sc_")
>>> htmltext = htmlfile.read()
>>> regex = '<strong>1<a href="http://www.racingpost.com/horses/horse_home.sd?horse_id=758752"onclick="scorecards.send(&quot;horse_name&quot:):return Html.popup(this,{width:695,height:800})"title="Full details about this HORSE">Lively Baron</a>9/4F</strong><br/>'
>>> pattern = re.compile(regex)
>>> odds=re.findall(pattern,htmltext)
>>> print odds
[]
>>>
--------------------------------------------------------------------------------
>>> import urllib
>>> import re
>>> htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048r_date=2014-05-08#raceTabs=sc_")
>>> htmltext = htmlfile.read()
>>> regex = '<a></a>'
>>> pattern = re.compile(regex)
>>> odds=re.findall(pattern,htmltext)
>>> print odds
[]
>>>
-------------------------------------------------------------------------------


#76148

From	Rob Gaddi <rgaddi@technologyhighland.invalid>
Date	2014-08-12 13:11 -0700
Message-ID	<20140812131147.5c99507c@rg.highlandtechnology.com>
In reply to	#76147


If you want web scraping, you want to use
http://www.crummy.com/software/BeautifulSoup/ .  End of story.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.


#76151

From	Roy Smith <roy@panix.com>
Date	2014-08-12 17:28 -0400
Message-ID	<roy-AD3509.17281512082014@news.panix.com>
In reply to	#76147

In article <a8f10c4f-d4a0-48ed-ae92-2a43e9a094c3@googlegroups.com>,
 Simon Evans <musicalhacksaw@yahoo.co.uk> wrote:

> Dear Programmers,
> I have been looking at Chris Reeves's YouTube 'Web Scraping Tutorials'.
> I have tried a few of his Python programs at the Python 2.7 command prompt,
> but altered them so that, instead of fetching data from, say, the Dow Jones
> index, they fetch the details I would like to collect from the 'Racing
> Post' on a daily basis. However, in the examples below the result is not
> the information I am seeking: instead of returning the given odds on a
> horse, the code only returns [], which isn't much use.
> I would be glad if you could tell me where I am going wrong. 

Rather than comment on your specific code (but, thank you for posting 
it), I'll make a couple of more generic suggestions.

First, if you're doing anything with fetching web pages, install the 
wonderful requests module (http://docs.python-requests.org/en/latest/).  
It's so much easier to work with than urllib.
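
One concrete thing a library takes off your hands is assembling the query string. The sketch below uses the standard library's `urllib.parse.urlencode` (Python 3) so that it runs without installing anything; `requests.get(url, params=...)` accepts the same kind of dict. The parameter names come from the Racing Post URLs posted in this thread, `build_card_url` is a hypothetical helper name, and nothing here touches the network.

```python
from urllib.parse import urlencode

# Base address taken from the URLs posted in this thread.
BASE_URL = "http://www.racingpost.com/horses2/cards/card.sd"


def build_card_url(race_id, r_date):
    """Return the card URL with race_id and r_date joined by '&'.

    build_card_url is a hypothetical helper, not part of any library.
    """
    query = urlencode({"race_id": race_id, "r_date": r_date})
    return BASE_URL + "?" + query


print(build_card_url(607466, "2014-08-13"))
```

Letting a function join the parameters means a separator such as the `&` between `race_id` and `r_date` cannot be left out by accident.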

Second, if you're going to be parsing web pages, trying to use regexes 
is a losing game.  You need something that knows how to parse HTML.  The 
canonical answer is lxml (http://lxml.de/), but Beautiful Soup 
(http://www.crummy.com/software/BeautifulSoup/) is less intimidating to 
use.
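
To illustrate the point about regexes, here is a minimal sketch. It uses the standard library's `html.parser` (Python 3) only so that it runs without installing anything; lxml or Beautiful Soup would be the usual tools. The two HTML snippets are made up for illustration, loosely modelled on the markup in the original post, and are not the real Racing Post page. `LinkText` and `link_texts` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Two renderings of the same made-up markup; only the attribute order
# and spacing differ.
HTML_A = '<strong>1<a href="/horse?id=1" title="Full details">Lively Baron</a>9/4F</strong>'
HTML_B = '<strong>1<a title="Full details"  href="/horse?id=1">Lively Baron</a>9/4F</strong>'

# A regex that spells out the exact tag text matches only one rendering.
EXACT = re.compile(
    re.escape('<a href="/horse?id=1" title="Full details">') + "(.*?)</a>"
)


class LinkText(HTMLParser):
    """Collect the text inside every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.names[-1] += data


def link_texts(html):
    """Return the text of each link, however its attributes are written."""
    parser = LinkText()
    parser.feed(html)
    parser.close()
    return parser.names


print(EXACT.findall(HTML_A), EXACT.findall(HTML_B))
print(link_texts(HTML_A), link_texts(HTML_B))
```

The exact-text regex finds the link in the first snippet and nothing in the second, while the parser returns the same link text for both.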


#76450

From	alex23 <wuwei23@gmail.com>
Date	2014-08-18 15:04 +1000
Message-ID	<lss1gu$qaf$1@dont-email.me>
In reply to	#76151

On 13/08/2014 7:28 AM, Roy Smith wrote:
> Second, if you're going to be parsing web pages, trying to use regexes
> is a losing game.  You need something that knows how to parse HTML.  The
> canonical answer is lxml (http://lxml.de/), but Beautiful Soup
> (http://www.crummy.com/software/BeautifulSoup/) is less intimidating to
> use.

lxml also has a BeautifulSoup parser, so you can easily mix and match 
approaches:

http://lxml.de/elementsoup.html


#76154

From	Simon Evans <musicalhacksaw@yahoo.co.uk>
Date	2014-08-12 15:44 -0700
Message-ID	<e2011de5-10fa-4de1-89fa-4e41882a6646@googlegroups.com>
In reply to	#76147

Dear Programmers, Thank you for your responses. I have installed 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup' book, but can't seem to make any progress with it; I am too thick to make much use of it. I was hoping I could scrape specified stuff off Web pages without using it. I have installed 'Requests' as well. Is there any code you can suggest that can access the sort of Web page values I have referred to, such as odds, names of runners and so on, from the 'inspect element' or 'source' HTML pages on www.Racingpost.com?


#76155

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-08-13 10:04 +1000
Message-ID	<53eaab7d$0$29979$c3e8da3$5496439d@news.astraweb.com>
In reply to	#76154

Simon Evans wrote:

> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup' book,
> but can't seem to make  any progress with it, I am too thick to make much
> use of it. I was hoping I could scrape specified stuff off Web pages
> without using it.

Yes, you can scrape stuff off web pages without programming. What you do is
you open the web page in your browser, then open a notebook and, with a
pencil or pen, copy the bits you read into the notebook.

If you're very skilled, you can avoid the pencil and paper and type directly
into a text editor on the computer.

But other than that, every website is different, so there is no short-cut to
web scraping. You need to customize the scraping code for each website you
scrape, and that means programming. Do you know how to program? Are you
interested in learning? If the answer is No and No, then I suggest you
pony up some money and pay somebody who already knows how to program to do
the job for you.

If the answer is No and Yes, then start at the beginning. Do some
programming tutorials, learn to program the basics before moving on to
something moderately difficult like web scraping.

If the answer is that you already know how to program, but just don't know
how to do web scraping, then stick with it and you'll get there. Web
scraping is tricky, but possible, and if you work hard at it you'll
succeed. Unless you're an experienced programmer with all the right skills,
don't expect this to be something you do in a few minutes. Depending on
your level of experience, you could expect to spend dozens of hours to
learn how to scrape a single website. (Fortunately, the second website will
probably be a little easier, and the third easier still. By the time you've
done a dozen, you'll wonder what the fuss was about.) 

By studying how other scraping programs work, and studying how your racing
pages store data, you should be able to put the two together and see how to
get the data you want. There's plenty of information to help you learn how
to web scrape, with or without BeautifulSoup:

https://startpage.com/do/search/?q=beautifulsoup+web+scraping

https://ixquick.com/do/search/?q=python+web+scraping+examples

https://duckduckgo.com/html/?q=requests%20python%20web%20scraping%20example

but no alternative to actually writing code.

> I have installed 'Requests' also, is there any code I
> can use that you can suggest that can access the sort of Web page values
> that I have referred to? such as odds, names of runners, stuff like that
> off the 'inspect element' or 'source' HTML pages, on www.Racingpost.com.

Specifically those pages? Doubtful.

If you are really lucky (1) somebody else has already done the programming,
(2) they've made their program available to others, and (3) you can find
that program on the Internet. Use the search engine of your choice to
search for it.

-- 
Steven


#76156

From	Roy Smith <roy@panix.com>
Date	2014-08-12 20:30 -0400
Message-ID	<roy-008918.20303912082014@news.panix.com>
In reply to	#76155

In article <53eaab7d$0$29979$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> By studying how other scraping programs work, and studying how your racing
> pages store data, you should be able to put the two together and see how to
> get the data you want.

It's also worth mentioning that some web sites *want* you to have their 
data, and make it easy to do so by exposing it via public APIs or other 
download methods.  Wikipedia.  Many government web sites.  Twitter.  
Facebook.  Reddit.

Whenever you start thinking about web scraping, it's always worth 
spending a little time investigating if such an API exists.  If it does, 
that's where you want to go.  If not, well, there's always Beautiful 
Soup :-)


#76158

From	Peter Pearson <ppearson@nowhere.invalid>
Date	2014-08-13 00:50 +0000
Message-ID	<c4vr3fFd48mU1@mid.individual.net>
In reply to	#76154

On Tue, 12 Aug 2014 15:44:58 -0700 (PDT), Simon Evans wrote:
[snip]
> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup'
> book, but can't seem to make any progress with it, I am too thick to
> make much use of it. I was hoping I could scrape specified stuff off
> Web pages without using it.

I've only used BeautifulSoup a little bit, and am no expert, but
with it one can do wonderfully complex things with simple code.
Perhaps you can find some examples online; this newsgroup sometimes
has awesome demonstrations of BS prowess.

At the risk of embarrassing myself in public, I'll show you some
code I wrote that scrapes data from a web page containing a
description of a drug.  The drug's web page contains the desired
data in tags that look like this:

<input id="form-widgets-minconcentration" name="form.widgets.minconcentration"
class="text-widget float-field" value="1.0" type="text" />

The following code finds all these tags and builds a dict by which you
can look up the "value" for any given "name".

    # Python 2 with BeautifulSoup 3.
    import urllib2

    from BeautifulSoup import BeautifulSoup as BS
    ...

    def dump_drug_data(url):
        """Fetch data from one drug's URL and print selected fields in columns.
        """
        contents = urllib2.urlopen(url=url).read()
        soup = BS(contents)
        # Collect every <input> tag and map its "name" to its "value".
        inputs = soup.findAll("input")
        input_dict = dict((i.get("name"), i.get("value")) for i in inputs)
        # Print the selected fields, each padded to a fixed column width.
        print(" ".join(f.format(input_dict[n]) for f, n in (
                    ("{0:5s}", "form.widgets.absorption_halflife"),
                    ("{0:5s}", "form.widgets.elimination_halflife"),
                    ("{0:5s}", "form.widgets.minconcentration"),
                    ("{0:5s}", "form.widgets.maxconcentration"),
                    ("{0:13s}", "form.widgets.title"),
                    )))
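
For anyone without BeautifulSoup to hand, the same name-to-value idea can be sketched with the standard library alone (Python 3). This is a stand-in under stated assumptions, not the code above: `InputCollector` and `input_values` are hypothetical names, and the sample markup simply wraps the `<input>` tag quoted earlier in a `<form>`.

```python
from html.parser import HTMLParser

# Sample markup: the <input> tag quoted above, wrapped in a <form>.
SAMPLE = (
    '<form>'
    '<input id="form-widgets-minconcentration" '
    'name="form.widgets.minconcentration" '
    'class="text-widget float-field" value="1.0" type="text" />'
    '</form>'
)


class InputCollector(HTMLParser):
    """Build a dict mapping each <input> tag's name to its value."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also arrive here,
        # because the default handle_startendtag calls handle_starttag.
        if tag == "input":
            attrs = dict(attrs)
            self.values[attrs.get("name")] = attrs.get("value")


def input_values(html):
    """Return {name: value} for every <input> found in html."""
    collector = InputCollector()
    collector.feed(html)
    collector.close()
    return collector.values


print(input_values(SAMPLE))
```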

Try giving a more specific picture of your quest, and it's very
likely that people smarter than me will give you good help.

-- 
To email me, substitute nowhere->spamcop, invalid->net.


#76206

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2014-08-13 14:53 +0000
Message-ID	<lsfu5l$o7d$3@dont-email.me>
In reply to	#76147

On Tue, 12 Aug 2014 13:00:30 -0700, Simon Evans wrote:

> in accessing from the 'Racing Post' on a daily basis. Anyhow, the code

The following is some starter code. You will have to look at the output, 
compare it to the web page, and work out how you want to process it 
further. Note that I use Beautiful Soup and requests. The output is the 
HTML for each cell in the table, with a line of "+" characters at the 
table row breaks. I suggest you look at the Beautiful Soup documentation 
at http://www.crummy.com/software/BeautifulSoup/bs4/doc/ to work out how 
you may wish to select which table cells contain data you are interested 
in and how to extract it.

#!/usr/bin/python
"""
Program to extract data from racingpost.
"""

from bs4 import BeautifulSoup
import requests

# The URL is split across two adjacent string literals to keep lines short.
r = requests.get(
    "http://www.racingpost.com/horses2/cards/card.sd"
    "?race_id=607466&r_date=2014-08-13#raceTabs=sc_"
)

if r.status_code == 200:
    # Name the parser explicitly so bs4 does not have to pick one.
    soup = BeautifulSoup(r.content, "html.parser")
    table = soup.find("table", id="sc_horseCard")
    # Print the HTML of every cell, with a line of "+" after each row.
    for row in table.find_all("tr"):
        for cell in row.find_all("td"):
            print(cell)
        print("+++++++++++++++++++++++++++++++++++++")
else:
    print("HTTP Status {0}".format(r.status_code))
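
As one possible next step, each row can be reduced to a list of cell text rather than raw HTML. The sketch below is only an illustration: it uses the standard library's `html.parser` (Python 3) so that it runs without any installs, whereas with bs4 calling `cell.get_text()` on each cell does the same job. The table markup is invented, apart from the `sc_horseCard` id and the runner quoted in the original post, and `TableText` and `table_rows` are hypothetical names.

```python
from html.parser import HTMLParser

# Invented markup standing in for the real page.
SAMPLE = """
<table id="sc_horseCard">
  <tr><td>1</td><td><a href="#">Lively Baron</a></td><td>9/4F</td></tr>
  <tr><td>2</td><td><a href="#">Second Horse</a></td><td>5/1</td></tr>
</table>
"""


class TableText(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.in_cell = False
            # Trim surrounding whitespace once the cell is complete.
            self.rows[-1][-1] = self.rows[-1][-1].strip()

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def table_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    return parser.rows


for row in table_rows(SAMPLE):
    print(row)
```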

-- 
Denis McMahon, denismfmcmahon@gmail.com
