Groups > comp.lang.python > #76147 > unrolled thread
| Started by | Simon Evans <musicalhacksaw@yahoo.co.uk> |
|---|---|
| First post | 2014-08-12 13:00 -0700 |
| Last post | 2014-08-13 14:53 +0000 |
| Articles | 9 — 7 participants |
Suitable Python code to scrape specific details from web pages. Simon Evans <musicalhacksaw@yahoo.co.uk> - 2014-08-12 13:00 -0700
Re: Suitable Python code to scrape specific details from web pages. Rob Gaddi <rgaddi@technologyhighland.invalid> - 2014-08-12 13:11 -0700
Re: Suitable Python code to scrape specific details from web pages. Roy Smith <roy@panix.com> - 2014-08-12 17:28 -0400
Re: Suitable Python code to scrape specific details from web pages. alex23 <wuwei23@gmail.com> - 2014-08-18 15:04 +1000
Re: Suitable Python code to scrape specific details from web pages. Simon Evans <musicalhacksaw@yahoo.co.uk> - 2014-08-12 15:44 -0700
Re: Suitable Python code to scrape specific details from web pages. Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-08-13 10:04 +1000
Re: Suitable Python code to scrape specific details from web pages. Roy Smith <roy@panix.com> - 2014-08-12 20:30 -0400
Re: Suitable Python code to scrape specific details from web pages. Peter Pearson <ppearson@nowhere.invalid> - 2014-08-13 00:50 +0000
Re: Suitable Python code to scrape specific details from web pages. Denis McMahon <denismfmcmahon@gmail.com> - 2014-08-13 14:53 +0000
| From | Simon Evans <musicalhacksaw@yahoo.co.uk> |
|---|---|
| Date | 2014-08-12 13:00 -0700 |
| Subject | Suitable Python code to scrape specific details from web pages. |
| Message-ID | <a8f10c4f-d4a0-48ed-ae92-2a43e9a094c3@googlegroups.com> |
Dear Programmers,
I have been looking at the YouTube 'Web Scraping Tutorials' of Chris Reeves. I have tried a few of his Python programs in the Python27 command prompt, but altered them from accessing data using links say from the Dow Jones index, to accessing the details I would be interested in accessing from the 'Racing Post' on a daily basis. Anyhow, what the code returns, in the example I am going to give, is not the information I am seeking: instead of returning the given odds on a horse, it only returns a [], which isn't much use.
I would be glad if you could tell me where I am going wrong.
Yours faithfully
Simon Evans.
--------------------------------------------------------------------------------
>>>import urllib
>>>import re
>>>htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?
race_id=600048r_date=2014-05-08#raceTabs=sc_")
htmltext = htmlfile.read()
regex = '<strong>1<a href="http://www.racingpost.com/horses/horse_home.sd?
horse_id=758752"onclick="scorecards.send("horse_name":):return Html.popup(this,
{width:695,height:800})"title="Full details about this HORSE">Lively
Baron</a>9/4F</strong><br/>'
>>>pattern = re.compile(regex)
>>>odds=re.findall(pattern,htmltext)
>>>print odds
[]
>>>
--------------------------------------------------------------------------------
>>>import urllib
>>>import re
>>>htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?
>>>race_id=600048r_date=2014-05-08#raceTabs=sc_")
>>>htmltext = htmlfile.read()
>>>regex = '<a></a>'
>>>pattern = re.compile(regex)
>>>odds=re.findall(pattern,htmltext)
>>>print odds
[]
>>>
-------------------------------------------------------------------------------
| From | Rob Gaddi <rgaddi@technologyhighland.invalid> |
|---|---|
| Date | 2014-08-12 13:11 -0700 |
| Message-ID | <20140812131147.5c99507c@rg.highlandtechnology.com> |
| In reply to | #76147 |
If you want web scraping, you want to use
http://www.crummy.com/software/BeautifulSoup/ . End of story.
--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-08-12 17:28 -0400 |
| Message-ID | <roy-AD3509.17281512082014@news.panix.com> |
| In reply to | #76147 |
In article <a8f10c4f-d4a0-48ed-ae92-2a43e9a094c3@googlegroups.com>,
Simon Evans <musicalhacksaw@yahoo.co.uk> wrote:

> Dear Programmers,
> I have been looking at the YouTube 'Web Scraping Tutorials' of Chris Reeves.
> I have tried a few of his Python programs in the Python27 command prompt, but
> altered them from accessing data using links say from the Dow Jones index, to
> accessing the details I would be interested in accessing from the 'Racing
> Post' on a daily basis. Anyhow, what the code returns, in the example I am
> going to give, is not the information I am seeking: instead of returning
> the given odds on a horse, it only returns a [], which isn't much use.
> I would be glad if you could tell me where I am going wrong.

Rather than comment on your specific code (but, thank you for posting it), I'll make a couple of more generic suggestions.

First, if you're doing anything with fetching web pages, install the wonderful requests module (http://docs.python-requests.org/en/latest/). It's so much easier to work with than urllib.

Second, if you're going to be parsing web pages, trying to use regexes is a losing game. You need something that knows how to parse HTML. The canonical answer is lxml (http://lxml.de/), but Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) is less intimidating to use.
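To make the point about regexes concrete, here is a minimal sketch using only the Python 3 standard library. `html.parser` stands in for lxml or Beautiful Soup purely so that it runs without installing anything; it is not what the post recommends. The HTML fragment, the `example.com` URL, and the `LinkTextParser` class are made up for illustration and are not taken from the real Racing Post page. The first half shows one reason a pasted-in pattern can return `[]`: an unescaped `?` is a regex operator, not a literal character. The second half shows a parser pulling out the link text without caring about attribute order or spacing.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment; the real page markup will differ.
HTML = ('<strong>1 <a title="Full details about this HORSE" '
        'href="http://example.com/horse_home.sd?horse_id=758752">'
        'Lively Baron</a> 9/4F</strong>')

# Pasting part of a URL straight into a pattern fails: "d?" means
# "an optional d", so nothing in the pattern matches the literal "?".
naive = 'horse_home.sd?horse_id=758752'
naive_hits = re.findall(naive, HTML)

# re.escape() turns every special character into a literal one.
escaped_hits = re.findall(re.escape(naive), HTML)


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


parser = LinkTextParser()
parser.feed(HTML)

print(naive_hits)
print(escaped_hits)
print(parser.links)
```

Even the escaped pattern only matches one exact spelling of the markup, which is why a parser tends to be the more durable choice once the page changes.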
| From | alex23 <wuwei23@gmail.com> |
|---|---|
| Date | 2014-08-18 15:04 +1000 |
| Message-ID | <lss1gu$qaf$1@dont-email.me> |
| In reply to | #76151 |
On 13/08/2014 7:28 AM, Roy Smith wrote:

> Second, if you're going to be parsing web pages, trying to use regexes
> is a losing game. You need something that knows how to parse HTML. The
> canonical answer is lxml (http://lxml.de/), but Beautiful Soup
> (http://www.crummy.com/software/BeautifulSoup/) is less intimidating to
> use.

lxml also has a BeautifulSoup parser, so you can easily mix and match approaches:

http://lxml.de/elementsoup.html
| From | Simon Evans <musicalhacksaw@yahoo.co.uk> |
|---|---|
| Date | 2014-08-12 15:44 -0700 |
| Message-ID | <e2011de5-10fa-4de1-89fa-4e41882a6646@googlegroups.com> |
| In reply to | #76147 |
Dear Programmers,

Thank you for your responses. I have installed 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup' book, but can't seem to make any progress with it; I am too thick to make much use of it. I was hoping I could scrape specified stuff off Web pages without using it. I have installed 'Requests' also. Is there any code you can suggest that can access the sort of Web page values that I have referred to, such as odds, names of runners, stuff like that, from the 'inspect element' or 'source' HTML pages on www.Racingpost.com?
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-08-13 10:04 +1000 |
| Message-ID | <53eaab7d$0$29979$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #76154 |
Simon Evans wrote:

> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup' book,
> but can't seem to make any progress with it, I am too thick to make much
> use of it. I was hoping I could scrape specified stuff off Web pages
> without using it.

Yes, you can scrape stuff off web pages without programming. What you do is you open the web page in your browser, then open a notebook and, with a pencil or pen, copy the bits you read into the notebook. If you're very skilled, you can avoid the pencil and paper and type directly into a text editor on the computer.

But other than that, every website is different, so there is no short-cut to web scraping. You need to customize the scraping code for each website you scrape, and that means programming.

Do you know how to program? Are you interested in learning?

If the answer is No and No, then I suggest you pony up some money and pay somebody who already knows how to program to do the job for you.

If the answer is No and Yes, then start at the beginning. Do some programming tutorials, learn to program the basics before moving on to something moderately difficult like web scraping.

If the answer is that you already know how to program, but just don't know how to do web scraping, then stick with it and you'll get there. Web scraping is tricky, but possible, and if you work hard at it you'll succeed.

Unless you're an experienced programmer with all the right skills, don't expect this to be something you do in a few minutes. Depending on your level of experience, you could expect to spend dozens of hours to learn how to scrape a single website. (Fortunately, the second website will probably be a little easier, and the third easier still. By the time you've done a dozen, you'll wonder what the fuss was about.)

By studying how other scraping programs work, and studying how your racing pages store data, you should be able to put the two together and see how to get the data you want. There's plenty of information to help you learn how to web scrape, with or without BeautifulSoup:

https://startpage.com/do/search/?q=beautifulsoup+web+scraping
https://ixquick.com/do/search/?q=python+web+scraping+examples
https://duckduckgo.com/html/?q=requests%20python%20web%20scraping%20example

but no alternative to actually writing code.

> I have installed 'Requests' also, is there any code I
> can use that you can suggest that can access the sort of Web page values
> that I have referred to ? such as odds, names of runners, stuff like that
> off the 'inspect element' or 'source' HTML pages, on www.Racingpost.com.

Specifically those pages? Doubtful. If you are really lucky (1) somebody else has already done the programming, (2) they've made their program available to others, and (3) you can find that program on the Internet. Use the search engine of your choice to search for it.

--
Steven
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-08-12 20:30 -0400 |
| Message-ID | <roy-008918.20303912082014@news.panix.com> |
| In reply to | #76155 |
In article <53eaab7d$0$29979$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> By studying how other scraping programs work, and studying how your racing
> pages store data, you should be able to put the two together and see how to
> get the data you want.

It's also worth mentioning that some web sites *want* you to have their data, and make it easy to do so by exposing it via public APIs or other download methods. Wikipedia. Many government web sites. Twitter. Facebook. Reddit.

Whenever you start thinking about web scraping, it's always worth spending a little time investigating if such an API exists. If it does, that's where you want to go. If not, well, there's always Beautiful Soup :-)
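As a rough illustration of the API route, the sketch below builds a query URL for Wikipedia's MediaWiki API and reads a JSON reply with the Python 3 standard library. The choice of parameters and the `sample_body` string are assumptions for the example: the reply is hand-written and trimmed (the page id `123` is invented), and a real response carries more fields. No network request is made here.

```python
import json
from urllib.parse import urlencode

# MediaWiki API endpoint for English Wikipedia.
API = "https://en.wikipedia.org/w/api.php"

# Ask for basic information about one page, returned as JSON.
params = {
    "action": "query",
    "format": "json",
    "titles": "Web scraping",
    "prop": "info",
}
url = API + "?" + urlencode(params)

# Hand-written stand-in for the reply body; values are illustrative only.
sample_body = (
    '{"query": {"pages": {"123": '
    '{"pageid": 123, "title": "Web scraping"}}}}'
)

# The reply is structured data, so no HTML parsing is needed at all.
data = json.loads(sample_body)
titles = [page["title"] for page in data["query"]["pages"].values()]

print(url)
print(titles)
```

To fetch the real reply, something like `requests.get(API, params=params).json()` should give the same kind of dictionary directly.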
| From | Peter Pearson <ppearson@nowhere.invalid> |
|---|---|
| Date | 2014-08-13 00:50 +0000 |
| Message-ID | <c4vr3fFd48mU1@mid.individual.net> |
| In reply to | #76154 |
On Tue, 12 Aug 2014 15:44:58 -0700 (PDT), Simon Evans wrote:
[snip]
> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup'
> book, but can't seem to make any progress with it, I am too thick to
> make much use of it. I was hoping I could scrape specified stuff off
> Web pages without using it.
I've only used BeautifulSoup a little bit, and am no expert, but
with it one can do wonderfully complex things with simple code.
Perhaps you can find some examples online; this newsgroup sometimes
has awesome demonstrations of BS prowess.
At the risk of embarrassing myself in public, I'll show you some
code I wrote that scrapes data from a web page containing a
description of a drug. The drug's web page contains the desired
data in tags that look like this:
    <input id="form-widgets-minconcentration" name="form.widgets.minconcentration"
           class="text-widget float-field" value="1.0" type="text" />

The following code finds all these tags and builds a dict by which you
can look up the "value" for any given "name".

    import urllib2
    from BeautifulSoup import BeautifulSoup as BS
    ...
    def dump_drug_data(url):
        """Fetch data from one drug's URL and print selected fields in columns.
        """
        contents = urllib2.urlopen(url=url).read()
        soup = BS(contents)
        # Map each <input> tag's "name" attribute to its "value" attribute.
        inputs = soup.findAll("input")
        input_dict = dict((i.get("name"), i.get("value")) for i in inputs)
        # Print the selected fields, each padded to a fixed column width.
        print(" ".join(f.format(input_dict[n]) for f, n in (
            ("{0:5s}", "form.widgets.absorption_halflife"),
            ("{0:5s}", "form.widgets.elimination_halflife"),
            ("{0:5s}", "form.widgets.minconcentration"),
            ("{0:5s}", "form.widgets.maxconcentration"),
            ("{0:13s}", "form.widgets.title"),
            )))

Try giving a more specific picture of your quest, and it's very
likely that people smarter than me will give you good help.
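
For readers on Python 3 without BeautifulSoup installed, the same name-to-value idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the code from the post. `InputCollector` is a hypothetical helper, and the second `<input>` tag in `SAMPLE` (with the value "Example drug") is invented to give the dict more than one entry.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Build a {name: value} dict from every <input> tag that has a name."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also arrive here.
        if tag == "input":
            attr = dict(attrs)
            if "name" in attr:
                self.values[attr["name"]] = attr.get("value")


# Sample markup: the first tag is the one quoted in the post,
# the second is made up for illustration.
SAMPLE = '''
<form>
<input id="form-widgets-minconcentration" name="form.widgets.minconcentration"
       class="text-widget float-field" value="1.0" type="text" />
<input id="form-widgets-title" name="form.widgets.title"
       class="text-widget" value="Example drug" type="text" />
</form>
'''

collector = InputCollector()
collector.feed(SAMPLE)
print(collector.values["form.widgets.minconcentration"])
```

Once the dict exists, looking up any field is a plain `collector.values[name]`, which is the same shape of result as `input_dict` above.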
--
To email me, substitute nowhere->spamcop, invalid->net.
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2014-08-13 14:53 +0000 |
| Message-ID | <lsfu5l$o7d$3@dont-email.me> |
| In reply to | #76147 |
On Tue, 12 Aug 2014 13:00:30 -0700, Simon Evans wrote:
> in accessing from the 'Racing Post' on a daily basis. Anyhow, the code
Following is some starter code. You will have to look at the output,
compare it to the web page, and work out how you want to process it
further. Note that I use beautifulsoup and requests. The output is the
html for each cell in the table with a line of "+" characters at the
table row breaks. I suggest you look at the beautifulsoup documentation
at http://www.crummy.com/software/BeautifulSoup/bs4/doc/ to work out how
you may wish to select which table cells contain data you are interested
in and how to extract it.
    #!/usr/bin/python
    """
    Program to extract data from racingpost.
    """
    from bs4 import BeautifulSoup
    import requests

    r = requests.get( "http://www.racingpost.com/horses2/cards/card.sd?"
                      "race_id=607466&r_date=2014-08-13#raceTabs=sc_" )
    if r.status_code == 200:
        # Name the parser explicitly so the result does not depend on
        # which parsers happen to be installed.
        soup = BeautifulSoup( r.content, "html.parser" )
        table = soup.find( "table", id="sc_horseCard" )
        for row in table.find_all( "tr" ):
            # Print the raw HTML of every cell, then a separator per row.
            for cell in row.find_all( "td" ):
                print( cell )
            print( "+++++++++++++++++++++++++++++++++++++" )
    else:
        print( "HTTP Status %s" % r.status_code )
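
As a possible next step after inspecting that output, the sketch below turns a table into a list of rows of cell text. It uses Python 3's standard `html.parser` on a small inline sample so it runs offline; this is an assumption-laden stand-in, not the Beautiful Soup code above. The table id `sc_horseCard` comes from that code, while `TableTextParser`, the sample rows, and the second runner ("Another Runner", 5/1) are invented. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>, for one table id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.rows = []
        self._in_table = False
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self._in_table = True
        elif self._in_table and tag == "tr":
            self.rows.append([])
        elif self._in_table and tag == "td":
            if not self.rows:          # tolerate a <td> with no <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <a>) still belongs to the cell.
        if self._in_cell:
            self.rows[-1][-1] += data


# Made-up sample; the real card page will have different columns.
SAMPLE = '''
<table id="sc_horseCard">
<tr><td>1</td><td><a href="#">Lively Baron</a></td><td>9/4F</td></tr>
<tr><td>2</td><td><a href="#">Another Runner</a></td><td>5/1</td></tr>
</table>
'''

parser = TableTextParser("sc_horseCard")
parser.feed(SAMPLE)
rows = [[cell.strip() for cell in row] for row in parser.rows]
for row in rows:
    print(row)
```

With Beautiful Soup the equivalent per-cell step would be along the lines of `cell.get_text(strip=True)` inside the inner loop, which keeps the rest of the starter code unchanged.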
--
Denis McMahon, denismfmcmahon@gmail.com