


Re: Suitable Python code to scrape specific details from web pages.

From Peter Pearson <ppearson@nowhere.invalid>
Newsgroups comp.lang.python
Subject Re: Suitable Python code to scrape specific details from web pages.
Date 2014-08-13 00:50 +0000
Message-ID <c4vr3fFd48mU1@mid.individual.net>
References <a8f10c4f-d4a0-48ed-ae92-2a43e9a094c3@googlegroups.com> <e2011de5-10fa-4de1-89fa-4e41882a6646@googlegroups.com>



On Tue, 12 Aug 2014 15:44:58 -0700 (PDT), Simon Evans wrote:
[snip]
> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup'
> book, but can't seem to make any progress with it, I am too thick to
> make much use of it. I was hoping I could scrape specified stuff off
> Web pages without using it.

I've only used BeautifulSoup a little bit, and am no expert, but
with it one can do wonderfully complex things with simple code.
Perhaps you can find some examples online; this newsgroup sometimes
has awesome demonstrations of BS prowess.

At the risk of embarrassing myself in public, I'll show you some
code I wrote that scrapes data from a web page containing a
description of a drug.  The drug's web page contains the desired
data in tags that look like this:

<input id="form-widgets-minconcentration" name="form.widgets.minconcentration"
class="text-widget float-field" value="1.0" type="text" />

The following code finds all the "input" tags on the page and builds a
dict in which you can look up the "value" for any given "name".

    import urllib2  # Python 2 standard library; provides urlopen, used below

    from BeautifulSoup import BeautifulSoup as BS  # BeautifulSoup 3 package
    ...

    def dump_drug_data(url):
        """Fetch data from one drug's URL and print selected fields in columns.
        """
        contents = urllib2.urlopen(url=url).read()
        soup = BS(contents)
        # Map each <input> tag's "name" attribute to its "value" attribute.
        inputs = soup.findAll("input")
        input_dict = dict((i.get("name"), i.get("value")) for i in inputs)
        # Print the selected fields on one line, each padded to a minimum width.
        # A field missing from the page raises KeyError here.
        print(" ".join(f.format(input_dict[n]) for f, n in (
                    ("{0:5s}", "form.widgets.absorption_halflife"),
                    ("{0:5s}", "form.widgets.elimination_halflife"),
                    ("{0:5s}", "form.widgets.minconcentration"),
                    ("{0:5s}", "form.widgets.maxconcentration"),
                    ("{0:13s}", "form.widgets.title"),
                    )))
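
Since you were hoping to manage without BeautifulSoup, here is a rough
sketch of the same name-to-value idea using only the Python 3 standard
library (html.parser). It is not the code I actually run; the names
InputCollector, input_values, format_row, SAMPLE and COLUMNS are made up
for illustration, and the second tag in SAMPLE (with the value
"Exampledrug") is invented so there are two columns to print. It assumes
the page's <input> tags look like the one shown above.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect a name -> value mapping for every <input> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs for this tag.
        # Self-closing tags such as <input ... /> also end up here.
        if tag == "input":
            attr_map = dict(attrs)
            name = attr_map.get("name")
            if name is not None:
                self.values[name] = attr_map.get("value")


def input_values(html_text):
    """Return a dict mapping each input's "name" to its "value"."""
    parser = InputCollector()
    parser.feed(html_text)
    parser.close()
    return parser.values


def format_row(values, columns):
    """Join the chosen fields into one line; missing fields print as blanks."""
    return " ".join(fmt.format(values.get(name) or "") for fmt, name in columns)


# Hypothetical page fragment: the first tag is the one quoted above,
# the second is invented for this example.
SAMPLE = '''
<input id="form-widgets-minconcentration" name="form.widgets.minconcentration"
class="text-widget float-field" value="1.0" type="text" />
<input id="form-widgets-title" name="form.widgets.title"
class="text-widget" value="Exampledrug" type="text" />
'''

COLUMNS = (
    ("{0:5s}", "form.widgets.minconcentration"),
    ("{0:13s}", "form.widgets.title"),
)

values = input_values(SAMPLE)
print(format_row(values, COLUMNS))
```

To use it on a live page you would fetch the text first (in Python 3,
urllib.request.urlopen(url).read().decode(...) with whatever encoding the
page uses) and pass that to input_values. Unlike my code above, format_row
uses dict.get, so a field that is absent from the page comes out as blanks
rather than raising KeyError.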

Try giving a more specific picture of your quest, and it's very
likely that people smarter than me will give you good help.

-- 
To email me, substitute nowhere->spamcop, invalid->net.



