From: Peter Pearson
Newsgroups: comp.lang.python
Subject: Re: Suitable Python code to scrape specific details from web pages.
Date: 13 Aug 2014 00:50:55 GMT

On Tue, 12 Aug 2014 15:44:58 -0700 (PDT), Simon Evans wrote:
[snip]
> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup'
> book, but can't seem to make any progress with it, I am too thick to
> make much use of it. I was hoping I could scrape specified stuff off
> Web pages without using it.

I've only used BeautifulSoup a little bit, and am no expert, but
with it one can do wonderfully complex things with simple code.
Perhaps you can find some examples online; this newsgroup sometimes
has awesome demonstrations of BS prowess.

At the risk of embarrassing myself in public, I'll show you some
code I wrote that scrapes data from a web page containing a
description of a drug. The drug's web page contains the desired
data in <input> tags, each carrying a "name" and a "value" attribute.

The following code finds all these tags and builds a dict by which
you can look up the "value" for any given "name".

    import urllib2  # Python 2 standard library
    from BeautifulSoup import BeautifulSoup as BS
    ...
    def dump_drug_data(url):
        """Fetch data from one drug's URL and print selected fields in columns.
        """
        contents = urllib2.urlopen(url=url).read()
        soup = BS(contents)
        # Map each <input> tag's "name" attribute to its "value" attribute.
        inputs = soup.findAll("input")
        input_dict = dict((i.get("name"), i.get("value")) for i in inputs)
        # Print the selected fields, each padded to a minimum column width.
        print(" ".join(f.format(input_dict[n]) for f, n in (
            ("{0:5s}", "form.widgets.absorption_halflife"),
            ("{0:5s}", "form.widgets.elimination_halflife"),
            ("{0:5s}", "form.widgets.minconcentration"),
            ("{0:5s}", "form.widgets.maxconcentration"),
            ("{0:13s}", "form.widgets.title"),
            )))

Try giving a more specific picture of your quest, and it's very
likely that people smarter than me will give you good help.

-- 
To email me, substitute nowhere->spamcop, invalid->net.