| From | Grobu <snailcoder@retrosite.invalid> |
|---|---|
| Newsgroups | comp.lang.python |
| Subject | Re: Screen scraper to get all 'a title' elements |
| Date | 2015-11-25 23:30 +0100 |
| Organization | A noiseless patient Spider |
| Message-ID | <n35ckk$9q0$1@dont-email.me> |
| References | <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com> <mailman.96.1448484959.20593.python-list@python.org> |
Hi
It seems that links on that Wikipedia page follow the structure:
<a href="..." title="...">
You could extract a list of link titles with something like:
re.findall(r'<a[^>]+title="(.+?)"', html)
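To make that concrete, here is a minimal sketch that runs the same pattern against three of the anchors quoted further down in this thread. It assumes Python 3, and the `html` string is just sample data copied from the original question, not the live page. A regex like this is a quick hack that relies on the markup staying this regular.

```python
import re

# Sample markup copied from the anchors quoted in this thread
# (assumption: the real page follows the same attribute layout).
html = '''<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>'''

# Capture the value of the title attribute of each <a ...> tag.
titles = re.findall(r'<a[^>]+title="(.+?)"', html)
print(titles)
```

Note that this returns the title attributes ("Accident, Maryland"), not the visible link text ("Accident").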
HTH,
-Grobu-
On 25/11/15 21:55, MRAB wrote:
> On 2015-11-25 20:42, ryguy7272 wrote:
>> Hello experts. I'm looking at this url:
>> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>>
>> I'm trying to figure out how to list all 'a title' elements. For
>> instance, I see the following:
>> <a title="Accident, Maryland"
>> href="/wiki/Accident,_Maryland">Accident</a>
>> <a class="new" title="Ala-Lemu (page does not exist)"
>> href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
>> <a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
>> <a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse
>> Peaks</a>
>>
>> So, I tried putting a script together to get 'title'. Here's my attempt.
>>
>> import requests
>> import sys
>> from bs4 import BeautifulSoup
>>
>> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
>> source_code = requests.get(url)
>> plain_text = source_code.text
>> soup = BeautifulSoup(plain_text)
>> for link in soup.findAll('title'):
>>     print(link)
>>
>> All that does is get the title of the page. I tried to get the links
>> from that url, with this script.
>>
> A 'title' element has the form "<title ...>". What you should be looking
> for are 'a' elements, those of the form "<a ...>".
>
>> import urllib2
>> import re
>>
>> #connect to a URL
>> website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
>>
>>
>> #read html code
>> html = website.read()
>>
>> #use re.findall to get all the links
>> links = re.findall('"((http|ftp)s?://.*?)"', html)
>>
>> print links
>>
>> That doesn't work either. Basically, I'd like to see this.
>>
>> Accident
>> Ala-Lemu
>> Alert
>> Apocalypse Peaks
>> Athol
>> Å
>> Barbecue
>> Båstad
>> Bastardstown
>> Batman
>> Bathmen (Battem), Netherlands
>> ...
>> Worms
>> Yell
>> Zigzag
>> Zzyzx
>>
>> How can I do that?
>> Thanks all!!
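
MRAB's point in the quoted text, that the data sits on 'a' elements rather than 'title' elements, can also be sketched without a regex. The following is only an illustration, assuming Python 3 and the standard-library html.parser module. The class name AnchorCollector and the `sample` string are made up for this example, and the sample markup is copied from the question rather than fetched from Wikipedia.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (title, text) for each <a> element that has a title attribute."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (title, text) pairs
        self._title = None   # title of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self._title = title
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a titled <a> element.
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._title is not None:
            # Collapse line breaks and runs of spaces in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._title, text))
            self._title = None


# Sample data copied from the question (not the live page).
sample = '''<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse
Peaks</a>'''

parser = AnchorCollector()
parser.feed(sample)
parser.close()

for title, text in parser.links:
    print(text)
```

Printing `text` gives the short names the original poster asked for; `title` holds the full attribute value if that is wanted instead. On the real page, the same class would be fed the downloaded HTML and would also pick up navigation links, so some filtering would still be needed.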