Screen scraper to get all 'a title' elements

Newsgroups comp.lang.python
Date Wed, 25 Nov 2015 12:42:00 -0800 (PST)
Message-ID <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com>
Subject Screen scraper to get all 'a title' elements
From ryguy7272 <ryanshuell@gmail.com>

Hello experts.  I'm looking at this URL:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

I'm trying to figure out how to list all 'a title' elements, that is, every <a> element that has a title attribute.  For instance, I see the following in the page source:
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>

So, I tried putting a script together to get 'title'.  Here's my attempt.

import requests
import sys
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"     
source_code = requests.get(url) 
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
    print(link)

All that does is print the page's own <title> element, not the links.  I then tried to get the links from that URL with this script.

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')

#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links

That doesn't work either.  Basically, I'd like to see this.

Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx

How can I do that?
Thanks all!!
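
For reference, here is a minimal sketch of one way to get that kind of list. It is not taken from any reply in the thread. It assumes Python 3 and uses only the standard library's html.parser, run against the four sample anchors quoted above rather than the live page, so the names SAMPLE_HTML, TitledLinkParser and titled_links are made up for illustration. The idea is to look at each <a> start tag, check whether it carries a title attribute, and collect the text inside it. This differs from the first script, which asked for tags *named* 'title'.

```python
from html.parser import HTMLParser

# Sample markup copied from the anchors quoted in the post (assumption:
# the real page uses the same <a title="..."> structure throughout).
SAMPLE_HTML = '''
<title>Wikipedia:Unusual place names</title>
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>
<a href="/wiki/Main_Page">a link with no title attribute</a>
'''


class TitledLinkParser(HTMLParser):
    """Collect (title attribute, link text) for every <a> that has a title."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (title, text) pairs
        self._title = None   # title of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self._title = title
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a titled <a> element.
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._title is not None:
            self.links.append((self._title, "".join(self._text).strip()))
            self._title = None


def titled_links(html):
    """Return a list of (title, text) pairs for titled links in html."""
    parser = TitledLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for title, text in titled_links(SAMPLE_HTML):
        print(text)
```

With BeautifulSoup already imported as in the first script, the same filter can be written as soup.find_all('a', title=True), then reading link['title'] or link.get_text() for each result. On the real page this would presumably also pick up navigation and sidebar links that have title attributes, so some further narrowing to the article body would likely be needed.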

Thread

Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 12:42 -0800
  Re: Screen scraper to get all 'a title' elements MRAB <python@mrabarnett.plus.com> - 2015-11-25 20:55 +0000
    Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-25 23:30 +0100
      Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 14:48 -0800
        Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:06 +1100
          Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 00:44 +0100
            Re: Screen scraper to get all 'a title' elements Marko Rauhamaa <marko@pacujo.net> - 2015-11-26 01:53 +0200
              Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:59 +1100
            Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:54 +1100
            Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 02:05 +0100
        Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 00:33 +0100
          Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 15:37 -0800
            Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:42 +1100
  Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 14:04 -0800
    Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 09:10 +1100
  Re: Screen scraper to get all 'a title' elements TP <wingusr@gmail.com> - 2015-11-25 17:15 -0800
  Re: Screen scraper to get all 'a title' elements Denis McMahon <denismfmcmahon@gmail.com> - 2015-11-26 14:49 +0000