From: MRAB
Newsgroups: comp.lang.python
Subject: Re: Screen scraper to get all 'a title' elements
Date: Wed, 25 Nov 2015 20:55:56 +0000
References: <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com>

On 2015-11-25 20:42, ryguy7272 wrote:
> Hello experts.  I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> I'm trying to figure out how to list all 'a title' elements.  For instance, I see the following:
> Accident
> Ala-Lemu
> Alert
> Apocalypse Peaks
>
> So, I tried putting a script together to get 'title'.  Here's my attempt.
>
> import requests
> import sys
> from bs4 import BeautifulSoup
>
> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> source_code = requests.get(url)
> plain_text = source_code.text
> soup = BeautifulSoup(plain_text)
> for link in soup.findAll('title'):
>     print(link)
>
> All that does is get the title of the page.  I tried to get the links from that url, with this script.
>
A 'title' element has the form "<title>". What you should be looking
for are 'a' elements, those of the form "<a ...>".

> import urllib2
> import re
>
> #connect to a URL
> website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
>
> #read html code
> html = website.read()
>
> #use re.findall to get all the links
> links = re.findall('"((http|ftp)s?://.*?)"', html)
>
> print links
>
> That doesn't work either.  Basically, I'd like to see this.
>
> Accident
> Ala-Lemu
> Alert
> Apocalypse Peaks
> Athol
> Å
> Barbecue
> Båstad
> Bastardstown
> Batman
> Bathmen (Battem), Netherlands
> ...
> Worms
> Yell
> Zigzag
> Zzyzx
>
> How can I do that?
> Thanks all!!
>
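
To illustrate the point about 'a' elements, here is a minimal sketch that
collects the 'title' attribute of every "<a ...>" tag that has one. It uses
the standard library's html.parser instead of BeautifulSoup so that it runs
without extra packages. The names AnchorTitleParser and anchor_titles are
made up for this example, and the sample markup is invented; it is not
copied from the Wikipedia page, whose real markup may differ.

```python
from html.parser import HTMLParser


class AnchorTitleParser(HTMLParser):
    """Collect the 'title' attribute of every <a> element that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title)


def anchor_titles(html):
    """Return the titles of all <a title="..."> elements in an HTML string."""
    parser = AnchorTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Invented sample markup standing in for the downloaded page (an assumption).
sample = """
<ul>
<li><a href="/wiki/Accident,_Maryland" title="Accident, Maryland">Accident</a></li>
<li><a href="/wiki/Alert,_Nunavut" title="Alert, Nunavut">Alert</a></li>
<li><a href="/wiki/Zzyzx">Zzyzx</a></li>
</ul>
"""

for title in anchor_titles(sample):
    print(title)
```

For the real page, the text fetched in the first script (source_code.text)
could be passed to anchor_titles() in place of the sample. With
BeautifulSoup, soup.find_all('a', title=True) should select the same
elements. Either way the result will probably include navigation links as
well as place names, so some filtering is likely to be needed.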