| Path | csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail |
|---|---|
| From | MRAB <python@mrabarnett.plus.com> |
| Newsgroups | comp.lang.python |
| Subject | Re: Screen scraper to get all 'a title' elements |
| Date | Wed, 25 Nov 2015 20:55:56 +0000 |
| Lines | 66 |
| Message-ID | <mailman.96.1448484959.20593.python-list@python.org> |
| References | <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset=utf-8; format=flowed |
| Content-Transfer-Encoding | 8bit |
| User-Agent | Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 |
| In-Reply-To | <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com> |
On 2015-11-25 20:42, ryguy7272 wrote:
> Hello experts. I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> I'm trying to figure out how to list all 'a title' elements. For instance, I see the following:
> <a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
> <a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
> <a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
> <a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>
>
> So, I tried putting a script together to get 'title'. Here's my attempt.
>
> import requests
> import sys
> from bs4 import BeautifulSoup
>
> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> source_code = requests.get(url)
> plain_text = source_code.text
> soup = BeautifulSoup(plain_text)
> for link in soup.findAll('title'):
>     print(link)
>
> All that does is get the title of the page. I tried to get the links from that url, with this script.
>
A 'title' element has the form "<title ...>". What you should be looking
for are 'a' elements, those of the form "<a ...>".
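A minimal sketch of that idea, assuming the goal is the visible text of every 'a' element that carries a 'title' attribute. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; with BeautifulSoup the comparable call would be soup.find_all('a', title=True). The class name TitledLinkParser and the sample_html string are made up for illustration, and the sample reuses the links quoted above rather than fetching the live page.

```python
from html.parser import HTMLParser


class TitledLinkParser(HTMLParser):
    """Collect the text of every <a> element that has a title attribute."""

    def __init__(self):
        super().__init__()
        self.link_texts = []          # text of each matching link, in page order
        self._in_titled_link = False  # True while inside a matching <a>...</a>
        self._chunks = []             # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a" and dict(attrs).get("title") is not None:
            self._in_titled_link = True
            self._chunks = []

    def handle_data(self, data):
        # Text outside a matching link (such as the page <title>) is ignored.
        if self._in_titled_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_titled_link:
            self.link_texts.append("".join(self._chunks).strip())
            self._in_titled_link = False


# Hypothetical sample built from the links quoted in the question.
sample_html = """
<html><head><title>Unusual place names</title></head><body>
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&amp;action=edit&amp;redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a href="/wiki/Main_Page">no title attribute</a>
</body></html>
"""

parser = TitledLinkParser()
parser.feed(sample_html)
parser.close()
for text in parser.link_texts:
    print(text)
```

On the real page this would also pick up navigation and sidebar links that have a 'title' attribute, so some further filtering (for example on the href prefix) would probably be needed.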
> import urllib2
> import re
>
> #connect to a URL
> website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
>
> #read html code
> html = website.read()
>
> #use re.findall to get all the links
> links = re.findall('"((http|ftp)s?://.*?)"', html)
>
> print links
>
> That doesn't work either. Basically, I'd like to see this.
>
> Accident
> Ala-Lemu
> Alert
> Apocalypse Peaks
> Athol
> Å
> Barbecue
> Båstad
> Bastardstown
> Batman
> Bathmen (Battem), Netherlands
> ...
> Worms
> Yell
> Zigzag
> Zzyzx
>
> How can I do that?
> Thanks all!!
>
>
Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 12:42 -0800
Re: Screen scraper to get all 'a title' elements MRAB <python@mrabarnett.plus.com> - 2015-11-25 20:55 +0000
Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-25 23:30 +0100
Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 14:48 -0800
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:06 +1100
Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 00:44 +0100
Re: Screen scraper to get all 'a title' elements Marko Rauhamaa <marko@pacujo.net> - 2015-11-26 01:53 +0200
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:59 +1100
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:54 +1100
Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 02:05 +0100
Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 00:33 +0100
Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 15:37 -0800
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:42 +1100
Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 14:04 -0800
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 09:10 +1100
Re: Screen scraper to get all 'a title' elements TP <wingusr@gmail.com> - 2015-11-25 17:15 -0800
Re: Screen scraper to get all 'a title' elements Denis McMahon <denismfmcmahon@gmail.com> - 2015-11-26 14:49 +0000