Re: strip away html tags from extracted links

References <ab5d3c8b-401f-458d-9701-fa283936a6ff@googlegroups.com> <l7aj48$84p$1@ger.gmane.org>
Date 2013-11-29 12:44 -0500
Subject Re: strip away html tags from extracted links
From Joel Goldstick <joel.goldstick@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.3404.1385747062.18130.python-list@python.org> (permalink)


On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:

> On 29/11/2013 16:56, Max Cuban wrote:
>
>> I have the following code to extract certain links from a webpage:
>>
>> from bs4 import BeautifulSoup
>> import urllib2, sys
>> import re
>>
>> def tonaton():
>>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>>      req = urllib2.Request(site, headers=hdr)
>>      jobpass = urllib2.urlopen(req)
>>      invalid_tag = ('h2')
>>      soup = BeautifulSoup(jobpass)
>>      print soup.find_all('h2')
>>
>> The links are contained in the 'h2' tags so I get the links as follows:
>>
>> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
>> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
>> <h2><a href="/en/automobile-technician-accra">Automobile
>> Technician</a></h2>
>> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>>
>> But I'm interested in getting rid of all the 'h2' tags so that I have
>> links only in this manner:
>>
>> <a href="/en/cashiers-accra">cashiers </a>
>> <a href="/en/cake-baker-accra">Cake baker</a>
>> <a href="/en/automobile-technician-accra">Automobile Technician</a>
>> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>>
>>
>> This is more a Beautiful Soup question than a Python one.  Have you gone
>> through their tutorial?  Check here:
>>
>
They have an example that looks close here:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

In your case, you want the href value of the <a> child of each h2 element.

So this might be close (untested)

for heading in soup.find_all('h2'):
    # heading.a is the first <a> tag inside this h2
    print(heading.a.get('href'))
# /en/cashiers-accra
# /en/cake-baker-accra
# /en/automobile-technician-accra
# /en/marketing-officer-accra-4
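
The original question asked for the <a> tags themselves rather than only the href values, so here is a minimal sketch of one way that might be done. It assumes bs4 is installed and uses Python's built-in "html.parser". The sample markup is taken from the output quoted above, plus one heading without a link that is added purely for illustration. The function name links_inside_h2 is hypothetical.

```python
from bs4 import BeautifulSoup

# Sample markup based on the output quoted in the question.
# The heading without a link is an added, hypothetical case.
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2>No link in this heading</h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
"""


def links_inside_h2(html):
    """Return the <a> tags found inside <h2> elements, without the <h2> wrapper."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = []
    for heading in soup.find_all("h2"):
        anchor = heading.find("a")  # None when the heading holds no link
        if anchor is not None:
            anchors.append(anchor)
    return anchors


if __name__ == "__main__":
    for anchor in links_inside_h2(SAMPLE_HTML):
        print(anchor)              # the <a ...>...</a> markup, minus the <h2>
        print(anchor.get("href"))  # just the URL
```

Printing a tag object writes out its markup, so the first print gives the bare <a> element and the second gives only the link target. The None check is there because a page may well contain h2 headings that hold no link at all.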






-- 
Joel Goldstick
http://joelgoldstick.com



Thread

strip away html tags from extracted links Max Cuban <edzeame@gmail.com> - 2013-11-29 08:56 -0800
  Re: strip away html tags from extracted links Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-29 17:33 +0000
  Re: strip away html tags from extracted links Chris Angelico <rosuav@gmail.com> - 2013-11-30 04:41 +1100
  Re: strip away html tags from extracted links Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-29 12:44 -0500
  Re: strip away html tags from extracted links Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-29 12:45 -0500
  Re: strip away html tags from extracted links Gene Heskett <gheskett@shentel.net> - 2013-11-29 13:45 -0500
