
strip away html tags from extracted links

Started by	Max Cuban <edzeame@gmail.com>
First post	2013-11-29 08:56 -0800
Last post	2013-11-29 13:45 -0500
Articles	6 — 5 participants


  strip away html tags from extracted links Max Cuban <edzeame@gmail.com> - 2013-11-29 08:56 -0800
    Re: strip away html tags from extracted links Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-29 17:33 +0000
    Re: strip away html tags from extracted links Chris Angelico <rosuav@gmail.com> - 2013-11-30 04:41 +1100
    Re: strip away html tags from extracted links Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-29 12:44 -0500
    Re: strip away html tags from extracted links Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-29 12:45 -0500
    Re: strip away html tags from extracted links Gene Heskett <gheskett@shentel.net> - 2013-11-29 13:45 -0500

#60772 — strip away html tags from extracted links

From	Max Cuban <edzeame@gmail.com>
Date	2013-11-29 08:56 -0800
Subject	strip away html tags from extracted links
Message-ID	<ab5d3c8b-401f-458d-9701-fa283936a6ff@googlegroups.com>

I have the following code to extract certain links from a webpage:

from bs4 import BeautifulSoup
import urllib2, sys
import re

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    print soup.find_all('h2')

The links are contained in the 'h2' tags so I get the links as follows:

<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>

But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:

<a href="/en/cashiers-accra">cashiers </a>
<a href="/en/cake-baker-accra">Cake baker</a>
<a href="/en/automobile-technician-accra">Automobile Technician</a>
<a href="/en/marketing-officer-accra-4">Marketing Officer</a>
 

I therefore updated my code to look like this:

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('h2')
    for tag in invalid_tag:
        for match in jobs(tag):
            match.replaceWithChildren()
    print jobs

But I couldn't get it to work, even though I thought that was the best logic I could come up with. I'm a newbie, though, so I know there is something better that could be done.

Any help will be gratefully appreciated

Thanks
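
A minimal sketch of two likely reasons the updated function fails, written for Python 3 and needing no network access. First, ('h2') is just the string 'h2' rather than a one-element tuple, so the outer loop visits the characters 'h' and '2'. Second, find_all returns a list-like result, and a list cannot be called the way jobs(tag) tries to. The plain list below only stands in for that result, and the names as_tuple, chars_seen, tags_seen and call_error are made up for illustration.

```python
# ('h2') is a parenthesised string; the trailing comma is what makes a tuple.
invalid_tag = ('h2')
as_tuple = ('h2',)

chars_seen = [tag for tag in invalid_tag]  # loops over the characters
tags_seen = [tag for tag in as_tuple]      # loops over the tag names

# Assumption: a plain list stands in for the list-like result of find_all.
jobs = ['<h2>...</h2>']
try:
    jobs('h2')  # lists are not callable, so this raises TypeError
    call_error = None
except TypeError as exc:
    call_error = type(exc).__name__

print(chars_seen, tags_seen, call_error)
```

If that is what happens, the traceback should end in a TypeError on the jobs(tag) line, before replaceWithChildren() is ever reached.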


#60773

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-11-29 17:33 +0000
Message-ID	<mailman.3402.1385746399.18130.python-list@python.org>
In reply to	#60772

On 29/11/2013 16:56, Max Cuban wrote:
> I have the following code to extract certain links from a webpage:
>
> from bs4 import BeautifulSoup
> import urllib2, sys
> import re
>
> def tonaton():
>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>      req = urllib2.Request(site, headers=hdr)
>      jobpass = urllib2.urlopen(req)
>      invalid_tag = ('h2')
>      soup = BeautifulSoup(jobpass)
>      print soup.find_all('h2')
>
> The links are contained in the 'h2' tags so I get the links as follows:
>
> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
> <h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>
> But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:
>
> <a href="/en/cashiers-accra">cashiers </a>
> <a href="/en/cake-baker-accra">Cake baker</a>
> <a href="/en/automobile-technician-accra">Automobile Technician</a>
> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>
>
> I therefore updated my code to look like this:
>
> def tonaton():
>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>      req = urllib2.Request(site, headers=hdr)
>      jobpass = urllib2.urlopen(req)
>      invalid_tag = ('h2')
>      soup = BeautifulSoup(jobpass)
>      jobs = soup.find_all('h2')
>      for tag in invalid_tag:
>          for match in jobs(tag):
>              match.replaceWithChildren()
>      print jobs
>
> But I couldn't get it to work, even though  I thought that was the best logic i could come up with.I'm a newbie though so I know there is something better that could be done.
>
> Any help will be gracefully appreciated
>
> Thanks
>

Please help us to help you.  A good start is your versions of Python
and OS.  But more importantly here, what does "couldn't get it to work"
mean?  Is the output not what you expected?  Or do you get a traceback?
If so, please give us the whole of the output, not just the last line.

One last thing, I observe that you've a gmail address.  This is 
currently guaranteed to send shivers down my spine.  So if you're using 
google groups, would you be kind enough to read and action this, 
https://wiki.python.org/moin/GoogleGroupsPython, thanks.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence


#60774

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-30 04:41 +1100
Message-ID	<mailman.3403.1385746882.18130.python-list@python.org>
In reply to	#60772

On Sat, Nov 30, 2013 at 4:33 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> One last thing, I observe that you've a gmail address.  This is currently
> guaranteed to send shivers down my spine.  So if you're using google groups,
> would you be kind enough to read and action this,
> https://wiki.python.org/moin/GoogleGroupsPython, thanks.

Don't blame all gmail users, some of us are using the mailing list. :)
You should be able to check the headers - with the email posts,
there's an Injection-Info header which cites Google Groups. Presumably
you get the same or similar if you read as a newsgroup.

And the OP was, indeed, using GG. Why is it suddenly so popular?

ChrisA


#60775

From	Joel Goldstick <joel.goldstick@gmail.com>
Date	2013-11-29 12:44 -0500
Message-ID	<mailman.3404.1385747062.18130.python-list@python.org>
In reply to	#60772


On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:

> On 29/11/2013 16:56, Max Cuban wrote:
>
>> I have the following code to extract certain links from a webpage:
>>
>> from bs4 import BeautifulSoup
>> import urllib2, sys
>> import re
>>
>> def tonaton():
>>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>>      req = urllib2.Request(site, headers=hdr)
>>      jobpass = urllib2.urlopen(req)
>>      invalid_tag = ('h2')
>>      soup = BeautifulSoup(jobpass)
>>      print soup.find_all('h2')
>>
>> The links are contained in the 'h2' tags so I get the links as follows:
>>
>> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
>> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
>> <h2><a href="/en/automobile-technician-accra">Automobile
>> Technician</a></h2>
>> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>>
>> But I'm interested in getting rid of all the 'h2' tags so that I have
>> links only in this manner:
>>
>> <a href="/en/cashiers-accra">cashiers </a>
>> <a href="/en/cake-baker-accra">Cake baker</a>
>> <a href="/en/automobile-technician-accra">Automobile Technician</a>
>> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>>
>>
This is more a Beautiful Soup question than a Python one.  Have you gone
through their tutorial?  They have an example that looks close here:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

In your case, you want the href value of the <a> child of each h2 element.

So this might be close (untested):

for link in soup.find_all('a'):
    print (link.a.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
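
Separately from collecting href values, the replaceWithChildren() attempt in the original post can be sketched with bs4's unwrap(), which removes a tag but leaves its children in the tree. This is a sketch for Python 3, run on a small fragment of the quoted markup instead of the live page. The wrapping <div>, the inline html string and the names remaining_h2 and result are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a fragment of the markup quoted in the thread, wrapped in a <div>.
html = (
    '<div>'
    '<h2><a href="/en/cashiers-accra">cashiers </a></h2>'
    '<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of matches, so it is safe to edit the tree in the loop.
for h2 in soup.find_all("h2"):
    h2.unwrap()  # drop the <h2> tag itself, keep the <a> inside it

remaining_h2 = soup.find_all("h2")
result = str(soup.div)
print(result)
```

Note that unwrap() changes the parsed tree rather than the list returned by find_all, so the stripped markup has to be read back from the soup (here via soup.div).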






-- 
Joel Goldstick
http://joelgoldstick.com


#60776

From	Joel Goldstick <joel.goldstick@gmail.com>
Date	2013-11-29 12:45 -0500
Message-ID	<mailman.3405.1385747163.18130.python-list@python.org>
In reply to	#60772


On Fri, Nov 29, 2013 at 12:44 PM, Joel Goldstick <joel.goldstick@gmail.com> wrote:

>
>
>
> On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
>
>> On 29/11/2013 16:56, Max Cuban wrote:
>>
>>> I have the following code to extract certain links from a webpage:
>>>
>>> from bs4 import BeautifulSoup
>>> import urllib2, sys
>>> import re
>>>
>>> def tonaton():
>>>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>>>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>>>      req = urllib2.Request(site, headers=hdr)
>>>      jobpass = urllib2.urlopen(req)
>>>      invalid_tag = ('h2')
>>>      soup = BeautifulSoup(jobpass)
>>>      print soup.find_all('h2')
>>>
>>> The links are contained in the 'h2' tags so I get the links as follows:
>>>
>>> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
>>> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
>>> <h2><a href="/en/automobile-technician-accra">Automobile
>>> Technician</a></h2>
>>> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>>>
>>> But I'm interested in getting rid of all the 'h2' tags so that I have
>>> links only in this manner:
>>>
>>> <a href="/en/cashiers-accra">cashiers </a>
>>> <a href="/en/cake-baker-accra">Cake baker</a>
>>> <a href="/en/automobile-technician-accra">Automobile Technician</a>
>>> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>>>
>>>
>>> This is more a beautiful soup question than python.  Have you gone
>>> through their tutorial.  Check here:
>>>
>>
> They have an example that looks close here:
> http://www.crummy.com/software/BeautifulSoup/bs4/doc/
>
> One common task is extracting all the URLs found within a page’s <a> tags:
>
> for link in soup.find_all('a'):
>     print(link.get('href'))
> # http://example.com/elsie
> # http://example.com/lacie
> # http://example.com/tillie
>
> In your case, you want the href value of the <a> child of each h2 element.
>
> So this might be close (untested)
>

Pardon my typo.  Try this:

>
> for link in soup.find_all('h2'):
>     print (link.a.get('href'))
> # http://example.com/elsie
> # http://example.com/lacie
> # http://example.com/tillie
>
>
>
>
>
>
> --
> Joel Goldstick
> http://joelgoldstick.com
>
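
A minimal sketch of the corrected loop, written for Python 3 and run against a fragment of the markup from the original post, not the live site. The inline html string and the names links and hrefs are illustrative assumptions. Here h2.a is the <a> tag itself, so printing it gives the bare link that was asked for, while .get('href') gives only the URL.

```python
from bs4 import BeautifulSoup

# Assumption: two of the <h2> entries quoted in the original post.
html = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for h2 in soup.find_all("h2"):
    a = h2.a  # first <a> inside this <h2>, or None if there is none
    if a is not None:
        links.append(a)

hrefs = [a.get("href") for a in links]

for a in links:
    print(a)  # the <a> tag without its <h2> wrapper
```

The None check is there because an <h2> without a link would otherwise raise AttributeError on .get.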



-- 
Joel Goldstick
http://joelgoldstick.com


#60777

From	Gene Heskett <gheskett@shentel.net>
Date	2013-11-29 13:45 -0500
Message-ID	<mailman.3406.1385751173.18130.python-list@python.org>
In reply to	#60772

On Friday 29 November 2013 13:44:57 Chris Angelico did opine:

> On Sat, Nov 30, 2013 at 4:33 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> > One last thing, I observe that you've a gmail address.  This is
> > currently guaranteed to send shivers down my spine.  So if you're
> > using google groups, would you be kind enough to read and action
> > this,
> > https://wiki.python.org/moin/GoogleGroupsPython, thanks.
> 
> Don't blame all gmail users, some of us are using the mailing list. :)
> You should be able to check the headers - with the email posts,
> there's an Injection-Info header which cites Google Groups. Presumably
> you get the same or similar if you read as a newsgroup.
> 
> And the OP was, indeed, using GG. Why is it so suddenly so popular?
> 
> ChrisA

Thank you for that hint Chris, it should enhance my enjoyment of this list.

Cheers, Gene
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

There is a 20% chance of tomorrow.
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
         law-abiding citizens.

