MIME-Version: 1.0
In-Reply-To: <l7aj48$84p$1@ger.gmane.org>
References: <ab5d3c8b-401f-458d-9701-fa283936a6ff@googlegroups.com> <l7aj48$84p$1@ger.gmane.org>
Date: Fri, 29 Nov 2013 12:44:13 -0500
Subject: Re: strip away html tags from extracted links
From: Joel Goldstick <joel.goldstick@gmail.com>
To: Mark Lawrence <breamoreboy@yahoo.co.uk>
Content-Type: multipart/alternative; boundary=089e0122f348e10e6004ec5460b8
Cc: "python-list@python.org" <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3404.1385747062.18130.python-list@python.org>

--089e0122f348e10e6004ec5460b8
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:

> On 29/11/2013 16:56, Max Cuban wrote:
>
>> I have the following code to extract certain links from a webpage:
>>
>> from bs4 import BeautifulSoup
>> import urllib2, sys
>> import re
>>
>> def tonaton():
>>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>>      req = urllib2.Request(site, headers=hdr)
>>      jobpass = urllib2.urlopen(req)
>>      invalid_tag = ('h2')
>>      soup = BeautifulSoup(jobpass)
>>      print soup.find_all('h2')
>>
>> The links are contained in the 'h2' tags so I get the links as follows:
>>
>> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
>> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
>> <h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
>> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>>
>> But I'm interested in getting rid of all the 'h2' tags so that I have
>> links only in this manner:
>>
>> <a href="/en/cashiers-accra">cashiers </a>
>> <a href="/en/cake-baker-accra">Cake baker</a>
>> <a href="/en/automobile-technician-accra">Automobile Technician</a>
>> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>>
>>
>> This is more a Beautiful Soup question than a Python one.  Have you gone
>> through their tutorial?  Check here:
>>
>
They have an example that looks close here:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

In your case, you want the href values of the <a> children of the h2 elements.

So this might be close (untested)

for heading in soup.find_all('h2'):
    # heading.a is the first <a> tag inside this <h2>
    print(heading.a.get('href'))
# /en/cashiers-accra
# /en/cake-baker-accra
# /en/automobile-technician-accra
# /en/marketing-officer-accra-4
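
If you want the <a> tags themselves rather than only the href values, something along these lines may be closer to what you asked for. This is only a sketch: the sample markup is copied from your output, the heading without a link is my own addition to show the None case, and links_in_headings is a made-up helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the output quoted above. The heading with no
# link is an assumption, added to show how a missing <a> is handled.
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2>No link in this heading</h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
"""

def links_in_headings(html):
    """Return the <a> tags found inside <h2> elements (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = []
    for heading in soup.find_all("h2"):
        anchor = heading.find("a")  # None when the heading has no link
        if anchor is not None:
            anchors.append(anchor)
    return anchors

anchors = links_in_headings(SAMPLE_HTML)
for anchor in anchors:
    print(anchor)              # the <a ...>...</a> tag without its h2 wrapper
    print(anchor.get("href"))  # just the URL
```

Printing a tag gives its markup, so the first print shows the link without the surrounding h2. On the real page you would pass the response you already read with urllib2 instead of SAMPLE_HTML.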






-- 
Joel Goldstick
http://joelgoldstick.com

--089e0122f348e10e6004ec5460b8--