Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <CAPM-O+yXHomAcHWOVVb4Kn98fc5EV5=EUGJYZq==Wb3-NttbZA@mail.gmail.com>
References: <ab5d3c8b-401f-458d-9701-fa283936a6ff@googlegroups.com> <l7aj48$84p$1@ger.gmane.org> <CAPM-O+yXHomAcHWOVVb4Kn98fc5EV5=EUGJYZq==Wb3-NttbZA@mail.gmail.com>
Date: Fri, 29 Nov 2013 12:45:54 -0500
Subject: Re: strip away html tags from extracted links
From: Joel Goldstick <joel.goldstick@gmail.com>
To: Mark Lawrence <breamoreboy@yahoo.co.uk>
Content-Type: multipart/alternative; boundary=047d7bd6b4b2ea2c8504ec5466f2
Cc: "python-list@python.org" <python-list@python.org>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.3405.1385747163.18130.python-list@python.org>
Lines: 199
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:60776

--047d7bd6b4b2ea2c8504ec5466f2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Fri, Nov 29, 2013 at 12:44 PM, Joel Goldstick
<joel.goldstick@gmail.com> wrote:

>
> On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
>
>> On 29/11/2013 16:56, Max Cuban wrote:
>>
>>> I have the following code to extract certain links from a webpage:
>>>
>>> from bs4 import BeautifulSoup
>>> import urllib2, sys
>>> import re
>>>
>>> def tonaton():
>>>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>>>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>>>      req = urllib2.Request(site, headers=hdr)
>>>      jobpass = urllib2.urlopen(req)
>>>      invalid_tag = ('h2')
>>>      soup = BeautifulSoup(jobpass)
>>>      print soup.find_all('h2')
>>>
>>> The links are contained in the 'h2' tags so I get the links as follows:
>>>
>>> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
>>> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
>>> <h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
>>> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>>>
>>> But I'm interested in getting rid of all the 'h2' tags so that I have
>>> links only in this manner:
>>>
>>> <a href="/en/cashiers-accra">cashiers </a>
>>> <a href="/en/cake-baker-accra">Cake baker</a>
>>> <a href="/en/automobile-technician-accra">Automobile Technician</a>
>>> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>>>
>>>
>> This is more a Beautiful Soup question than a Python one.  Have you gone
>> through their tutorial?  Check here:
>>
> They have an example that looks close here:
> http://www.crummy.com/software/BeautifulSoup/bs4/doc/
>
> One common task is extracting all the URLs found within a page’s <a> tags:
>
> for link in soup.find_all('a'):
>     print(link.get('href'))
> # http://example.com/elsie
> # http://example.com/lacie
> # http://example.com/tillie
>
> In your case, you want the href value of the <a> child of each h2 element.
>
> So this might be close (untested):
>

Pardon my typo.  Try this:

>
> for link in soup.find_all('h2'):
>     print(link.a.get('href'))
> # For the h2 tags quoted above, this should print:
> # /en/cashiers-accra
> # /en/cake-baker-accra
> # /en/automobile-technician-accra
> # /en/marketing-officer-accra-4
>
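
If you want the <a> tags themselves rather than just the href values, something along these lines may help. It is only a sketch under a few assumptions: Python 3, the standard-library "html.parser" backend, and the sample markup quoted above instead of a live fetch of the page. The helper name h2_links and the extra link-less heading are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output quoted earlier in this thread.
# The last heading is hypothetical: it shows what happens when an <h2>
# contains no link at all.
html = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
<h2>No link in this heading</h2>
"""

soup = BeautifulSoup(html, "html.parser")


def h2_links(soup):
    """Return the <a> tag inside each <h2>, skipping headings without one."""
    links = []
    for heading in soup.find_all("h2"):
        anchor = heading.a  # first <a> descendant, or None if there is none
        if anchor is not None:
            links.append(anchor)
    return links


links = h2_links(soup)
hrefs = [a.get("href") for a in links]

for a in links:
    print(a)              # the whole tag, without the surrounding <h2>
    print(a.get("href"))  # just the URL
```

The None check matters because link.a is None for a heading with no link, and calling .get() on None raises AttributeError.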



-- 
Joel Goldstick
http://joelgoldstick.com

--047d7bd6b4b2ea2c8504ec5466f2--