Re: strip away html tags from extracted links

From Mark Lawrence <breamoreboy@yahoo.co.uk>
Subject Re: strip away html tags from extracted links
Date Fri, 29 Nov 2013 17:33:01 +0000
Newsgroups comp.lang.python
Message-ID <mailman.3402.1385746399.18130.python-list@python.org> (permalink)

On 29/11/2013 16:56, Max Cuban wrote:
> I have the following code to extract certain links from a webpage:
>
> from bs4 import BeautifulSoup
> import urllib2, sys
> import re
>
> def tonaton():
>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>      req = urllib2.Request(site, headers=hdr)
>      jobpass = urllib2.urlopen(req)
>      invalid_tag = ('h2')
>      soup = BeautifulSoup(jobpass)
>      print soup.find_all('h2')
>
> The links are contained in the 'h2' tags so I get the links as follows:
>
> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
> <h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>
> But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:
>
> <a href="/en/cashiers-accra">cashiers </a>
> <a href="/en/cake-baker-accra">Cake baker</a>
> <a href="/en/automobile-technician-accra">Automobile Technician</a>
> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>
>
> I therefore updated my code to look like this:
>
> def tonaton():
>      site = "http://tonaton.com/en/job-vacancies-in-ghana"
>      hdr = {'User-Agent' : 'Mozilla/5.0'}
>      req = urllib2.Request(site, headers=hdr)
>      jobpass = urllib2.urlopen(req)
>      invalid_tag = ('h2')
>      soup = BeautifulSoup(jobpass)
>      jobs = soup.find_all('h2')
>      for tag in invalid_tag:
>          for match in jobs(tag):
>              match.replaceWithChildren()
>      print jobs
>
> But I couldn't get it to work, even though I thought that was the best logic I could come up with. I'm a newbie though, so I know there is something better that could be done.
>
> Any help will be gratefully appreciated
>
> Thanks
>
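
For what it's worth, here is a minimal sketch of one way to end up with
the bare links from the quoted markup.  It uses the standard library's
html.parser rather than BeautifulSoup, so it is a swapped-in technique
and not a fix to the code above.  The class name H2LinkExtractor, the
helper links_in_h2 and the sample string are made up for illustration,
and it assumes Python 3.

```python
from html.parser import HTMLParser


class H2LinkExtractor(HTMLParser):
    """Collect (href, text) for every <a> that sits inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._in_h2 = False    # True while inside an <h2> element
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text fragments of the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            # Only anchors nested in an <h2> are of interest.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def links_in_h2(html):
    """Hypothetical helper: return the (href, text) pairs found in html."""
    parser = H2LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup modelled on the output quoted above, plus one link
# outside an <h2> to show that it is skipped.
sample = (
    '<h2><a href="/en/cashiers-accra">cashiers </a></h2>\n'
    '<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>\n'
    '<p><a href="/en/ignored">not in an h2</a></p>\n'
)

for href, text in links_in_h2(sample):
    print('<a href="%s">%s</a>' % (href, text))
```

With bs4 itself the same idea would presumably be to take the `a` child
of each `h2` that find_all returns, instead of unwrapping tags in place.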

Please help us to help you.  A good start is to tell us your versions of
Python and OS.  But more importantly here, what does "couldn't get it to
work" mean?  Is the output not what you expected?  Or do you get a
traceback?  If so, please give us the whole of the output, not just the
last line.
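
As an illustration of what "the whole of the output" looks like, here is
a small sketch that captures a complete traceback with the standard
traceback module.  It is a guess at the failure, not a diagnosis: it
assumes the find_all result behaves like a plain list (a plain list
stands in for it here, so bs4 is not needed), and it also shows that
('h2') is a string rather than a one-element tuple.  The names as_tuple
and report are made up for illustration.

```python
import traceback

invalid_tag = ('h2')     # parentheses alone do not make a tuple: this is the str 'h2'
as_tuple = ('h2',)       # the trailing comma makes a one-element tuple

print(type(invalid_tag).__name__, list(invalid_tag))
print(type(as_tuple).__name__, list(as_tuple))

jobs = ['<h2>...</h2>']  # plain list standing in for the find_all result

report = None
try:
    for tag in invalid_tag:        # iterates over 'h' and '2', not 'h2'
        for match in jobs(tag):    # calling a list raises TypeError
            pass
except TypeError:
    # format_exc() returns the full traceback text, not just the last line.
    report = traceback.format_exc()

print(report)
```

Posting the text held in report, from the "Traceback (most recent call
last):" line down to the final exception line, is what makes a problem
like this quick to spot.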

One last thing, I observe that you've a gmail address.  This is 
currently guaranteed to send shivers down my spine.  So if you're using 
google groups, would you be kind enough to read and action this, 
https://wiki.python.org/moin/GoogleGroupsPython, thanks.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence



Thread

strip away html tags from extracted links Max Cuban <edzeame@gmail.com> - 2013-11-29 08:56 -0800
  Re: strip away html tags from extracted links Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-29 17:33 +0000
  Re: strip away html tags from extracted links Chris Angelico <rosuav@gmail.com> - 2013-11-30 04:41 +1100
  Re: strip away html tags from extracted links Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-29 12:44 -0500
  Re: strip away html tags from extracted links Joel Goldstick <joel.goldstick@gmail.com> - 2013-11-29 12:45 -0500
  Re: strip away html tags from extracted links Gene Heskett <gheskett@shentel.net> - 2013-11-29 13:45 -0500
