| Started by | SAKTHEESH <s.a.saktheesh@gmail.com> |
|---|---|
| First post | 2011-07-20 11:18 -0700 |
| Last post | 2011-07-22 04:47 +0200 |
| Articles | 2 — 2 participants |
A little complex usage of Beautiful Soup Parsing Help! SAKTHEESH <s.a.saktheesh@gmail.com> - 2011-07-20 11:18 -0700
Re: A little complex usage of Beautiful Soup Parsing Help! Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-07-22 04:47 +0200
| From | SAKTHEESH <s.a.saktheesh@gmail.com> |
|---|---|
| Date | 2011-07-20 11:18 -0700 |
| Subject | A little complex usage of Beautiful Soup Parsing Help! |
| Message-ID | <7436d1f9-43b2-4565-ad40-30897b3e8410@t7g2000vbv.googlegroups.com> |
I am using Beautiful Soup to parse an HTML document and find all text that is not
contained inside any anchor element.
I came up with this code, which finds all links that have an href attribute, but not the
other way around.
How can I modify this code to get only the plain text using Beautiful
Soup, so that I can do some find-and-replace and modify the soup?
for a in soup.findAll('a', href=True):
    print a['href']
Example:
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
</div>
</body></html>
Output:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
I am doing this to find the text that is not within `<a></a>`, then
find "Identify" and replace it with "Replaced".
So the final output will look like this:
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Replaced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
</div>
</body></html>
Thanks for your time and help!
| From | Thomas 'PointedEars' Lahn <PointedEars@web.de> |
|---|---|
| Date | 2011-07-22 04:47 +0200 |
| Message-ID | <1506545.6Eb0tbSXBe@PointedEars.de> |
| In reply to | #9987 |
SAKTHEESH wrote:

> I am using Beautiful Soup to parse a html to find all text that is Not
> contained inside any anchor elements
>
> I came up with this code which finds all links within href

_anchors_ _with_ `href' _attribute_ (commonly: links.)

> but not the other way around.

What would that be anyway?

> How can I modify this code to get only plain text using Beautiful
> Soup, so that I can do some find and replace and modify the soup?

RTFM: <http://www.crummy.com/software/BeautifulSoup/documentation.html#contents>

-- 
PointedEars

Bitte keine Kopien per E-Mail. / Please do not Cc: me.
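For readers who land on this thread: below is a minimal sketch of one way to do what the original post asks. It is not taken from either post. It assumes the newer `bs4` package (Beautiful Soup 4) rather than the Beautiful Soup 3 API used in the question, and the helper names `text_nodes_outside_anchors` and `replace_outside_anchors` are made up for illustration. The idea is to walk every text node, skip those with an `<a>` ancestor, and swap the remaining nodes for edited copies.

```python
from bs4 import BeautifulSoup

# Trimmed version of the sample document from the original post.
HTML = """<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><a href="www.test2.com/identify">test2</a></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
</div>
</body></html>"""


def text_nodes_outside_anchors(soup):
    """Return the non-blank text nodes that have no <a> ancestor."""
    return [
        node
        for node in soup.find_all(string=True)
        if node.strip() and node.find_parent("a") is None
    ]


def replace_outside_anchors(html, old, new):
    """Replace `old` with `new` in text outside links; return the soup."""
    soup = BeautifulSoup(html, "html.parser")
    # The helper returns a plain list, so modifying the tree
    # inside the loop does not disturb the iteration.
    for node in text_nodes_outside_anchors(soup):
        if old in node:
            node.replace_with(node.replace(old, new))
    return soup


soup = replace_outside_anchors(HTML, "Identify", "Replaced")
print(soup)
```

Attribute values such as the `href` URLs are not text nodes, so they are left alone, as is the link text itself. The match is case-sensitive, so the lowercase "identified" is also unchanged, as in the expected output in the question.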