
A little complex usage of Beautiful Soup Parsing Help!

Started by: SAKTHEESH <s.a.saktheesh@gmail.com>
First post: 2011-07-20 11:18 -0700
Last post: 2011-07-22 04:47 +0200
Articles: 2 (2 participants)


#9987 — A little complex usage of Beautiful Soup Parsing Help!

From: SAKTHEESH <s.a.saktheesh@gmail.com>
Date: 2011-07-20 11:18 -0700
Subject: A little complex usage of Beautiful Soup Parsing Help!
Message-ID: <7436d1f9-43b2-4565-ad40-30897b3e8410@t7g2000vbv.googlegroups.com>

I am using Beautiful Soup to parse an HTML document and find all text that is
not contained inside any anchor element.

I came up with this code which finds all links within href but not the
other way around.

How can I modify this code to get only plain text using Beautiful
Soup, so that I can do some find and replace and modify the soup?

    # Print the href value of every anchor that has an href attribute.
    for a in soup.findAll('a', href=True):
        print(a['href'])


Example:

    <html><body>
     <div> <a href="www.test1.com/identify">test1</a> </div>
     <div><br></div>
     <div><a href="www.test2.com/identify">test2</a></div>
     <div><br></div><div><br></div>
     <div>
       This should be identified

       Identify me 1

       Identify me 2
       <p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
     </div>
    </body></html>

Output:

    This should be identified
    Identify me 1
    Identify me 2
    This paragraph should be identified.

I want to find the text that is not within `<a></a>`, then find
"Identify" in that text and replace it with "Replaced".

So the final output will be like this:

    <html><body>
     <div> <a href="www.test1.com/identify">test1</a> </div>
     <div><br></div>
     <div><a href="www.test2.com/identify">test2</a></div>
     <div><br></div><div><br></div>
     <div>
       This should be identified

       Replaced me 1

       Replaced me 2
       <p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
     </div>
    </body></html>

Thanks for your time and help!



#10062

From: Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date: 2011-07-22 04:47 +0200
Message-ID: <1506545.6Eb0tbSXBe@PointedEars.de>
In reply to: #9987

SAKTHEESH wrote:

> I am using Beautiful Soup to parse an HTML document and find all text that is
> not contained inside any anchor element.
> 
> I came up with this code which finds all links within href 

_anchors_ _with_ an `href` _attribute_ (commonly: links).

> but not the other way around.

What would that be anyway?

> How can I modify this code to get only plain text using Beautiful
> Soup, so that I can do some find and replace and modify the soup?

RTFM: 
<http://www.crummy.com/software/BeautifulSoup/documentation.html#contents>
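
As a minimal sketch of where that documentation leads, assuming the later `bs4` package (Beautiful Soup 4) rather than the Beautiful Soup 3 API used in the question: collect the text nodes that have no `<a>` ancestor, then rewrite them in place. The helper names `text_nodes_outside_anchors` and `replace_outside_anchors` are hypothetical, and the sample markup is abridged from the example in the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup abridged from the question.
HTML = """<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div>
   This should be identified

   Identify me 1

   Identify me 2
   <p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
 </div>
</body></html>"""


def text_nodes_outside_anchors(soup):
    """Yield every non-blank text node that has no <a> ancestor."""
    for node in soup.find_all(string=True):
        if node.strip() and node.find_parent("a") is None:
            yield node


def replace_outside_anchors(soup, old, new):
    """Replace old with new in text outside <a>; return how many nodes changed."""
    changed = 0
    # Build the list first, because the tree is modified during the loop.
    for node in list(text_nodes_outside_anchors(soup)):
        if old in node:
            node.replace_with(NavigableString(node.replace(old, new)))
            changed += 1
    return changed


soup = BeautifulSoup(HTML, "html.parser")

# Plain text outside links, with runs of whitespace collapsed for display.
texts = [" ".join(node.split()) for node in text_nodes_outside_anchors(soup)]

# Case-sensitive replacement; attribute values and link text are untouched.
changed = replace_outside_anchors(soup, "Identify", "Replaced")
result = str(soup)
```

With this sample, `texts` starts with the text of the last `<div>` and leaves out `test1` and `test2`, and `result` contains "Replaced me 1" and "Replaced me 2" while the `href` values keep their lowercase "identify". Note that `find_all(string=True)` can also return comments and other string-like nodes, so filter those separately if the input has any.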

-- 
PointedEars

Please do not Cc: me.
