How to pick content from HTML using BeautifulSoup

From Sheetal Singh <sheetalsingh@shopzilla.com>
To "python-list@python.org" <python-list@python.org>
Subject How to pick content from HTML using BeautifulSoup
Date Tue, 10 Jul 2012 04:02:28 +0000
Newsgroups comp.lang.python
Message-ID <mailman.1975.1341893033.4697.python-list@python.org>


Hi,

I am new to Python. I need to fetch the names of the side filters from a page and save them to a CSV file [screenshot attached].

The following is a snippet from my code:
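    # Assumes BeautifulSoup 3 under Python 2, as the rest of the snippet does
    from BeautifulSoup import BeautifulStoneSoup

    soup = BeautifulStoneSoup(html)

    # Earlier attempts, kept for reference:
    # for e in soup.findAll('div'):
    #     for c in e.findAll('h3'):
    #         for d in c.findAll('li'):
    #             print '@@@@@@@', d.extract()
    #
    # select_pod = soup.findAll('div', {"class": "win aboutUs"})
    # promeg = select_pod[0].findAll("p")[0]
    #
    # for dv in soup.findAll('div', {"class": "attribution"}):
    #     ds = dv.findAll("<h3>")
    #     print ds

    select_pod = soup.findAll('div')
    print select_pod
    for j in select_pod:
        # The method is findAll (capital A), not findall
        print j.findAll('a')

    # findAll takes the bare tag name ('h3', not '<h3>') and has to be
    # called on the soup or on a tag, not on the list that findAll returns
    promeg = soup.findAll('h3')
    #print '--', promeg

    #hreflist = [each.get('value') for each in soup.findAll('<h3>')]

    for m in promeg:
        if m:
            print 'Data values', m
            # fd1 (a csv writer), x and i are defined elsewhere in the script
            fd1.writerow([x[2], m, i[0], "Data Found"])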


Structure of HTML:

<div class="attribution">
<div>
<h3>By Brand</h3>
<ul>
<li>
<a href="http://www.xyz.com/cellphones/nokia/nokia/259-33902/buy">Nokia</a>
</li>
<li>
<li>
<li>
<li>
<li>
<li>
<li>
<li class="more">
</ul>
</div>
<div>
<h3>By Seller</h3>
<ul>
<li>
<a id="att_296935_184059" class="attributeUrlReplacementTarget" href="http://www.xyz.com/cellphones/nokia/amazon-marketplace/296935-184059/buy">Amazon Marketplace</a>
<input id="att_296935_184059_replacement" type="hidden" value="http://www.xyz.com/cellphones/nokia/amazon-marketplace/296935-184059/buy">
</li>
<li>
<li>
<li>
<li>
<li>
<li>
<li>
<li class="more">
</ul>
</div>
<div>
<div>
</div>
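
For a structure like the one above, one possible way to group the link texts under their headings is sketched below. It is only an illustration under stated assumptions: it is written for Python 3 and uses the standard library's html.parser in place of BeautifulSoup so that it runs without third-party packages; FilterParser and SAMPLE_HTML are hypothetical names, and the sample markup is an abbreviated, well-formed version of the structure shown above.

```python
from html.parser import HTMLParser


class FilterParser(HTMLParser):
    """Collect link texts under each <h3> heading (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.groups = []       # list of (heading, [link texts]) pairs
        self._capture = None   # "h3" or "a" while inside that tag
        self._text = []        # text pieces of the tag being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "a"):
            self._capture = tag
            self._text = []

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._text).strip()
        if tag == "h3":
            self.groups.append((text, []))
        elif self.groups:      # ignore links that precede the first heading
            self.groups[-1][1].append(text)
        self._capture = None


# Abbreviated sample, assumed to mirror the structure in the post
SAMPLE_HTML = """
<div class="attribution">
  <div>
    <h3>By Brand</h3>
    <ul>
      <li><a href="http://www.xyz.com/cellphones/nokia/nokia/259-33902/buy">Nokia</a></li>
    </ul>
  </div>
  <div>
    <h3>By Seller</h3>
    <ul>
      <li>
        <a id="att_296935_184059" class="attributeUrlReplacementTarget" href="http://www.xyz.com/cellphones/nokia/amazon-marketplace/296935-184059/buy">Amazon Marketplace</a>
        <input id="att_296935_184059_replacement" type="hidden" value="http://www.xyz.com/cellphones/nokia/amazon-marketplace/296935-184059/buy">
      </li>
    </ul>
  </div>
</div>
"""

parser = FilterParser()
parser.feed(SAMPLE_HTML)
parser.close()
groups = parser.groups
for heading, names in groups:
    print(heading, names)
```

The parser treats every link that follows a heading as belonging to that heading until the next h3 appears, which matches the div/h3/ul layout above but would need adjusting for pages laid out differently.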


Output required in csv:

By Brand
Nokia
Samsung
.
.

By Seller
Amazon
Buy.com
.
.
.
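
A layout like the one above, with each heading followed by its names and one value per row, could be produced with the csv module roughly as follows. This is a sketch under assumptions, not a definitive implementation: it targets Python 3, write_filter_groups is a hypothetical helper, the blank row between groups is an assumption about the desired layout, and the values are the ones listed in the required output.

```python
import csv
import io


def write_filter_groups(groups, out):
    """Write each heading followed by its names, one value per row."""
    writer = csv.writer(out, lineterminator="\n")
    for index, (heading, names) in enumerate(groups):
        if index:
            writer.writerow([])   # blank row between groups (assumed layout)
        writer.writerow([heading])
        for name in names:
            writer.writerow([name])


# Values taken from the required output in the post
groups = [
    ("By Brand", ["Nokia", "Samsung"]),
    ("By Seller", ["Amazon", "Buy.com"]),
]

buffer = io.StringIO()            # stands in for a real file in this sketch
write_filter_groups(groups, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same function can be given a file opened with open(path, "w", newline=""), where path is whatever output file name is wanted.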



Please suggest how to fetch these details.

Sheetal Singh
