Groups > comp.lang.python > #67950 > unrolled thread

beautiful soup get class info

Started by: teddybubu@gmail.com
First post: 2014-03-06 12:22 -0800
Last post: 2014-03-12 08:36 +0100
Articles: 8 (5 participants)



Contents

  beautiful soup get class info teddybubu@gmail.com - 2014-03-06 12:22 -0800
    Re: beautiful soup get class info John Gordon <gordon@panix.com> - 2014-03-06 20:58 +0000
      Re: beautiful soup get class info teddybubu@gmail.com - 2014-03-06 13:38 -0800
        Re: beautiful soup get class info John Gordon <gordon@panix.com> - 2014-03-06 22:28 +0000
          Re: beautiful soup get class info teddybubu@gmail.com - 2014-03-06 17:37 -0800
            Re: beautiful soup get class info Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-03-07 01:48 +0000
    Re: beautiful soup get class info Christopher Welborn <cjwelborn@live.com> - 2014-03-11 21:04 -0500
    Re: beautiful soup get class info Peter Otten <__peter__@web.de> - 2014-03-12 08:36 +0100

#67950 — beautiful soup get class info

From: teddybubu@gmail.com
Date: 2014-03-06 12:22 -0800
Subject: beautiful soup get class info
Message-ID: <e73d29eb-17bb-472e-bdc4-c38ca904c60f@googlegroups.com>

I am using beautifulsoup to get the title and date of the website.
title is working fine but I am not able to pull the date. Here is the code in the url:

 <span class="date">October 22, 2011</span>

In Python, I am using the following code:
date1 = soup.span.text
data=soup.find_all(date="value") 

Results in:

[]
March 5, 2014

What is the proper way to get this info?
Thanks.



#67951

From: John Gordon <gordon@panix.com>
Date: 2014-03-06 20:58 +0000
Message-ID: <lfanh4$mna$1@reader1.panix.com>
In reply to: #67950

In <e73d29eb-17bb-472e-bdc4-c38ca904c60f@googlegroups.com> teddybubu@gmail.com writes:

>  <span class="date">October 22, 2011</span>

> date1 = soup.span.text
> data=soup.find_all(date="value") 

Try this:

soup.find_all(name="span", class="date")

-- 
John Gordon         Imagine what it must be like for a real medical doctor to
gordon@panix.com    watch 'House', or a real serial killer to watch 'Dexter'.



#67952

From: teddybubu@gmail.com
Date: 2014-03-06 13:38 -0800
Message-ID: <ae5b837c-501d-498e-bd3a-3b2c709c42b0@googlegroups.com>
In reply to: #67951

On Thursday, March 6, 2014 2:58:12 PM UTC-6, John Gordon wrote:
> In <e73d29eb-17bb-472e-bdc4-c38ca904c60f@googlegroups.com> teddy writes:
>
> >  <span class="date">October 22, 2011</span>
>
> > date1 = soup.span.text
> > data=soup.find_all(date="value") 
>
> Try this:
>
> soup.find_all(name="span", class="date")
>
> -- 
> John Gordon         Imagine what it must be like for a real medical doctor to
>     watch 'House', or a real serial killer to watch 'Dexter'.

I have python 2.7.2 and it does not like class in the code you provided. Now when I take out [ class="date"], this is returned:
   [<span class="date">March 5, 2014</span>, <span class="date">March 5, 2014</span>]
 
This is the code I am using: "data = soup.find_all(name="span") 
print (data)"
1. it returns today's date instead of the actual date
2. returns it twice



#67958

From: John Gordon <gordon@panix.com>
Date: 2014-03-06 22:28 +0000
Message-ID: <lfaspm$998$1@reader1.panix.com>
In reply to: #67952

In <ae5b837c-501d-498e-bd3a-3b2c709c42b0@googlegroups.com> teddybubu@gmail.com writes:

> > soup.find_all(name="span", class="date")

> I have python 2.7.2 and it does not like class in the code you provided.

Oh right, 'class' is a reserved word.  I imagine beautifulsoup has
a workaround for that.
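
As an illustration of why the call quoted above is rejected before BeautifulSoup ever runs, here is a minimal sketch using only the standard library. The helper `show_kwargs` is hypothetical and simply echoes its keyword arguments; it is not part of BeautifulSoup.

```python
import keyword


def show_kwargs(**kwargs):
    """Hypothetical stand-in for any function that accepts keyword arguments."""
    return kwargs


# 'class' is a reserved word, so it cannot be written as a keyword-argument name.
is_reserved = keyword.iskeyword("class")

# Compiling the call is enough to hit the error; nothing is executed.
try:
    compile('show_kwargs(name="span", class="date")', "<example>", "eval")
    compiled_ok = True
except SyntaxError:
    compiled_ok = False

# Dictionary unpacking sidesteps the parser, because the key is only a string.
unpacked = show_kwargs(**{"class": "date"})
```

The failure is a parse-time `SyntaxError`, so it is independent of the library being called; any workaround has to avoid writing `class=` literally.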

> Now when I take out [ class="date"], this is returned:
>    [<span class="date">March 5, 2014</span>, <span class="date">March 5, 2014</span>]
>  
> This is the code I am using: "data = soup.find_all(name="span") 
> print (data)"
> 1. it returns today's date instead of the actual date
> 2. returns it twice

Are there two occurrences of '<span class="date">March 5, 2014</span>'
in the HTML?  If so, then beautifulsoup is doing its job correctly.
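
A small sketch of that behaviour, assuming Beautiful Soup 4 is installed as `bs4`. The HTML below is made up for illustration and is not the page from the original question.

```python
from bs4 import BeautifulSoup

# Made-up sample: the same date span appears twice, as it might on a real page.
html = """
<div><span class="date">March 5, 2014</span></div>
<div><span class="date">March 5, 2014</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag, so repeats in the page show up here.
all_spans = soup.find_all("span")

# find() returns only the first match (or None when nothing matches).
first_span = soup.find("span")
```

If only one value is wanted, `find()` or indexing into the `find_all()` result is usually enough.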

It might help if you posted the sample HTML data you're working with.

-- 
John Gordon         Imagine what it must be like for a real medical doctor to
gordon@panix.com    watch 'House', or a real serial killer to watch 'Dexter'.



#67971

From: teddybubu@gmail.com
Date: 2014-03-06 17:37 -0800
Message-ID: <c303cbad-d790-43ce-a88d-2068ec8e371c@googlegroups.com>
In reply to: #67958

On Thursday, March 6, 2014 4:28:06 PM UTC-6, John Gordon wrote:
> In <ae5b837c-501d-498e-bd3a-3b2c709c42b0@googlegroups.com>  writes:
>
> > > soup.find_all(name="span", class="date")
>
> > I have python 2.7.2 and it does not like class in the code you provided.
>
> Oh right, 'class' is a reserved word.  I imagine beautifulsoup has
> a workaround for that.
>
> > Now when I take out [ class="date"], this is returned:
> >    [<span class="date">March 5, 2014</span>, <span class="date">March 5, 2014</span>]
> >  
> > This is the code I am using: "data = soup.find_all(name="span") 
> > print (data)"
> > 1. it returns today's date instead of the actual date
> > 2. returns it twice
>
> Are there two occurrences of '<span class="date">March 5, 2014</span>'
> in the HTML?  If so, then beautifulsoup is doing its job correctly.
>
> It might help if you posted the sample HTML data you're working with.
>
> -- 
> John Gordon         Imagine what it must be like for a real medical doctor to
>    watch 'House', or a real serial killer to watch 'Dexter'.

ok I got this working. now to the next problem.... thanks.



#67974

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2014-03-07 01:48 +0000
Message-ID: <mailman.7886.1394156960.18130.python-list@python.org>
In reply to: #67971

On 07/03/2014 01:37, teddybubu@gmail.com wrote:
> On Thursday, March 6, 2014 4:28:06 PM UTC-6, John Gordon wrote:
>> In <ae5b837c-501d-498e-bd3a-3b2c709c42b0@googlegroups.com>  writes:
>>
>>
>>
>>>> soup.find_all(name="span", class="date")
>>
>>
>>
>>> I have python 2.7.2 and it does not like class in the code you provided.
>>
>>
>>
>> Oh right, 'class' is a reserved word.  I imagine beautifulsoup has
>>
>> a workaround for that.
>>
>>
>>
>>> Now when I take out [ class="date"], this is returned:
>>
>>>     [<span class="date">March 5, 2014</span>, <span class="date">March 5, 2014</span>]
>>
>>>
>>
>>> This is the code I am using: "data = soup.find_all(name="span")
>>
>>> print (data)"
>>
>>> 1. it returns today's date instead of the actual date
>>
>>> 2. returns it twice
>>
>>
>>
>> Are there two occurrences of '<span class="date">March 5, 2014</span>'
>>
>> in the HTML?  If so, then beautifulsoup is doing its job correctly.
>>
>>
>>
>> It might help if you posted the sample HTML data you're working with.
>>
>>
>>
>> --
>>
>> John Gordon         Imagine what it must be like for a real medical doctor to
>>
>>     watch 'House', or a real serial killer to watch 'Dexter'.
>
> ok I got this working. now to the next problem.... thanks.
>

I'm pleased to see that you have a solution.  Now, should you wish to 
ask further questions, would you please read and action this first 
https://wiki.python.org/moin/GoogleGroupsPython to prevent us seeing the 
double line spacing above, thanks.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence




#68257

From: Christopher Welborn <cjwelborn@live.com>
Date: 2014-03-11 21:04 -0500
Message-ID: <mailman.8069.1394589869.18130.python-list@python.org>
In reply to: #67950

On 03/06/2014 02:22 PM, teddybubu@gmail.com wrote:
> I am using beautifulsoup to get the title and date of the website.
> title is working fine but I am not able to pull the date. Here is the code in the url:
>
>   <span class="date">October 22, 2011</span>
>
> In Python, I am using the following code:
> date1 = soup.span.text
> data=soup.find_all(date="value")
>
> Results in:
>
> []
> March 5, 2014
>
> What is the proper way to get this info?
> Thanks.
>

I believe it's the 'attrs' argument.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

# Workaround the 'class' problem:
data = soup.find_all(attrs={'class': 'date'})

I haven't tested it, but it's worth looking into.

-- 
\¯\      /¯/\
  \ \/¯¯\/ / / Christopher Welborn (cj)
   \__/\__/ /  cjwelborn at live·com
    \__/\__/   http://welbornprod.com



#68262

From: Peter Otten <__peter__@web.de>
Date: 2014-03-12 08:36 +0100
Message-ID: <mailman.8074.1394609816.18130.python-list@python.org>
In reply to: #67950

Christopher Welborn wrote:

> On 03/06/2014 02:22 PM, teddybubu@gmail.com wrote:
>> I am using beautifulsoup to get the title and date of the website.
>> title is working fine but I am not able to pull the date. Here is the
>> code in the url:
>>
>>   <span class="date">October 22, 2011</span>
>>
>> In Python, I am using the following code:
>> date1 = soup.span.text
>> data=soup.find_all(date="value")
>>
>> Results in:
>>
>> []
>> March 5, 2014
>>
>> What is the proper way to get this info?
>> Thanks.
>>
> 
> I believe it's the 'attrs' argument.
> http://www.crummy.com/software/BeautifulSoup/bs4/doc/
> 
> # Workaround the 'class' problem:
> data = soup.find_all(attrs={'class': 'date'})
> 
> I haven't tested it, but it's worth looking into.
 
Yes, there are two ways to filter by class:

>>> import bs4
>>> soup = bs4.BeautifulSoup("""
... <span class="one">alpha</span>
... <span class="two">beta</span>""")

Use attrs:

>>> soup.find_all(attrs={"class": "one"})
[<span class="one">alpha</span>]

Append an underscore:

>>> soup.find_all(class_="two")
[<span class="two">beta</span>]
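
Applied to the markup from the original question, the same idea might look like the sketch below. It assumes Beautiful Soup 4 is installed as `bs4`; the surrounding HTML (the `h1` and the `byline` span) is invented for illustration, since the real page was never posted.

```python
from bs4 import BeautifulSoup

# Invented page fragment; only the date span comes from the original question.
html = (
    '<h1>Title</h1>'
    '<span class="byline">staff</span>'
    '<span class="date">October 22, 2011</span>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.span is the first <span> in the document, whatever its class is.
first_span_text = soup.span.get_text()

# date="value" looks for an attribute literally named "date" equal to "value".
no_match = soup.find_all(date="value")

# Filtering on tag name and class together picks out the intended element.
date_tag = soup.find("span", class_="date")
date_text = date_tag.get_text(strip=True) if date_tag is not None else None
```

This may explain the results in the first post: `soup.span.text` reads whichever span comes first, and `find_all(date="value")` matches nothing because no tag carries a `date` attribute.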
