
Question about how to do something in BeautifulSoup?

Started by	inhahe <inhahe@gmail.com>
First post	2016-01-22 09:01 -0500
Last post	2016-01-22 19:59 -0800
Articles	2 — 2 participants


  Question about how to do something in BeautifulSoup? inhahe <inhahe@gmail.com> - 2016-01-22 09:01 -0500
    Re: Question about how to do something in BeautifulSoup? "Mario R. Osorio" <nimbiotics@gmail.com> - 2016-01-22 19:59 -0800

#102015 — Question about how to do something in BeautifulSoup?

From	inhahe <inhahe@gmail.com>
Date	2016-01-22 09:01 -0500
Subject	Question about how to do something in BeautifulSoup?
Message-ID	<mailman.167.1453471306.15297.python-list@python.org>

I hope this is an appropriate mailing list for BeautifulSoup questions;
it's been a long time since I've used python-list and I don't remember if
third-party modules are on topic. I did try posting to the BeautifulSoup
mailing list on Google Groups, but I've waited a day or two and my message
hasn't been approved yet.

Say I have the following HTML (I hope this shows up as plain text here
rather than being rendered):

<div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is
today the day?"</strong></em></span></div>

And I want to extract the "Is today the day?" part. There are other places
in the document with <em> and <strong>, but this is the only place that
uses color #000000, so I want to extract anything that's within a color
#000000 style, even if it's nested multiple levels deep within that.

- Sometimes the color is defined as RGB(0, 0, 0) and sometimes it's defined
as #000000.
- Sometimes the <strong> is within the <em> and sometimes the <em> is
within the <strong>.
- There may be other discrepancies I haven't noticed yet.

How can I do this in BeautifulSoup (or is this better done in lxml.html)?
Thanks
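
One possible approach, offered as a minimal sketch rather than a definitive answer: pass a function as the style filter to BeautifulSoup's find_all(), so that any tag whose inline style sets the color to black is matched, and then call get_text() on it, which gathers text from all descendants regardless of how <em> and <strong> are nested. The names BLACK_COLOR, is_black and black_text are made up for this illustration, and the regular expression assumes only the two spellings mentioned above (#000000 and rgb(0, 0, 0), in any letter case); other spellings such as #000 or "black" would need to be added.

```python
import re

from bs4 import BeautifulSoup

# Assumption: black is written only as #000000 or rgb(0, 0, 0).
# The lookbehind keeps properties such as background-color from matching.
BLACK_COLOR = re.compile(
    r"(?<![-\w])color\s*:\s*(?:#000000|rgb\(\s*0\s*,\s*0\s*,\s*0\s*\))",
    re.IGNORECASE,
)


def is_black(style):
    """Return True if an inline style attribute sets color to black."""
    # BeautifulSoup passes None when the tag has no style attribute.
    return style is not None and BLACK_COLOR.search(style) is not None


def black_text(html):
    """Collect the text of every element styled black, however deeply nested."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(style=is_black)
    # get_text() joins the text of all descendants; split/join collapses
    # the line break that falls inside the quoted sentence.
    return [" ".join(tag.get_text().split()) for tag in tags]


html = '''<div style="font-size: 20pt;"><span style="color: RGB(0, 0, 0);"><strong><em>"Is
today the day?"</em></strong></span></div>'''

print(black_text(html))
```

Note that if one black-styled element is nested inside another, both are matched and the inner text appears twice in the result; filtering out tags that have a matching ancestor would be one way to handle that.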


#102027

From	"Mario R. Osorio" <nimbiotics@gmail.com>
Date	2016-01-22 19:59 -0800
Message-ID	<1239e5bd-e0c3-4e13-b3a4-c5c59c2298aa@googlegroups.com>
In reply to	#102015

I think you'd do better using the pyparsing library.



