| Started by | inhahe <inhahe@gmail.com> |
|---|---|
| First post | 2016-01-22 09:01 -0500 |
| Last post | 2016-01-22 19:59 -0800 |
| Articles | 2 — 2 participants |
Question about how to do something in BeautifulSoup? inhahe <inhahe@gmail.com> - 2016-01-22 09:01 -0500
Re: Question about how to do something in BeautifulSoup? "Mario R. Osorio" <nimbiotics@gmail.com> - 2016-01-22 19:59 -0800
| From | inhahe <inhahe@gmail.com> |
|---|---|
| Date | 2016-01-22 09:01 -0500 |
| Subject | Question about how to do something in BeautifulSoup? |
| Message-ID | <mailman.167.1453471306.15297.python-list@python.org> |
I hope this is an appropriate mailing list for BeautifulSoup questions, it's been a long time since I've used python-list and I don't remember if third-party modules are on topic. I did try posting to the BeautifulSoup mailing list on Google groups, but I've waited a day or two and my message hasn't been approved yet.

Say I have the following HTML (I hope this shows up as plain text here rather than formatting):

<div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is today the day?"</strong></em></span></div>

And I want to extract the "Is today the day?" part. There are other places in the document with <em> and <strong>, but this is the only place that uses color #000000, so I want to extract anything that's within a color #000000 style, even if it's nested multiple levels deep within that.

- Sometimes the color is defined as RGB(0, 0, 0) and sometimes it's defined as #000000
- Sometimes the <strong> is within the <em> and sometimes the <em> is within the <strong>.
- There may be other discrepancies I haven't noticed yet

How can I do this in BeautifulSoup (or is this better done in lxml.html)?

Thanks
| From | "Mario R. Osorio" <nimbiotics@gmail.com> |
|---|---|
| Date | 2016-01-22 19:59 -0800 |
| Message-ID | <1239e5bd-e0c3-4e13-b3a4-c5c59c2298aa@googlegroups.com> |
| In reply to | #102015 |
I think you'd do better using the pyparsing library

On Friday, January 22, 2016 at 9:02:00 AM UTC-5, inhahe wrote:
> I hope this is an appropriate mailing list for BeautifulSoup questions,
> it's been a long time since I've used python-list and I don't remember if
> third-party modules are on topic. I did try posting to the BeautifulSoup
> mailing list on Google groups, but I've waited a day or two and my message
> hasn't been approved yet.
>
> Say I have the following HTML (I hope this shows up as plain text here
> rather than formatting):
>
> <div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is
> today the day?"</strong></em></span></div>
>
> And I want to extract the "Is today the day?" part. There are other places
> in the document with <em> and <strong>, but this is the only place that
> uses color #000000, so I want to extract anything that's within a color
> #000000 style, even if it's nested multiple levels deep within that.
>
> - Sometimes the color is defined as RGB(0, 0, 0) and sometimes it's defined
> as #000000
> - Sometimes the <strong> is within the <em> and sometimes the <em> is
> within the <strong>.
> - There may be other discrepancies I haven't noticed yet
>
> How can I do this in BeautifulSoup (or is this better done in lxml.html)?
> Thanks
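
For the BeautifulSoup route the question asks about, here is a minimal sketch of one possible approach; it is not something either poster wrote. It assumes the bs4 package is installed and that the color is always set through an inline style attribute (not a stylesheet or a named color such as "black"). The names BLACK and black_text are made up for illustration. The idea is to match the style attribute with a regular expression that accepts both spellings of the color, then collect all text below each matching tag, so the nesting order of <em> and <strong> does not matter.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: accepts "color: #000000", "color: #000" and
# "color: rgb(0, 0, 0)" in any letter case. The lookbehind keeps
# properties such as "background-color" from matching.
BLACK = re.compile(
    r"(?<![-\w])color\s*:\s*(?:#0{6}|#0{3}|rgb\(\s*0\s*,\s*0\s*,\s*0\s*\))(?![0-9a-f])",
    re.IGNORECASE,
)

def black_text(html):
    """Return the text found inside each outermost tag styled with black."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all(style=BLACK):
        # Skip tags nested inside another matching tag, since the outer
        # tag already contributes their text.
        if tag.find_parent(style=BLACK) is not None:
            continue
        # stripped_strings walks every descendant, however deeply nested.
        results.append(" ".join(tag.stripped_strings))
    return results

sample = ('<div style="font-size: 20pt;"><span style="color: #000000;">'
          '<em><strong>"Is today the day?"</strong></em></span></div>')
print(black_text(sample))  # ['"Is today the day?"']
```

If other discrepancies turn up (extra whitespace, a different color notation), only the regular expression should need adjusting.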