| Started by | John Salerno <johnjsal@gmail.com> |
|---|---|
| First post | 2012-03-06 14:43 -0800 |
| Last post | 2012-03-09 02:45 -0800 |
| Articles | 20 on this page of 42 — 16 participants |
What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 14:43 -0800
Re: What's the best way to write this regular expression? Chris Rebert <clp2@rebertia.com> - 2012-03-06 14:52 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 15:02 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 15:05 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 15:25 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 15:33 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 15:33 -0800
Re: What's the best way to write this regular expression? Ian Kelly <ian.g.kelly@gmail.com> - 2012-03-06 16:35 -0700
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 17:39 -0600
Re: What's the best way to write this regular expression? Terry Reedy <tjreedy@udel.edu> - 2012-03-06 20:04 -0500
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 15:05 -0800
Re: What's the best way to write this regular expression? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-06 23:44 +0000
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 15:57 -0800
RE: What's the best way to write this regular expression? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-07 00:04 +0000
Re: What's the best way to write this regular expression? Terry Reedy <tjreedy@udel.edu> - 2012-03-06 20:06 -0500
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 15:02 -0800
Re: What's the best way to write this regular expression? Roy Smith <roy@panix.com> - 2012-03-06 20:26 -0500
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-06 23:02 -0800
Re: What's the best way to write this regular expression? Paul Rubin <no.email@nospam.invalid> - 2012-03-07 02:36 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-07 12:39 -0800
Re: What's the best way to write this regular expression? Ian Kelly <ian.g.kelly@gmail.com> - 2012-03-07 14:01 -0700
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-07 15:11 -0600
Re: What's the best way to write this regular expression? alex23 <wuwei23@gmail.com> - 2012-03-08 19:38 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-08 19:52 -0800
Re: What's the best way to write this regular expression? Benjamin Kaplan <benjamin.kaplan@case.edu> - 2012-03-07 16:27 -0500
RE: What's the best way to write this regular expression? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-07 21:31 +0000
Re: What's the best way to write this regular expression? Ian Kelly <ian.g.kelly@gmail.com> - 2012-03-07 14:34 -0700
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-07 15:44 -0600
Re: RE: What's the best way to write this regular expression? Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-07 16:02 -0600
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-07 23:26 -0800
Re: What's the best way to write this regular expression? Chris Angelico <rosuav@gmail.com> - 2012-03-08 16:03 +1100
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-07 23:25 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-08 13:33 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-08 13:40 -0800
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-08 13:52 -0800
Re: What's the best way to write this regular expression? John Gordon <gordon@panix.com> - 2012-03-08 21:54 +0000
Re: What's the best way to write this regular expression? Dave Angel <d@davea.name> - 2012-03-08 17:19 -0500
Re: What's the best way to write this regular expression? John Salerno <johnjsal@gmail.com> - 2012-03-08 16:25 -0600
RE: What's the best way to write this regular expression? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-08 23:02 +0000
Re: What's the best way to write this regular expression? Dave Angel <d@davea.name> - 2012-03-08 18:23 -0500
Re: What's the best way to write this regular expression? Ethan Furman <ethan@stoneleaf.us> - 2012-03-08 14:52 -0800
Re: What's the best way to write this regular expression? jkn <jkn_gg@nicorp.f9.co.uk> - 2012-03-09 02:45 -0800
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-06 14:43 -0800 |
| Subject | What's the best way to write this regular expression? |
| Message-ID | <12783654.1174.1331073814011.JavaMail.geo-discussion-forums@yner4> |
I sort of have to work with what the website gives me (as you'll see below), but today I encountered an exception to my RE. Let me just give all the specific information first. The point of my script is to go to the specified URL and extract song information from it.
This is my RE:
song_pattern = re.compile(r'([0-9]{1,2}:[0-9]{2} [a|p].m.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>', re.DOTALL)
This is how the website is formatted:
4:25 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">AP TX SOC CPAS TRF</a></strong><br /><br /></div></li><li ><div class="cmPlaylistTime">
4:21 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">No One Else On Earth</a></strong><br /><a href="/lsp/a1924/">Wynonna</a><br /></div></li><li ><div class="cmPlaylistTime">
4:19 p.m.
</div><div class="cmPlaylistImage"><img src="http://media.cmgdigital.com/shared/amg/pic200/drp100/p109/p10901ruw7x_r85x85.jpg?998f84231a014ed68123ddb508af9480570dc122" alt="Moe Bandy" class="cmDarkBoxShadow cmPhotoBorderWhite"/></div><div class="cmPlaylistContent"><strong><a href="/lsp/t15101/">It' A Cheating Situation</a></strong><br /><a href="/lsp/a5307/">Moe Bandy</a><br /><span class="sprite iconVoteUp">Votes (1) </span></div></li><li ><div class="cmPlaylistTime">
4:15 p.m.
</div><div class="cmPlaylistImage"><img src="http://media.cmgdigital.com/shared/amg/pic200/drp700/p744/p74493d85qy_r85x85.jpg?998f84231a014ed68123ddb508af9480570dc122" alt="Reba McEntire" class="cmDarkBoxShadow cmPhotoBorderWhite"/></div><div class="cmPlaylistContent"><strong><a href="/lsp/t14437/">Somebody Should Leave</a></strong><br /><a href="/lsp/a396/">REBA McENTIRE</a> & <a href="/lsp/a5765/">LINDA DAVIS</a><br /></div></li><li ><div class="cmPlaylistTime">
There's something of a pattern, although it's not always perfect. The time is listed first, and then the song information in <a> tags. However, in this particular case, you can see that for the 4:25pm entry, "AP TX SOC CPAS TRF" is extracted for the song title, and then the RE skips to the next entry in order to find the next <a> tags, which is actually the name of the next song in the list, instead of being the artist as normal. (Of course, I have no idea what AP TX SOC CPAS TRF is anyway. Usually the website doesn't list commercials or anomalies like that.)
So my first question is basic: am I even extracting the information properly? It works almost all the time, but because the website is such a mess, I pretty much have to rely on the tags being in the proper places (as they were NOT in this case!).
The second question is, to fix the above problem, would it be sufficient to rewrite my RE so that it has to find all of the specified information, i.e. a time followed by two <a> entries, BEFORE it moves on to finding the next time? I think that would have caused it to skip the 4:25 entry above, and only extract entries that have a time followed by two <a> entries (song and artist).
If this is possible, how do I rewrite it so that it has to match all the conditions without skipping over the next time entry in order to do so?
Thanks.
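For the second question, one possible answer is sketched below. It is only an illustration built on a trimmed copy of the markup quoted above, not a recommendation to keep parsing HTML with REs. The names `GAP` and `bounded` are made up for this example. The idea is to replace each `.*?` with a gap that refuses to step over the next `cmPlaylistTime` marker, and to capture link text with `[^<]*` so a capture cannot swallow tags either. It also assumes link text never contains nested tags.

```python
import re

# Trimmed sample modelled on the markup quoted in the post.
SAMPLE = """<li ><div class="cmPlaylistTime">
4:25 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">AP TX SOC CPAS TRF</a></strong><br /><br /></div></li><li ><div class="cmPlaylistTime">
4:21 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">No One Else On Earth</a></strong><br /><a href="/lsp/a1924/">Wynonna</a><br /></div></li>"""

# The pattern from the post: each ".*?" is free to run into the next entry.
original = re.compile(
    r'([0-9]{1,2}:[0-9]{2} [a|p].m.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>',
    re.DOTALL)

# Hypothetical variant. GAP matches any run of characters that does not
# step over the start of the next entry's time div.
GAP = r'(?:(?!cmPlaylistTime).)*?'
bounded = re.compile(
    r'([0-9]{1,2}:[0-9]{2} [ap]\.m\.)'      # time; note [ap], not [a|p]
    + GAP + r'<a[^>]*>([^<]*)</a>'          # first link: song title
    + GAP + r'<a[^>]*>([^<]*)</a>',         # second link: artist
    re.DOTALL)

print(original.findall(SAMPLE))  # pairs the 4:25 title with the 4:21 title
print(bounded.findall(SAMPLE))   # skips 4:25, keeps the 4:21 entry
```

Two side notes on the original pattern: inside a character class `|` is a literal character, so `[a|p]` also matches a pipe, and the unescaped `.` in `.m.` matches any character rather than just a period.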
| From | Chris Rebert <clp2@rebertia.com> |
|---|---|
| Date | 2012-03-06 14:52 -0800 |
| Message-ID | <mailman.442.1331074333.3037.python-list@python.org> |
| In reply to | #21279 |
On Tue, Mar 6, 2012 at 2:43 PM, John Salerno <johnjsal@gmail.com> wrote:
> I sort of have to work with what the website gives me (as you'll see below), but today I encountered an exception to my RE. Let me just give all the specific information first. The point of my script is to go to the specified URL and extract song information from it.
>
> This is my RE:
>
> song_pattern = re.compile(r'([0-9]{1,2}:[0-9]{2} [a|p].m.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>', re.DOTALL)
I would advise against using regular expressions to "parse" HTML:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
lxml is a popular choice for parsing HTML in Python: http://lxml.de
Cheers,
Chris
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-06 15:02 -0800 |
| Message-ID | <mailman.443.1331074966.3037.python-list@python.org> |
| In reply to | #21280 |
On Tuesday, March 6, 2012 4:52:10 PM UTC-6, Chris Rebert wrote:
> On Tue, Mar 6, 2012 at 2:43 PM, John Salerno <johnjsal@gmail.com> wrote:
> > I sort of have to work with what the website gives me (as you'll see below), but today I encountered an exception to my RE. Let me just give all the specific information first. The point of my script is to go to the specified URL and extract song information from it.
> >
> > This is my RE:
> >
> > song_pattern = re.compile(r'([0-9]{1,2}:[0-9]{2} [a|p].m.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>', re.DOTALL)
>
> I would advise against using regular expressions to "parse" HTML:
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>
> lxml is a popular choice for parsing HTML in Python: http://lxml.de
>
> Cheers,
> Chris
Thanks, that was an interesting read :)
Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-06 15:05 -0800 |
| Message-ID | <28285433.1413.1331075139309.JavaMail.geo-discussion-forums@ynbq18> |
| In reply to | #21281 |
> Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
I should ask though...are there alternatives already bundled with Python that I could use? Now that you mention it, I remember something called HTMLParser (or something like that) and I have no idea why I never looked into that before I messed with REs.
Thanks.
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-06 15:25 -0800 |
| Message-ID | <10944614.4302.1331076340313.JavaMail.geo-discussion-forums@ynne2> |
| In reply to | #21282 |
On Tuesday, March 6, 2012 5:05:39 PM UTC-6, John Salerno wrote:
> > Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
>
> I should ask though...are there alternatives already bundled with Python that I could use? Now that you mention it, I remember something called HTMLParser (or something like that) and I have no idea why I never looked into that before I messed with REs.
>
> Thanks.
Also, I just noticed Beautiful Soup, which seems appropriate. I suppose any will do, but knowing the pros and cons would help with a decision.
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-06 15:33 -0800 |
| Message-ID | <28368927.1236.1331076818291.JavaMail.geo-discussion-forums@yner4> |
| In reply to | #21282 |
On Tuesday, March 6, 2012 5:05:39 PM UTC-6, John Salerno wrote:
> > Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
>
> I should ask though...are there alternatives already bundled with Python that I could use? Now that you mention it, I remember something called HTMLParser (or something like that) and I have no idea why I never looked into that before I messed with REs.
>
> Thanks.
::sigh:: I'm having some trouble with the new Google Groups interface. It seems to double post, and in this case didn't post at all. If it did already, I apologize. I'll try to figure out what's happening, or just switch to a real newsgroup program.
Anyway, my question was about Beautiful Soup. I read on the doc page that BS sits on top of a parser, such as html.parser or lxml. So I'm guessing the difference is that the parser is a little more "low level," whereas BS offers a higher-level way of using it? Is BS easier to write code with, while still using the power of lxml?
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-03-06 16:35 -0700 |
| Message-ID | <mailman.447.1331076963.3037.python-list@python.org> |
| In reply to | #21282 |
On Tue, Mar 6, 2012 at 4:05 PM, John Salerno <johnjsal@gmail.com> wrote:
>> Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
>
> I should ask though...are there alternatives already bundled with Python that I could use? Now that you mention it, I remember something called HTMLParser (or something like that) and I have no idea why I never looked into that before I messed with REs.
HTMLParser is pretty basic, although it may be sufficient for your needs. It just converts an html document into a stream of start tags, end tags, and text, with no guarantee that the tags will actually correspond in any meaningful way. lxml can be used to output an actual hierarchical structure that may be easier to manipulate and extract data from.
Cheers,
Ian
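To make the "stream of start tags, end tags, and text" concrete, here is a minimal sketch using the standard library's `html.parser` (the Python 3 name; in Python 2 the module is `HTMLParser`). The `EventLogger` class is a made-up helper for illustration, and the fragment is a shortened piece of the markup from the first post.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the parser's callbacks as a flat list of events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs
            self.events.append(("text", text))


logger = EventLogger()
logger.feed('<strong><a href="/lsp/t7672/">No One Else On Earth</a></strong>'
            '<br /><a href="/lsp/a1924/">Wynonna</a><br />')
logger.close()

# The parser reports events in document order; it builds no tree.
for event in logger.events:
    print(event)
```

Anything like "the second link inside this entry is the artist" has to be tracked by hand in the callbacks, which is the bookkeeping a tree-building library takes over.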
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-06 17:39 -0600 |
| Message-ID | <mailman.448.1331077211.3037.python-list@python.org> |
| In reply to | #21282 |
Thanks. I'm thinking the choice might be between lxml and Beautiful Soup, but since BS uses lxml as a parser, I'm trying to figure out the difference between them. I don't necessarily need the simplest (html.parser), but I want to choose one that is simple enough yet powerful enough that I won't have to learn another method later.
On Tue, Mar 6, 2012 at 5:35 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Tue, Mar 6, 2012 at 4:05 PM, John Salerno <johnjsal@gmail.com> wrote:
>>> Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
>>
>> I should ask though...are there alternatives already bundled with Python that I could use? Now that you mention it, I remember something called HTMLParser (or something like that) and I have no idea why I never looked into that before I messed with REs.
>
> HTMLParser is pretty basic, although it may be sufficient for your needs. It just converts an html document into a stream of start tags, end tags, and text, with no guarantee that the tags will actually correspond in any meaningful way. lxml can be used to output an actual hierarchical structure that may be easier to manipulate and extract data from.
>
> Cheers,
> Ian
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-03-06 20:04 -0500 |
| Message-ID | <mailman.452.1331082305.3037.python-list@python.org> |
| In reply to | #21282 |
On 3/6/2012 6:05 PM, John Salerno wrote:
>> Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
>
> I should ask though...are there alternatives already bundled with Python that I could use?
lxml is +- upward compatible with xml.etree in the stdlib.
--
Terry Jan Reedy
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-03-06 23:44 +0000 |
| Message-ID | <4f56a146$0$29989$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #21283 |
On Tue, 06 Mar 2012 15:05:39 -0800, John Salerno wrote:
>> Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new! :)
>
> I should ask though...are there alternatives already bundled with Python that I could use? Now that you mention it, I remember something called HTMLParser (or something like that) and I have no idea why I never looked into that before I messed with REs.
import htmllib
help(htmllib)
The help is pretty minimal and technical, you might like to google on a tutorial or two:
https://duckduckgo.com/html/?q=python%20htmllib%20tutorial
Also, you're still double-posting.
--
Steven
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-06 15:57 -0800 |
| Message-ID | <9496185.1483.1331078235707.JavaMail.geo-discussion-forums@ynbo9> |
| In reply to | #21291 |
> Also, you're still double-posting.
Grr. I just reported it to Google, but I think if I start to frequent the newsgroup again I'll have to switch to Thunderbird, or perhaps I'll just try switching back to the old Google Groups interface. I think the issue is the new interface.
Sorry.
| From | "Prasad, Ramit" <ramit.prasad@jpmorgan.com> |
|---|---|
| Date | 2012-03-07 00:04 +0000 |
| Message-ID | <mailman.449.1331079512.3037.python-list@python.org> |
| In reply to | #21292 |
> > Also, you're still double-posting.
>
> Grr. I just reported it to Google, but I think if I start to frequent the newsgroup again I'll have to switch to Thunderbird, or perhaps I'll just try switching back to the old Google Groups interface. I think the issue is the new interface.
>
> Sorry.
Oddly, I see no double posting for this thread on my end (email list).
Ramit
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-03-06 20:06 -0500 |
| Message-ID | <mailman.453.1331082608.3037.python-list@python.org> |
| In reply to | #21292 |
On 3/6/2012 6:57 PM, John Salerno wrote:
>> Also, you're still double-posting.
>
> Grr. I just reported it to Google, but I think if I start to frequent the newsgroup again I'll have to switch to Thunderbird, or perhaps I'll just try switching back to the old Google Groups interface. I think the issue is the new interface.
I am not seeing the double posting, but I use Thunderbird + the news.gmane.org mirrors of python-list and others.
--
Terry Jan Reedy
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-03-06 20:26 -0500 |
| Message-ID | <roy-7D61C2.20261106032012@news.panix.com> |
| In reply to | #21279 |
In article <12783654.1174.1331073814011.JavaMail.geo-discussion-forums@yner4>, John Salerno <johnjsal@gmail.com> wrote:
> I sort of have to work with what the website gives me (as you'll see below), but today I encountered an exception to my RE. Let me just give all the specific information first. The point of my script is to go to the specified URL and extract song information from it.
Rule #1: Don't try to parse XML, HTML, or any other kind of ML with regular expressions.
Rule #2: Use a dedicated ML parser. I like lxml (http://lxml.de/). There are other possibilities.
Rule #3: If in doubt, see rule #1.
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-06 23:02 -0800 |
| Message-ID | <0c1a1890-dc80-41b6-abea-f90324dd7d75@2g2000yqk.googlegroups.com> |
| In reply to | #21279 |
After a bit of reading, I've decided to use Beautiful Soup 4, with lxml as the parser. I considered simply using lxml to do all the work, but I just got lost in the documentation and tutorials. I couldn't find a clear explanation of how to parse an HTML file and then navigate its structure. The Beautiful Soup 4 documentation was very clear, and BS4 itself is so simple and Pythonic. And best of all, since version 4 no longer does the parsing itself, you can choose your own parser, and it works with lxml, so I'll still be using lxml, but with a nice, clean overlay for navigating the tree structure. Thanks for the advice!
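A rough sketch of what that approach could look like for the playlist markup from the first post. It assumes the third-party `beautifulsoup4` package is installed (the import is guarded so the snippet does nothing rather than crash where it is missing), and `extract_songs` is a hypothetical helper, not code from this thread. The default `"html.parser"` backend ships with Python; passing `"lxml"` instead should work once lxml is installed.

```python
try:
    from bs4 import BeautifulSoup  # third-party: the beautifulsoup4 package
except ImportError:
    BeautifulSoup = None

# Simplified copy of the markup from the first post (plain <li> tags).
SAMPLE = """\
<ul>
<li><div class="cmPlaylistTime">
4:25 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">AP TX SOC CPAS TRF</a></strong><br /><br /></div></li>
<li><div class="cmPlaylistTime">
4:21 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">No One Else On Earth</a></strong><br /><a href="/lsp/a1924/">Wynonna</a><br /></div></li>
<li><div class="cmPlaylistTime">
4:15 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t14437/">Somebody Should Leave</a></strong><br /><a href="/lsp/a396/">REBA McENTIRE</a> & <a href="/lsp/a5765/">LINDA DAVIS</a><br /></div></li>
</ul>
"""


def extract_songs(html, parser="html.parser"):
    """Return (time, title, artists) for entries that list at least one artist."""
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is not installed")
    soup = BeautifulSoup(html, parser)
    songs = []
    for item in soup.find_all("li"):
        time_div = item.find("div", class_="cmPlaylistTime")
        content = item.find("div", class_="cmPlaylistContent")
        if time_div is None or content is None:
            continue
        links = [a.get_text(strip=True) for a in content.find_all("a")]
        if len(links) < 2:
            continue  # title only, like the 4:25 p.m. anomaly
        songs.append((time_div.get_text(strip=True), links[0], links[1:]))
    return songs


if BeautifulSoup is not None:
    for song in extract_songs(SAMPLE):
        print(song)
```

Because each `<li>` is handled on its own, a title-only entry cannot borrow a link from the following entry, and a duet simply yields two artists.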
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-03-07 02:36 -0800 |
| Message-ID | <7x7gywofzh.fsf@ruckus.brouhaha.com> |
| In reply to | #21306 |
John Salerno <johnjsal@gmail.com> writes:
> The Beautiful Soup 4 documentation was very clear, and BS4 itself is so simple and Pythonic. And best of all, since version 4 no longer does the parsing itself, you can choose your own parser, and it works with lxml, so I'll still be using lxml, but with a nice, clean overlay for navigating the tree structure.
I haven't used BS4 but have made good use of earlier versions. Main thing to understand is that an awful lot of HTML in the real world is malformed and will break an XML parser or anything that expects syntactically valid HTML. People tend to write HTML that works well enough to render decently in browsers, whose parsers therefore have to be tolerant of errors. Beautiful Soup also tries to make sense of crappy, malformed HTML. Partly as a result, it's dog slow compared to any serious XML parser. But it works very well if you don't mind the low speed.
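A small stdlib-only sketch of that point, using a fragment modelled on the 4:15 p.m. entry from the first post (a bare `&` and an unclosed `<br>`). `xml_accepts` and `TextCollector` are made-up names for this illustration; it compares a strict XML parser with the tolerant `html.parser`, not Beautiful Soup itself.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Browser-friendly but not well-formed XML: bare "&" and an unclosed <br>.
FRAGMENT = ('<div><a href="/lsp/a396/">REBA McENTIRE</a> & '
            '<a href="/lsp/a5765/">LINDA DAVIS</a><br></div>')


def xml_accepts(markup):
    """Return True if the strict XML parser can parse the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class TextCollector(HTMLParser):
    """Tolerant HTML parser that just gathers the text it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed(FRAGMENT)
collector.close()

print(xml_accepts(FRAGMENT))       # the XML parser rejects the fragment
print("".join(collector.chunks))   # the HTML parser still yields the text
```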
| From | John Salerno <johnjsal@gmail.com> |
|---|---|
| Date | 2012-03-07 12:39 -0800 |
| Message-ID | <dfbc4383-8a94-4907-a841-51e72226b0bd@o16g2000yqg.googlegroups.com> |
| In reply to | #21279 |
Ok, first major roadblock. I have no idea how to install Beautiful Soup or lxml on Windows! All I can find are .tar files. Based on what I've read, I can use easy_install to install these types of files, but when I went to download the setuptools package, it only seemed to support Python 2.7. I'm using 3.2. Is 2.7 just the minimum version it requires? It didn't say something like "2.7+", so I wasn't sure, and I don't want to start installing a bunch of stuff that will clog up my directories and not even work.
What's the best way for me to install these two packages? I've also seen a reference to using setup.py...is that a separate package too, or is that something that comes with Python by default?
Thanks.