| Newsgroups | comp.lang.python |
|---|---|
| Date | 2015-11-25 15:37 -0800 |
| References | <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com> <mailman.96.1448484959.20593.python-list@python.org> <n35ckk$9q0$1@dont-email.me> <c1e43997-0da3-4b93-b9af-98a2568eff9d@googlegroups.com> <n35gc3$r1k$1@dont-email.me> |
| Message-ID | <e55c49e2-539d-4f62-89f7-34fa3882ff59@googlegroups.com> (permalink) |
| Subject | Re: Screen scraper to get all 'a title' elements |
| From | ryguy7272 <ryanshuell@gmail.com> |
On Wednesday, November 25, 2015 at 6:34:00 PM UTC-5, Grobu wrote:
> On 25/11/15 23:48, ryguy7272 wrote:
> >> re.findall( r'\<a[^>]+title="(.+?)"', html )
> [ ... ]
> > Thanks!! Is that regex? Can you explain exactly what it is doing?
> > Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
> >
> > Can you just please explain what it's doing???
> >
>
> Yes, it's a regular expression. Because regexes use the backslash as an
> escape character, it is advisable to use the "raw string" prefix (an r
> before the opening single/double/triple quote). To illustrate with an example :
> >>> print "1\n2"
> 1
> 2
> >>> print r"1\n2"
> 1\n2
> As the backslash escape character is "neutralized" by the raw string,
> you can use the usual RegEx syntax at leisure :
>
> \<a[^>]+title="(.+?)"
>
> \< was a mistake on my part, a single < is perfectly enough
> [^>] is a character class, and the caret (^) character indicates
> negation. Thus it means : any character other than >
> + indicates repetition : one or more of the previous element
> . will match any character except a newline
> .+" is a _greedy_ pattern : it matches as much as it can, up to the
> last double quote it can reach
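
To see the pieces working together, here is a minimal sketch in Python 3. The HTML snippet is made up for illustration and is not from the page discussed in this thread. It also shows one likely reason the pattern picks up more than expected: `<a` is also the start of other tag names such as `<abbr`.

```python
import re

# Hypothetical HTML, invented for this example only.
html = (
    '<a href="/one" title="First link">1</a>\n'
    '<a class="nav" href="/two" title="Second link">2</a>\n'
    '<a href="/three">no title here</a>\n'
    '<abbr title="Not a link">abbr</abbr>'
)

# Same pattern as above, without the unneeded backslash before '<'.
pattern = r'<a[^>]+title="(.+?)"'

# findall returns only the text captured by the parentheses.
titles = re.findall(pattern, html)
print(titles)  # ['First link', 'Second link', 'Not a link']
```

The third anchor has no title attribute, and [^>]+ cannot run past the closing >, so it is skipped. The abbr tag is picked up because nothing in the pattern requires the tag name to end after the "a".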
>
> The problem with a greedy pattern is that it doesn't stop at the first
> closing quote it finds. To illustrate :
> >>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
> >>> a.group()
> '"this is a test" class="test"'
>
> It matches the first quote up to the last one.
> On the other hand, you can use the "?" modifier to specify a non-greedy
> pattern :
>
> >>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
> >>> b.group()
> '"this is a test"'
>
> It matches from the first quote and stops as soon as it reaches the
> second quote.
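
The greedy and non-greedy versions can be compared side by side in a short Python 3 sketch. The third pattern, a negated character class, is a common alternative that is not part of the original post; it is included only for comparison.

```python
import re

text = 'title="this is a test" class="test"'

# Greedy: .+ grabs as much as possible, so the match ends at the last quote.
greedy = re.search(r'".+"', text).group()

# Non-greedy: .+? grabs as little as possible, so the match ends at the next quote.
lazy = re.search(r'".+?"', text).group()

# Alternative (not from the original post): forbid quotes inside the match.
negated = re.search(r'"[^"]+"', text).group()

print(greedy)   # "this is a test" class="test"
print(lazy)     # "this is a test"
print(negated)  # "this is a test"
```

On this input the non-greedy pattern and the negated class give the same result; they can differ on other inputs, for example when the text between the quotes contains a newline.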
>
> Finally, the parentheses are used to indicate a capture group :
> >>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )
> >>> a.groups()
> ('is', 'test')
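
A small Python 3 sketch of how capture groups surface through the match object and through re.findall, reusing the same sample string. The attribute-matching pattern at the end is made up for illustration.

```python
import re

text = 'title="this is a test" class="test"'

m = re.search(r'"this (is) a (.+?)"', text)
whole = m.group(0)    # the entire match, quotes included
first = m.group(1)    # text captured by the first pair of parentheses
second = m.group(2)   # text captured by the second pair
both = m.groups()     # all captured groups as a tuple

# With groups in the pattern, findall returns the groups, not the whole match.
pairs = re.findall(r'(\w+)="(.+?)"', text)

# Without groups, findall returns the whole matched text.
raw = re.findall(r'\w+=".+?"', text)

print(whole, first, second, both)
print(pairs)  # [('title', 'this is a test'), ('class', 'test')]
print(raw)    # ['title="this is a test"', 'class="test"']
```

This is why the findall call at the top of the thread returns only the title text: the parentheses around .+? select what gets returned.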
>
>
> You can find detailed explanations about Python regular expressions at
> this page : https://docs.python.org/2/howto/regex.html
>
> HTH,
>
> -Grobu-
Wow! Awesome! I bookmarked that link!
Thanks for everything!!!