| Started by | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| First post | 2015-11-25 12:42 -0800 |
| Last post | 2015-11-26 14:49 +0000 |
| Articles | 17 — 7 participants |
Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 12:42 -0800
Re: Screen scraper to get all 'a title' elements MRAB <python@mrabarnett.plus.com> - 2015-11-25 20:55 +0000
Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-25 23:30 +0100
Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 14:48 -0800
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:06 +1100
Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 00:44 +0100
Re: Screen scraper to get all 'a title' elements Marko Rauhamaa <marko@pacujo.net> - 2015-11-26 01:53 +0200
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:59 +1100
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:54 +1100
Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 02:05 +0100
Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 00:33 +0100
Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 15:37 -0800
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:42 +1100
Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 14:04 -0800
Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 09:10 +1100
Re: Screen scraper to get all 'a title' elements TP <wingusr@gmail.com> - 2015-11-25 17:15 -0800
Re: Screen scraper to get all 'a title' elements Denis McMahon <denismfmcmahon@gmail.com> - 2015-11-26 14:49 +0000
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-25 12:42 -0800 |
| Subject | Screen scraper to get all 'a title' elements |
| Message-ID | <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com> |
Hello experts. I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
I'm trying to figure out how to list all 'a title' elements. For instance, I see the following:
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>
So, I tried putting a script together to get 'title'. Here's my attempt.
import requests
import sys
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
    print(link)
All that does is get the title of the page. I tried to get the links from that url, with this script.
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
That doesn't work either. Basically, I'd like to see this.
Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx
How can I do that?
Thanks all!!
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2015-11-25 20:55 +0000 |
| Message-ID | <mailman.96.1448484959.20593.python-list@python.org> |
| In reply to | #99484 |
On 2015-11-25 20:42, ryguy7272 wrote:
> All that does is get the title of the page. I tried to get the links from that url, with this script.
>
A 'title' element has the form "<title ...>". What you should be looking
for are 'a' elements, those of the form "<a ...>".
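To make the distinction concrete, here is a minimal sketch on a small inline snippet rather than the live Wikipedia page. The sample HTML and the variable names are made up for illustration, and passing "html.parser" explicitly is an assumption on my part, not something from the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: one <title> element and two <a> elements.
html = """
<html><head><title>Unusual place names</title></head>
<body>
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all("title") looks for <title> elements, so it only finds the page title.
page_titles = [t.get_text() for t in soup.find_all("title")]

# find_all("a") looks for <a> elements; "title" is then read as an attribute.
link_titles = [a.get("title") for a in soup.find_all("a")]

print(page_titles)
print(link_titles)
```

The first list holds the text of the page's single title element; the second holds the title attribute of each link, which is what the question was after.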
| From | Grobu <snailcoder@retrosite.invalid> |
|---|---|
| Date | 2015-11-25 23:30 +0100 |
| Message-ID | <n35ckk$9q0$1@dont-email.me> |
| In reply to | #99488 |
Hi
It seems that links on that Wikipedia page follow the structure :
<a href="..." title="...">
You could extract a list of link titles with something like :
re.findall( r'\<a[^>]+title="(.+?)"', html )
HTH,
-Grobu-
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-25 14:48 -0800 |
| Message-ID | <c1e43997-0da3-4b93-b9af-98a2568eff9d@googlegroups.com> |
| In reply to | #99498 |
On Wednesday, November 25, 2015 at 5:30:14 PM UTC-5, Grobu wrote:
> Hi
>
> It seems that links on that Wikipedia page follow the structure :
> <a href="..." title="...">
>
> You could extract a list of link titles with something like :
> re.findall( r'\<a[^>]+title="(.+?)"', html )
>
> HTH,
>
> -Grobu-
>
>
Thanks!! Is that regex? Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
Can you just please explain what it's doing???
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-11-26 10:06 +1100 |
| Message-ID | <mailman.103.1448492791.20593.python-list@python.org> |
| In reply to | #99502 |
On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
> Thanks!! Is that regex? Can you explain exactly what it is doing?
> Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
>
> Can you just please explain what it's doing???
It's a trap!
Don't use a regex to parse HTML, unless you're deliberately trying to
entice young and innocent programmers to the dark side.
ChrisA
| From | Grobu <snailcoder@retrosite.invalid> |
|---|---|
| Date | 2015-11-26 00:44 +0100 |
| Message-ID | <n35h0v$stn$1@dont-email.me> |
| In reply to | #99504 |
On 26/11/15 00:06, Chris Angelico wrote:
> On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
>> Thanks!! Is that regex? Can you explain exactly what it is doing?
>> Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
>>
>> Can you just please explain what it's doing???
>
> It's a trap!
>
> Don't use a regex to parse HTML, unless you're deliberately trying to
> entice young and innocent programmers to the dark side.
>
> ChrisA
>
Sorry, I wasn't aware of regex being on the dark side :-)
Now that you mention it, I suppose that their being complex and error-inducing could lead to broken code all too easily when there is a reliable, ready-made solution like BeautifulSoup.
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-11-26 01:53 +0200 |
| Message-ID | <87y4dl3abt.fsf@elektro.pacujo.net> |
| In reply to | #99508 |
Grobu <snailcoder@retrosite.invalid>:
> Sorry, I wasn't aware of regex being on the dark side :-)
No, regular expressions are great for many purposes. Parsing
context-free syntax isn't one of them. See:
<URL: https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy>
Most modern programming languages including HTML are context-free. Their
structure is too rich for regular expressions to capture.
Regular expressions can handle any regular language just fine. They are
commonly used to define the lexical tokens of a language.
Marko
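A small sketch of what "too rich for regular expressions" looks like in practice. The markup strings are hypothetical; the point is only that neither the greedy nor the non-greedy form of the same pattern can pair an opening tag with its own closing tag once elements nest or repeat.

```python
import re

# Hypothetical nested markup: an outer <div> containing an inner <div>.
nested = "<div>outer <div>inner</div> tail</div>"

# Non-greedy: stops at the first </div>, which closes the *inner* element,
# so the captured "content" of the outer element is cut short.
non_greedy = re.search(r"<div>(.*?)</div>", nested).group(1)

# Greedy: runs to the last </div>. With two sibling elements this swallows
# the boundary between them.
siblings = "<div>one</div><div>two</div>"
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)

print(repr(non_greedy))
print(repr(greedy))
```

Matching balanced open/close pairs needs some form of counting or a stack, which is what a parser provides.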
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-11-26 10:59 +1100 |
| Message-ID | <mailman.106.1448495963.20593.python-list@python.org> |
| In reply to | #99509 |
On Thu, Nov 26, 2015 at 10:53 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Regular expressions can handle any regular language just fine. They are
> commonly used to define the lexical tokens of a language.
Not sure about _defining_ them, but they're certainly often used to
_recognize_ them, eg in syntax highlighters.
ChrisA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-11-26 10:54 +1100 |
| Message-ID | <mailman.105.1448495654.20593.python-list@python.org> |
| In reply to | #99508 |
On Thu, Nov 26, 2015 at 10:44 AM, Grobu <snailcoder@retrosite.invalid> wrote:
> On 26/11/15 00:06, Chris Angelico wrote:
>>
>> On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
>>>
>>> Thanks!! Is that regex? Can you explain exactly what it is doing?
>>> Also, it seems to pick up a lot more than just the list I wanted, but
>>> that's ok, I can see why it does that.
>>>
>>> Can you just please explain what it's doing???
>>
>>
>> It's a trap!
>>
>> Don't use a regex to parse HTML, unless you're deliberately trying to
>> entice young and innocent programmers to the dark side.
>>
>> ChrisA
>>
>
> Sorry, I wasn't aware of regex being on the dark side :-)
> Now that you mention it, I suppose that their being complex and
> error-inducing could lead to broken code all too easily when there is a
> reliable, ready-made solution like BeautifulSoup.
Regular expressions have their uses, but parsing HTML is not one of
them. The most important use of a regex is letting an end user control
the search pattern; it's a compact language for describing a variety of
text search concepts. For hard-coded regular expressions, there are some
places where they're very good, and a lot of places where they're the
wrong tool for the job. And one of those wrong-tool-for-job places is
parsing stuff that fundamentally cannot be parsed with regexes, such as
HTML. You _need_ a proper parser, which is what Beautiful Soup is for.
ChrisA
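For readers without Beautiful Soup installed, the same "use a real parser" idea can be sketched with the standard library alone. This is not what the thread recommends, just an illustration; it assumes Python 3 (where the module is html.parser), and the class name LinkTitleCollector and the sample markup are made up.

```python
from html.parser import HTMLParser


class LinkTitleCollector(HTMLParser):
    """Collects the title attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)


# Hypothetical sample markup: attribute order varies, one link has no title.
html = (
    '<a href="/wiki/Accident,_Maryland" title="Accident, Maryland">Accident</a>'
    '<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>'
    '<a href="/no-title">none</a>'
)

collector = LinkTitleCollector()
collector.feed(html)
print(collector.titles)
```

The parser hands over tag names and attributes already separated, so the code never has to guess where a tag or a quoted value ends.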
| From | Grobu <snailcoder@retrosite.invalid> |
|---|---|
| Date | 2015-11-26 02:05 +0100 |
| Message-ID | <n35lo5$b06$1@dont-email.me> |
| In reply to | #99508 |
Chris, Marko, thank you both for your links and explanations!
| From | Grobu <snailcoder@retrosite.invalid> |
|---|---|
| Date | 2015-11-26 00:33 +0100 |
| Message-ID | <n35gc3$r1k$1@dont-email.me> |
| In reply to | #99502 |
On 25/11/15 23:48, ryguy7272 wrote:
>> re.findall( r'\<a[^>]+title="(.+?)"', html )
[ ... ]
> Thanks!! Is that regex? Can you explain exactly what it is doing?
> Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
>
> Can you just please explain what it's doing???
>
Yes it's a regular expression. Because regexes use the backslash as an
escape character, it is advisable to use the "raw string" prefix (r
before the single/double/triple quote). To illustrate it with an example :
>>> print "1\n2"
1
2
>>> print r"1\n2"
1\n2
As the backslash escape character is "neutralized" by the raw string,
you can use the usual RegEx syntax at leisure :
\<a[^>]+title="(.+?)"
\< was a mistake on my part, a single < is perfectly enough
[^>] is a class definition, and the caret (^) character indicates
negation. Thus it means : any character other than >
+ indicates repetition : one or more of the previous element
. will match just anything
.+" is a _greedy_ pattern that would match anything until it encountered
a double quote
The problem with a greedy pattern is that it doesn't stop at the first
match. To illustrate :
>>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
>>> a.group()
'"this is a test" class="test"'
It matches the first quote up to the last one.
On the other hand, you can use the "?" modifier to specify a non-greedy
pattern :
>>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
>>> b.group()
'"this is a test"'
It matches the first quote and stops looking for further matches after
the second quote.
Finally, the parentheses are used to indicate a capture group :
>>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )
>>> a.groups()
('is', 'test')
You can find detailed explanations about Python regular expressions at
this page : https://docs.python.org/2/howto/regex.html
HTH,
-Grobu-
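The pieces explained above can be gathered into one runnable sketch. It uses the same sample string as the message plus one made-up link, and is written with print() so it runs on Python 3. It only illustrates how the regex behaves; it is not a recommendation to parse real pages this way.

```python
import re

sample = 'title="this is a test" class="test"'

# Greedy: .+ runs as far as it can, so the match ends at the LAST quote.
greedy = re.search(r'".+"', sample).group()

# Non-greedy: .+? stops as soon as the rest of the pattern can match,
# so the match ends at the next quote.
lazy = re.search(r'".+?"', sample).group()

# Capture group: findall returns only what the parentheses captured.
# The link below is a hypothetical example in the page's style.
html = '<a href="/wiki/Alert,_Nunavut" title="Alert, Nunavut">Alert</a>'
titles = re.findall(r'<a[^>]+title="(.+?)"', html)

print(greedy)
print(lazy)
print(titles)
```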
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-25 15:37 -0800 |
| Message-ID | <e55c49e2-539d-4f62-89f7-34fa3882ff59@googlegroups.com> |
| In reply to | #99505 |
On Wednesday, November 25, 2015 at 6:34:00 PM UTC-5, Grobu wrote:
> You can find detailed explanations about Python regular expressions at
> this page : https://docs.python.org/2/howto/regex.html
>
> HTH,
>
> -Grobu-
Wow! Awesome! I bookmarked that link!
Thanks for everything!!!
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-11-26 10:42 +1100 |
| Message-ID | <mailman.104.1448494958.20593.python-list@python.org> |
| In reply to | #99506 |
On Thu, Nov 26, 2015 at 10:37 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
> Wow! Awesome! I bookmarked that link!
> Thanks for everything!!!
Also bookmark this link:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
And read it before you do any parsing of HTML using regular expressions.
ChrisA
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-25 14:04 -0800 |
| Message-ID | <a6f3a0a7-acc3-46db-a36b-c3d774293347@googlegroups.com> |
| In reply to | #99484 |
Ok, I guess that makes sense. So, I just tried the script below, and got nothing...
import requests
from bs4 import BeautifulSoup
r = requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names")
soup = BeautifulSoup(r.content)
print soup.find_all("a",{"title"})
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-11-26 09:10 +1100 |
| Message-ID | <mailman.99.1448489447.20593.python-list@python.org> |
| In reply to | #99494 |
On Thu, Nov 26, 2015 at 9:04 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
> Ok, I guess that makes sense. So, I just tried the script below, and got nothing...
>
> import requests
> from bs4 import BeautifulSoup
>
> r = requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names")
> soup = BeautifulSoup(r.content)
> print soup.find_all("a",{"title"})
The second argument to find_all is supposed to be a dict, not a set,
and it's only useful if you want to put some restriction on the
titles. To simply enumerate all the titles, try this:
[a.get("title") for a in soup.find_all("a")]
ChrisA
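A minimal sketch of that suggestion on an inline snippet, so it can be run without fetching the page. The sample HTML is adapted from the links quoted in the question; the explicit "html.parser" argument and the title=True filter are additions of mine, not part of the reply above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two links with a title attribute, one without.
html = """
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&amp;action=edit&amp;redlink=1">Ala-Lemu</a>
<a href="#top">no title here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# All <a> elements; a.get("title") is None where the attribute is missing.
all_titles = [a.get("title") for a in soup.find_all("a")]

# title=True keeps only <a> elements that actually have a title attribute.
with_title = soup.find_all("a", title=True)

# The list asked for in the question is the link text, not the title itself.
link_texts = [a.get_text() for a in with_title]

print(all_titles)
print(link_texts)
```

On the real page this would still return navigation and sidebar links, so some further narrowing (for example to the tables holding the place names) would be needed.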
| From | TP <wingusr@gmail.com> |
|---|---|
| Date | 2015-11-25 17:15 -0800 |
| Message-ID | <mailman.108.1448500596.20593.python-list@python.org> |
| In reply to | #99484 |
On Wed, Nov 25, 2015 at 12:42 PM, ryguy7272 <ryanshuell@gmail.com> wrote:
> Hello experts. I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
Wildly offtopic but interesting, easy way to grab/analyze Wikipedia data using F# instead of Python
http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/
In your particular case something like:
open FSharp.Data
let [<Literal>] wikiURL = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
type PlaceNamesProvider = HtmlProvider<wikiURL>
let placeNamesWiki = PlaceNamesProvider()
for row in placeNamesWiki.Tables.``Short & medium length names``.Rows do
    printfn "%s" row.Column1
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2015-11-26 14:49 +0000 |
| Message-ID | <n37663$h02$1@dont-email.me> |
| In reply to | #99484 |
On Wed, 25 Nov 2015 12:42:00 -0800, ryguy7272 wrote:
> Hello experts. I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> I'm trying to figure out how to list all 'a title' elements.
a is the element tag; title is an attribute of that element (an HTMLAnchorElement in DOM terms).
Combining bs4 with Python structures allows you to find all the specified
attributes of an element type, for example to find the class attributes
of all the paragraphs with a class attribute:
stuff = [p.attrs['class'] for p in soup.find_all('p') if 'class' in p.attrs]
Then you can do this
for thing in stuff:
    print thing
(Python 2.7)
This may be adaptable to your requirement.
--
Denis McMahon, denismfmcmahon@gmail.com
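One way the comprehension above might be adapted to the original requirement, swapping 'p' for 'a' and 'class' for 'title'. The sample HTML is hypothetical, "html.parser" is an assumed parser choice, and print() is used so the sketch runs on Python 3 rather than 2.7.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second link has no title attribute.
html = """
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>
<a href="#top">Back to top</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Same shape as the paragraph/class example: keep only <a> elements whose
# attribute dict contains 'title', and collect that attribute's value.
titles = [a.attrs["title"] for a in soup.find_all("a") if "title" in a.attrs]

for title in titles:
    print(title)
```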