Screen scraper to get all 'a title' elements

Started by: ryguy7272 <ryanshuell@gmail.com>
First post: 2015-11-25 12:42 -0800
Last post: 2015-11-26 14:49 +0000
Articles 17 — 7 participants



Contents

  Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 12:42 -0800
    Re: Screen scraper to get all 'a title' elements MRAB <python@mrabarnett.plus.com> - 2015-11-25 20:55 +0000
      Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-25 23:30 +0100
        Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 14:48 -0800
          Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:06 +1100
            Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 00:44 +0100
              Re: Screen scraper to get all 'a title' elements Marko Rauhamaa <marko@pacujo.net> - 2015-11-26 01:53 +0200
                Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:59 +1100
              Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:54 +1100
              Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 02:05 +0100
          Re: Screen scraper to get all 'a title' elements Grobu <snailcoder@retrosite.invalid> - 2015-11-26 00:33 +0100
            Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 15:37 -0800
              Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 10:42 +1100
    Re: Screen scraper to get all 'a title' elements ryguy7272 <ryanshuell@gmail.com> - 2015-11-25 14:04 -0800
      Re: Screen scraper to get all 'a title' elements Chris Angelico <rosuav@gmail.com> - 2015-11-26 09:10 +1100
    Re: Screen scraper to get all 'a title' elements TP <wingusr@gmail.com> - 2015-11-25 17:15 -0800
    Re: Screen scraper to get all 'a title' elements Denis McMahon <denismfmcmahon@gmail.com> - 2015-11-26 14:49 +0000

#99484 — Screen scraper to get all 'a title' elements

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-25 12:42 -0800
Subject: Screen scraper to get all 'a title' elements
Message-ID: <23ed6f4b-0ef2-4c9e-ade6-e597e7e03ca2@googlegroups.com>
Hello experts.  I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

I'm trying to figure out how to list all 'a title' elements.  For instance, I see the following:
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>

So, I tried putting a script together to get 'title'.  Here's my attempt.

import requests
import sys
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"     
source_code = requests.get(url) 
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
    print(link)

All that does is get the title of the page.  I tried to get the links from that url, with this script.

import urllib2
import re

#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')

#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links

That doesn't work either.  Basically, I'd like to see this.

Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx

How can I do that?
Thanks all!!



#99488

From: MRAB <python@mrabarnett.plus.com>
Date: 2015-11-25 20:55 +0000
Message-ID: <mailman.96.1448484959.20593.python-list@python.org>
In reply to: #99484
On 2015-11-25 20:42, ryguy7272 wrote:
> Hello experts.  I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> I'm trying to figure out how to list all 'a title' elements.  For instance, I see the following:
> <a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
> <a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
> <a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
> <a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>
>
> So, I tried putting a script together to get 'title'.  Here's my attempt.
>
> import requests
> import sys
> from bs4 import BeautifulSoup
>
> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> source_code = requests.get(url)
> plain_text = source_code.text
> soup = BeautifulSoup(plain_text)
> for link in soup.findAll('title'):
>      print(link)
>
> All that does is get the title of the page.  I tried to get the links from that url, with this script.
>
A 'title' element has the form "<title ...>". What you should be looking 
for are 'a' elements, those of the form "<a ...>".
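
To make the distinction concrete, here is a minimal sketch on a made-up snippet. It uses the standard library's html.parser rather than Beautiful Soup (an assumption, so that it runs with nothing installed), and the class name TagCounter is hypothetical. It counts <title> elements separately from <a> elements that carry a title attribute.

```python
from html.parser import HTMLParser

# Made-up snippet for illustration: one <title> element, two <a> elements,
# only one of which has a title attribute.
SAMPLE = (
    '<html><head><title>Unusual place names</title></head><body>'
    '<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>'
    '<a href="/wiki/Main_Page">Main page</a>'
    '</body></html>'
)


class TagCounter(HTMLParser):
    """Counts <title> elements and <a> elements, noting which <a> have a title."""

    def __init__(self):
        super().__init__()
        self.title_elements = 0
        self.anchors = 0
        self.anchors_with_title = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this start tag.
        if tag == "title":
            self.title_elements += 1
        elif tag == "a":
            self.anchors += 1
            if "title" in dict(attrs):
                self.anchors_with_title += 1


counter = TagCounter()
counter.feed(SAMPLE)
counter.close()
print(counter.title_elements, counter.anchors, counter.anchors_with_title)  # 1 2 1
```

Searching for 'title' as a tag name only ever finds the single page title; the place names live in the attributes of the 'a' elements.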



#99498

From: Grobu <snailcoder@retrosite.invalid>
Date: 2015-11-25 23:30 +0100
Message-ID: <n35ckk$9q0$1@dont-email.me>
In reply to: #99488
Hi

It seems that links on that Wikipedia page follow the structure :
<a href="..." title="...">

You could extract a list of link titles with something like :
re.findall( r'\<a[^>]+title="(.+?)"', html )

HTH,

-Grobu-




#99502

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-25 14:48 -0800
Message-ID: <c1e43997-0da3-4b93-b9af-98a2568eff9d@googlegroups.com>
In reply to: #99498
On Wednesday, November 25, 2015 at 5:30:14 PM UTC-5, Grobu wrote:
> Hi
> 
> It seems that links on that Wikipedia page follow the structure :
> <a href="..." title="...">
> 
> You could extract a list of link titles with something like :
> re.findall( r'\<a[^>]+title="(.+?)"', html )
> 
> HTH,
> 
> -Grobu-

Thanks!!  Is that regex?  Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.  

Can you just please explain what it's doing???



#99504

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-11-26 10:06 +1100
Message-ID: <mailman.103.1448492791.20593.python-list@python.org>
In reply to: #99502
On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
> Thanks!!  Is that regex?  Can you explain exactly what it is doing?
> Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
>
> Can you just please explain what it's doing???

It's a trap!

Don't use a regex to parse HTML, unless you're deliberately trying to
entice young and innocent programmers to the dark side.

ChrisA



#99508

From: Grobu <snailcoder@retrosite.invalid>
Date: 2015-11-26 00:44 +0100
Message-ID: <n35h0v$stn$1@dont-email.me>
In reply to: #99504
On 26/11/15 00:06, Chris Angelico wrote:
> On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
>> Thanks!!  Is that regex?  Can you explain exactly what it is doing?
>> Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
>>
>> Can you just please explain what it's doing???
>
> It's a trap!
>
> Don't use a regex to parse HTML, unless you're deliberately trying to
> entice young and innocent programmers to the dark side.
>
> ChrisA
>

Sorry, I wasn't aware of regex being on the dark side :-)
Now that you mention it, I suppose that their being complex and 
error-inducing could lead to broken code all too easily when there is a 
reliable, ready-made solution like BeautifulSoup.



#99509

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2015-11-26 01:53 +0200
Message-ID: <87y4dl3abt.fsf@elektro.pacujo.net>
In reply to: #99508
Grobu <snailcoder@retrosite.invalid>:

> Sorry, I wasn't aware of regex being on the dark side :-)

No, regular expressions are great for many purposes. Parsing
context-free syntax isn't one of them.

See:

  <URL: https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy>

Most modern programming languages including HTML are context-free. Their
structure is too rich for regular expressions to capture.

Regular expressions can handle any regular language just fine. They are
commonly used to define the lexical tokens of a language.
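
A small sketch of the nesting problem, on a made-up snippet. The DepthTracker class is hypothetical and uses the standard library's html.parser as a stand-in for any real HTML parser. The regex pairs the first <div> with the first </div>, while the parser keeps track of how deep it is.

```python
import re
from html.parser import HTMLParser

# Made-up input: one <div> nested inside another.
NESTED = "<div>outer <div>inner</div> tail</div>"

# A regex has no notion of depth: the non-greedy group ends at the first
# closing tag, even though that tag belongs to the inner <div>.
regex_match = re.search(r"<div>(.*?)</div>", NESTED).group(1)
print(regex_match)  # outer <div>inner


class DepthTracker(HTMLParser):
    """Tracks how deeply <div> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


tracker = DepthTracker()
tracker.feed(NESTED)
tracker.close()
print(tracker.max_depth, tracker.depth)  # 2 0
```

Switching the regex to a greedy group fixes this one input but breaks on two sibling <div> elements, which is the usual sign that the pattern language is too weak for the structure.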


Marko



#99511

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-11-26 10:59 +1100
Message-ID: <mailman.106.1448495963.20593.python-list@python.org>
In reply to: #99509
On Thu, Nov 26, 2015 at 10:53 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Regular expressions can handle any regular language just fine. They are
> commonly used to define the lexical tokens of a language.

Not sure about _defining_ them, but they're certainly often used to
_recognize_ them, eg in syntax highlighters.

ChrisA



#99510

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-11-26 10:54 +1100
Message-ID: <mailman.105.1448495654.20593.python-list@python.org>
In reply to: #99508
On Thu, Nov 26, 2015 at 10:44 AM, Grobu <snailcoder@retrosite.invalid> wrote:
> Sorry, I wasn't aware of regex being on the dark side :-)
> Now that you mention it, I suppose that their being complex and
> error-inducing could lead to broken code all too easily when there is a
> reliable, ready-made solution like BeautifulSoup.

Regular expressions have their uses, but parsing HTML is not one of
them. The most important use of a regex is letting an end user control
the search pattern; it's a compact language for describing a variety
of text search concepts. For hard-coded regular expressions, there are
some places where they're very good, and a lot of places where they're
the wrong tool for the job. And one of those wrong-tool-for-job places
is parsing stuff that fundamentally cannot be parsed with regexes,
such as HTML. You _need_ a proper parser, which is what Beautiful Soup
is for.
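
As a rough illustration of where the regex from earlier in the thread goes wrong, here is a sketch on made-up input. The TitleCollector class is hypothetical, and it uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs with nothing installed. The input has a single-quoted attribute, an entity inside a title, and an anchor inside an HTML comment.

```python
import re
from html.parser import HTMLParser

# Made-up input with three cases that are all ordinary HTML.
TRICKY = (
    "<a href='/wiki/Alert,_Nunavut' title='Alert, Nunavut'>Alert</a>\n"
    '<a href="/wiki/A" title="Tom &amp; Jerry">T</a>\n'
    '<!-- <a href="/old" title="Commented out">old</a> -->\n'
)

# The pattern suggested earlier in the thread (without the stray backslash).
regex_titles = re.findall(r'<a[^>]+title="(.+?)"', TRICKY)


class TitleCollector(HTMLParser):
    """Collects the title attribute of every real <a> start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)


collector = TitleCollector()
collector.feed(TRICKY)
collector.close()

print(regex_titles)      # ['Tom &amp; Jerry', 'Commented out']
print(collector.titles)  # ['Alert, Nunavut', 'Tom & Jerry']
```

The regex misses the single-quoted title, leaves the entity undecoded, and reports a link that is commented out. The parser handles all three without any special casing.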

ChrisA



#99516

From: Grobu <snailcoder@retrosite.invalid>
Date: 2015-11-26 02:05 +0100
Message-ID: <n35lo5$b06$1@dont-email.me>
In reply to: #99508
Chris, Marko, thank you both for your links and explanations!



#99505

From: Grobu <snailcoder@retrosite.invalid>
Date: 2015-11-26 00:33 +0100
Message-ID: <n35gc3$r1k$1@dont-email.me>
In reply to: #99502
On 25/11/15 23:48, ryguy7272 wrote:
>> re.findall( r'\<a[^>]+title="(.+?)"', html )
[ ... ]
> Thanks!!  Is that regex?  Can you explain exactly what it is doing?
> Also, it seems to pick up a lot more than just the list I wanted, but that's ok, I can see why it does that.
>
> Can you just please explain what it's doing???
>

Yes, it's a regular expression. Because regexes use the backslash as an 
escape character, it is advisable to use the "raw string" prefix (an r 
before the opening single, double, or triple quote). To illustrate with an example:
	>>> print "1\n2"
	1
	2
	>>> print r"1\n2"
	1\n2
As the backslash escape character is "neutralized" by the raw string, 
you can use the usual RegEx syntax at leisure :

\<a[^>]+title="(.+?)"

\<	was a mistake on my part; a single < is perfectly enough
[^>]	is a character class, and the caret (^) character indicates 
negation. Thus it means: any character other than >
+	indicates repetition: one or more of the previous element
.	matches any single character (except a newline, by default)
.+"	is a _greedy_ pattern: it matches as much as it can, so it runs up to 
the last double quote it can reach

The problem with a greedy pattern is that it doesn't stop at the first 
closing quote it could match. To illustrate:
 >>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
 >>> a.group()
'"this is a test" class="test"'

It matches the first quote up to the last one.
On the other hand, you can use the "?" modifier to specify a non-greedy 
pattern :

 >>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
 >>> b.group()
'"this is a test"'

It matches from the first quote and stops as soon as it reaches the 
second quote.

Finally, the parentheses are used to indicate a capture group :
 >>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )
 >>> a.groups()
('is', 'test')
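
Putting the pieces together, here is a minimal Python 3 sketch that applies the whole pattern (without the stray backslash) to two of the anchors quoted at the top of the thread. The variable names are only for illustration.

```python
import re

# Two of the anchors quoted in the original post.
html = (
    '<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>\n'
    '<a class="new" title="Ala-Lemu (page does not exist)" '
    'href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>\n'
)

# <a        literal start of an anchor tag
# [^>]+     one or more characters that are not >, so we stay inside the tag
# title="   literal attribute name and opening quote
# (.+?)     capture group, non-greedy: stops at the first closing quote
pattern = re.compile(r'<a[^>]+title="(.+?)"')

# With exactly one capture group, findall returns the captured strings.
titles = pattern.findall(html)
print(titles)  # ['Accident, Maryland', 'Ala-Lemu (page does not exist)']
```

Note that this yields the title attributes, not the link text, and on the real page it also picks up every other link that happens to have a title.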


You can find detailed explanations about Python regular expressions at 
this page : https://docs.python.org/2/howto/regex.html

HTH,

-Grobu-



#99506

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-25 15:37 -0800
Message-ID: <e55c49e2-539d-4f62-89f7-34fa3882ff59@googlegroups.com>
In reply to: #99505
Wow!  Awesome!  I bookmarked that link!  
Thanks for everything!!!



#99507

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-11-26 10:42 +1100
Message-ID: <mailman.104.1448494958.20593.python-list@python.org>
In reply to: #99506
On Thu, Nov 26, 2015 at 10:37 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
> Wow!  Awesome!  I bookmarked that link!
> Thanks for everything!!!

Also bookmark this link:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

And read it before you do any parsing of HTML using regular expressions.

ChrisA



#99494

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-25 14:04 -0800
Message-ID: <a6f3a0a7-acc3-46db-a36b-c3d774293347@googlegroups.com>
In reply to: #99484
Ok, I guess that makes sense.  So, I just tried the script below, and got nothing...

import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names")
soup = BeautifulSoup(r.content)
print soup.find_all("a",{"title"})



#99496

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-11-26 09:10 +1100
Message-ID: <mailman.99.1448489447.20593.python-list@python.org>
In reply to: #99494
On Thu, Nov 26, 2015 at 9:04 AM, ryguy7272 <ryanshuell@gmail.com> wrote:
> Ok, I guess that makes sense.  So, I just tried the script below, and got nothing...
>
> import requests
> from bs4 import BeautifulSoup
>
> r = requests.get("https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names")
> soup = BeautifulSoup(r.content)
> print soup.find_all("a",{"title"})

The second argument to find_all is supposed to be a dict, not a set,
and it's only useful if you want to put some restriction on the
titles. To simply enumerate all the titles, try this:

[a.get("title") for a in soup.find_all("a")]
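
One thing to expect from that comprehension: a.get("title") behaves like dict.get, so anchors without a title attribute contribute None. Here is a minimal sketch of the same idea on a made-up snippet. It uses the standard library's html.parser instead of Beautiful Soup (an assumption, so that it runs with nothing installed), and the class name AnchorTitles is hypothetical.

```python
from html.parser import HTMLParser

# Made-up snippet: two anchors with a title attribute, one without.
SAMPLE = (
    '<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>'
    '<a href="#top">Back to top</a>'
    '<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>'
)


class AnchorTitles(HTMLParser):
    """Records the title attribute of every <a>, or None when it is absent."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # dict.get returns None for a missing key, mirroring a.get("title").
            self.titles.append(dict(attrs).get("title"))


parser = AnchorTitles()
parser.feed(SAMPLE)
parser.close()

all_titles = parser.titles
present = [t for t in all_titles if t is not None]  # drop anchors with no title

print(all_titles)  # ['Alert, Nunavut', None, 'Apocalypse Peaks']
print(present)     # ['Alert, Nunavut', 'Apocalypse Peaks']
```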

ChrisA



#99517

From: TP <wingusr@gmail.com>
Date: 2015-11-25 17:15 -0800
Message-ID: <mailman.108.1448500596.20593.python-list@python.org>
In reply to: #99484
On Wed, Nov 25, 2015 at 12:42 PM, ryguy7272 <ryanshuell@gmail.com> wrote:
> Hello experts.  I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

Wildly offtopic but interesting, easy way to grab/analyze Wikipedia
data using F# instead of Python
http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/

In your particular case something like:

open FSharp.Data
let [<Literal>] wikiURL = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
type PlaceNamesProvider = HtmlProvider<wikiURL>

let placeNamesWiki = PlaceNamesProvider()
for row in placeNamesWiki.Tables.``Short & medium length names``.Rows do
  printfn "%s" row.Column1



#99580

From: Denis McMahon <denismfmcmahon@gmail.com>
Date: 2015-11-26 14:49 +0000
Message-ID: <n37663$h02$1@dont-email.me>
In reply to: #99484
On Wed, 25 Nov 2015 12:42:00 -0800, ryguy7272 wrote:

> Hello experts.  I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
> 
> I'm trying to figure out how to list all 'a title' elements.

a is the element tag; title is an attribute of the HTMLAnchorElement.

Combining bs4 with Python structures lets you collect a given attribute 
from every element of one type. For example, to get the class attributes 
of all the paragraphs that have a class attribute:

stuff = [p.attrs['class'] for p in soup.find_all('p') if 'class' in p.attrs]

Then you can do this

for thing in stuff:
    print thing

(Python 2.7)

This may be adaptable to your requirement.
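
One possible adaptation, as a sketch only: collect the link text of every anchor that has a title attribute, which is the list the original post asked for. In bs4 terms that would presumably be a comprehension such as [a.get_text() for a in soup.find_all('a') if 'title' in a.attrs] (untested here). The runnable version below swaps in the standard library's html.parser so that nothing needs installing; the class name TitledLinkText and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

# Made-up snippet: two titled anchors from the original post, one untitled.
SAMPLE = (
    '<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>\n'
    '<a href="/wiki/Main_Page">Main page</a>\n'
    '<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>\n'
)


class TitledLinkText(HTMLParser):
    """Collects the visible text of each <a> that has a title attribute."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # text chunks while inside a titled <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "title" in dict(attrs):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.names.append("".join(self._buffer).strip())
            self._buffer = None


collector = TitledLinkText()
collector.feed(SAMPLE)
collector.close()

for name in collector.names:
    print(name)
# Accident
# Apocalypse Peaks
```

On the real page this would still include navigation links that carry a title, so some further filtering (for example by the containing table or list) is likely needed.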

-- 
Denis McMahon, denismfmcmahon@gmail.com
