


Does Python allow variables to be passed into function for dynamic screen scraping?

Started by: ryguy7272 <ryanshuell@gmail.com>
First post: 2015-11-28 14:03 -0800
Last post: 2015-11-28 20:52 -0800
Articles: 6 (3 participants)



Contents

  Does Python allow variables to be passed into function for dynamic screen scraping? ryguy7272 <ryanshuell@gmail.com> - 2015-11-28 14:03 -0800
    Re: Does Python allow variables to be passed into function for dynamic screen scraping? Laura Creighton <lac@openend.se> - 2015-11-28 23:28 +0100
      Re: Does Python allow variables to be passed into function for dynamic screen scraping? ryguy7272 <ryanshuell@gmail.com> - 2015-11-28 14:37 -0800
        Re: Does Python allow variables to be passed into function for dynamic screen scraping? Laura Creighton <lac@openend.se> - 2015-11-28 23:44 +0100
    Re: Does Python allow variables to be passed into function for dynamic screen scraping? Steven D'Aprano <steve@pearwood.info> - 2015-11-29 12:58 +1100
      Re: Does Python allow variables to be passed into function for dynamic screen scraping? ryguy7272 <ryanshuell@gmail.com> - 2015-11-28 20:52 -0800

#99678 — Does Python allow variables to be passed into function for dynamic screen scraping?

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-28 14:03 -0800
Subject: Does Python allow variables to be passed into function for dynamic screen scraping?
Message-ID: <e13afc4b-ac4e-4a75-bca6-1c7be9399cb6@googlegroups.com>

I'm looking at this URL.
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

If I hit F12 I can see tags such as these:
<a title=
<a class=
And so on and so forth.  

I'm wondering if someone can share a script, or a function, that will allow me to pass in variables and download (or simply print) the results.  I saw a sample online that I thought would work, and I made a few modifications but now I keep getting a message that says: ValueError: All objects passed were None

Here's the script that I'm playing around with.

import requests
import pandas as pd
from bs4 import BeautifulSoup

#Get the relevant webpage set the data up for parsing
url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
r = requests.get(url)
soup=BeautifulSoup(r.content,"lxml")

#set up a function to parse the "soup" for each category of information and put it in a DataFrame
def get_match_info(soup,tag,class_name):
    info_array=[]
    for info in soup.find_all('%s'%tag,attrs={'class':'%s'%class_name}):
        return pd.DataFrame(info_array)

#for each category pass the above function the relevant information i.e. tag names
tag1 = get_match_info(soup,"td","title")
tag2 = get_match_info(soup,"td","class")

#Concatenate the DataFrames to present a final table of all the above info 
match_info = pd.concat([tag1,tag2],ignore_index=False,axis=1)

print match_info

I'd greatly appreciate any help with this.



#99680

From: Laura Creighton <lac@openend.se>
Date: 2015-11-28 23:28 +0100
Message-ID: <mailman.1.1448749716.14615.python-list@python.org>
In reply to: #99678


Post your error traceback.  If you are getting ValueErrors about None,
then probably something you expect to return a match isn't returning
one.  But without the actual error, we cannot help much.

Laura
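
One plausible reading of the error, offered as a sketch rather than a diagnosis: in the posted get_match_info the `return` sits inside the `for` loop and nothing is ever appended to info_array. If find_all matches nothing, the loop body never runs and the function returns None; pd.concat then raises "ValueError: All objects passed were None" when every item it receives is None. Note also that the calls search for <td> elements whose class is "title" or "class", while the tags seen with F12 were <a> elements carrying title and class attributes. The version below shows the intended collect-then-return shape. It is Python 3 and, as an assumption made so that it runs without extra installs, uses the standard library's html.parser in place of BeautifulSoup and pandas. TagTextCollector, SAMPLE_HTML and the class names in it are made up for illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text of every <tag> element that carries class_name.

    Hypothetical helper: a stand-in for soup.find_all(tag, class_=...).
    It does not handle an element nested inside another of the same tag.
    """

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.inside = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self.inside = True
            self.matches.append("")

    def handle_data(self, data):
        if self.inside:
            self.matches[-1] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False


def get_match_info(html, tag, class_name):
    """Return the text of all matching elements; an empty list if none match."""
    collector = TagTextCollector(tag, class_name)
    collector.feed(html)
    collector.close()
    # The return is outside any loop, so the caller always gets a list,
    # never None, even when nothing matched.
    return [text.strip() for text in collector.matches]


# Made-up sample markup, not taken from the Wikipedia page.
SAMPLE_HTML = (
    '<p><a class="mw-redirect" title="Boring, Oregon">Boring</a> and '
    '<a class="external text">Hell</a></p><table><tr><td>plain</td></tr></table>'
)

print(get_match_info(SAMPLE_HTML, "a", "mw-redirect"))
print(get_match_info(SAMPLE_HTML, "td", "title"))
```

The second call finds no <td class="title"> and returns an empty list, which a caller can test for before building a table.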



#99681

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-28 14:37 -0800
Message-ID: <48f7bb74-93f0-4bf8-b781-e7f4b2daf032@googlegroups.com>
In reply to: #99680


Ok.  How do I post the error traceback?  I'm using Spyder with Python 2.7.



#99682

From: Laura Creighton <lac@openend.se>
Date: 2015-11-28 23:44 +0100
Message-ID: <mailman.2.1448750672.14615.python-list@python.org>
In reply to: #99681

In a message of Sat, 28 Nov 2015 14:37:26 -0800, ryguy7272 writes:
>Ok.  How do I post the error traceback?  I'm using Spyder Python 2.7.

You copy it from wherever you are reading it and paste it into the
email, along with your code, also copied and pasted from somewhere
(like an editor).  That way we get the exact code that caused the exact
traceback you are getting.

Laura
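
If copying from the console is awkward, the traceback can also be captured as plain text with the standard library's traceback module and then pasted into a message. This is only a sketch: run_and_capture and broken_step are hypothetical names, and the failing call stands in for whatever line actually raises in the real script. It runs on both Python 2.7 and Python 3.

```python
import traceback


def run_and_capture(func, *args):
    """Call func(*args); return the traceback text if it raises, else None."""
    try:
        func(*args)
    except Exception:
        # format_exc() returns the same text the interpreter would print,
        # starting with "Traceback (most recent call last):".
        return traceback.format_exc()
    return None


def broken_step():
    # Hypothetical stand-in for the failing part of a script.
    return int("not a number")


report = run_and_capture(broken_step)
print(report)
```

Pasting the full text of `report`, not just its last line, shows which line of which function raised the error.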



#99684

From: Steven D'Aprano <steve@pearwood.info>
Date: 2015-11-29 12:58 +1100
Message-ID: <565a5bd5$0$1606$c3e8da3$5496439d@news.astraweb.com>
In reply to: #99678

On Sun, 29 Nov 2015 09:03 am, ryguy7272 wrote:

> I'm looking at this URL.
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

Don't screen-scrape Wikipedia. Just don't. They have an official API for
downloading content; use it. There's even a Python library for downloading
from Wikipedia and other MediaWiki sites:

https://www.mediawiki.org/wiki/Manual:Pywikibot

Wikimedia does a fantastic job, for free, and automated screen-scraping
hurts their ability to provide that service. It is rude and anti-social.
Please don't do it.



-- 
Steven



#99687

From: ryguy7272 <ryanshuell@gmail.com>
Date: 2015-11-28 20:52 -0800
Message-ID: <2e25f82c-1ed6-4c56-836d-8f9a25990ea9@googlegroups.com>
In reply to: #99684


Thanks, Steven.  Do you know of a good tutorial for learning about the Wikipedia APIs?  I'm not sure where to get started on this topic.  I did some Google searches but didn't find much useful info.
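
As one possible starting point, here is a minimal sketch that asks the MediaWiki Action API (the api.php endpoint) for the links on a page instead of parsing the rendered HTML. It is not the Pywikibot approach recommended above; it uses only the Python 3 standard library. The function names, the User-Agent string and the contact address are placeholders, and the response layout handled by extract_link_titles is the default JSON format for action=query with prop=links, with continuation of long result sets left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_links_query(title, limit=50):
    """Build the query parameters that ask for the links on one page."""
    return {
        "action": "query",
        "prop": "links",
        "titles": title,
        "pllimit": str(limit),
        "format": "json",
    }


def extract_link_titles(response):
    """Pull the linked page titles out of a decoded API response."""
    titles = []
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        # A page with no links (or a missing page) has no "links" key.
        for link in page.get("links", []):
            titles.append(link["title"])
    return titles


def fetch_link_titles(title, limit=50):
    """Make one API request and return the link titles it reports."""
    url = API_URL + "?" + urlencode(build_links_query(title, limit))
    # Placeholder User-Agent; replace it with something that identifies you.
    headers = {"User-Agent": "example-script/0.1 (contact: you@example.com)"}
    with urlopen(Request(url, headers=headers)) as reply:
        return extract_link_titles(json.load(reply))


# Example call (makes a network request, so it is left commented out):
# print(fetch_link_titles("Wikipedia:Unusual place names"))
```

Keeping the request and the parsing in separate functions means the parsing can be tried on a saved response without contacting the server again.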


