Groups > comp.lang.python > #99678 > unrolled thread
| Started by | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| First post | 2015-11-28 14:03 -0800 |
| Last post | 2015-11-28 20:52 -0800 |
| Articles | 6 — 3 participants |
Does Python allow variables to be passed into function for dynamic screen scraping? ryguy7272 <ryanshuell@gmail.com> - 2015-11-28 14:03 -0800
Re: Does Python allow variables to be passed into function for dynamic screen scraping? Laura Creighton <lac@openend.se> - 2015-11-28 23:28 +0100
Re: Does Python allow variables to be passed into function for dynamic screen scraping? ryguy7272 <ryanshuell@gmail.com> - 2015-11-28 14:37 -0800
Re: Does Python allow variables to be passed into function for dynamic screen scraping? Laura Creighton <lac@openend.se> - 2015-11-28 23:44 +0100
Re: Does Python allow variables to be passed into function for dynamic screen scraping? Steven D'Aprano <steve@pearwood.info> - 2015-11-29 12:58 +1100
Re: Does Python allow variables to be passed into function for dynamic screen scraping? ryguy7272 <ryanshuell@gmail.com> - 2015-11-28 20:52 -0800
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-28 14:03 -0800 |
| Subject | Does Python allow variables to be passed into function for dynamic screen scraping? |
| Message-ID | <e13afc4b-ac4e-4a75-bca6-1c7be9399cb6@googlegroups.com> |
I'm looking at this URL.
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
If I hit F12 I can see tags such as these:
<a title=
<a class=
And so on and so forth.
I'm wondering if someone can share a script, or a function, that will allow me to pass in variables and download (or simply print) the results. I saw a sample online that I thought would work, and I made a few modifications but now I keep getting a message that says: ValueError: All objects passed were None
Here's the script that I'm playing around with.
import requests
import pandas as pd
from bs4 import BeautifulSoup
#Get the relevant webpage set the data up for parsing
url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
r = requests.get(url)
soup=BeautifulSoup(r.content,"lxml")
#set up a function to parse the "soup" for each category of information and put it in a DataFrame
def get_match_info(soup,tag,class_name):
    info_array=[]
    for info in soup.find_all('%s'%tag,attrs={'class':'%s'%class_name}):
        return pd.DataFrame(info_array)
#for each category pass the above function the relevant information i.e. tag names
tag1 = get_match_info(soup,"td","title")
tag2 = get_match_info(soup,"td","class")
#Concatenate the DataFrames to present a final table of all the above info
match_info = pd.concat([tag1,tag2],ignore_index=False,axis=1)
print match_info
I'd greatly appreciate any help with this.
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-11-28 23:28 +0100 |
| Message-ID | <mailman.1.1448749716.14615.python-list@python.org> |
| In reply to | #99678 |
Post your error traceback. If you are getting Value Errors about None,
then probably something you expect to return a match, isn't. But without
the actual error, we cannot help much.
Laura
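A minimal sketch of one likely cause, assuming the `return` in the posted function is the body of the `for` loop and that `find_all` found no `<td>` elements with the requested class. In that case the loop never runs, the function falls off the end and returns `None`, and `pd.concat` raises `ValueError: All objects passed were None` when every item in its list is `None`. The function names below are hypothetical, and a plain list stands in for the `find_all` results so that the sketch runs without `requests`, `bs4`, or `pandas`.

```python
def rows_return_inside_loop(matches):
    """Mirrors the assumed shape of the posted function."""
    rows = []
    for match in matches:
        # Only reached when there is at least one match, and it exits
        # on the first iteration before anything has been collected.
        return rows
    # No matches: execution falls through and the result is None.


def rows_return_after_loop(matches):
    """Collect every match, then return once the loop has finished."""
    rows = []
    for match in matches:
        rows.append(match)
    return rows


no_matches = []  # stands in for a find_all() call that matched nothing
print(rows_return_inside_loop(no_matches))
print(rows_return_after_loop(no_matches))
```

With the `return` moved after the loop, an empty search yields an empty list instead of `None`, which makes it easier to see that the tag and class being searched for are not present on the page.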
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-28 14:37 -0800 |
| Message-ID | <48f7bb74-93f0-4bf8-b781-e7f4b2daf032@googlegroups.com> |
| In reply to | #99680 |
Ok. How do I post the error traceback? I'm using Spyder Python 2.7.
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-11-28 23:44 +0100 |
| Message-ID | <mailman.2.1448750672.14615.python-list@python.org> |
| In reply to | #99681 |
You cut and paste it out of wherever you are reading it, and paste it
into the email, along with your code, also cut and pasted from somewhere
(like an editor). That way we get the exact code that caused the exact
traceback you are getting.
Laura
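If copying from the console is awkward, the traceback can also be captured as text from within the script. The following is a minimal sketch using the standard library `traceback` module. The helper `run_and_capture` and the stand-in `failing_step` are hypothetical names introduced for illustration and are not part of the posted script.

```python
import traceback


def run_and_capture(func, *args):
    """Call func(*args); return the full traceback text if it raises, else None."""
    try:
        func(*args)
    except Exception:
        # format_exc() returns the same text the interpreter would print.
        return traceback.format_exc()
    return None


def failing_step():
    # Stand-in for the failing call in the real script.
    raise ValueError("All objects passed were None")


report = run_and_capture(failing_step)
print(report)
```

The printed text starts with `Traceback (most recent call last):` and ends with the exception line, and it can be pasted into a message as-is.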
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2015-11-29 12:58 +1100 |
| Message-ID | <565a5bd5$0$1606$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #99678 |
On Sun, 29 Nov 2015 09:03 am, ryguy7272 wrote:
> I'm looking at this URL.
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
Don't screen-scrape Wikipedia. Just don't. They have an official API for
downloading content, use it. There's even a Python library for downloading
from Wikipedia and other Mediawiki sites:
https://www.mediawiki.org/wiki/Manual:Pywikibot
Wikimedia does a fantastic job, for free, and automated screen-scraping
hurts their ability to provide that service. It is rude and anti-social.
Please don't do it.
--
Steven
| From | ryguy7272 <ryanshuell@gmail.com> |
|---|---|
| Date | 2015-11-28 20:52 -0800 |
| Message-ID | <2e25f82c-1ed6-4c56-836d-8f9a25990ea9@googlegroups.com> |
| In reply to | #99684 |
On Saturday, November 28, 2015 at 8:59:04 PM UTC-5, Steven D'Aprano wrote:
> On Sun, 29 Nov 2015 09:03 am, ryguy7272 wrote:
>
> > I'm looking at this URL.
> > https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> Don't screen-scrape Wikipedia. Just don't. They have an official API for
> downloading content, use it. There's even a Python library for downloading
> from Wikipedia and other Mediawiki sites:
>
> https://www.mediawiki.org/wiki/Manual:Pywikibot
>
> Wikimedia does a fantastic job, for free, and automated screen-scraping
> hurts their ability to provide that service. It is rude and anti-social.
> Please don't do it.
>
> --
> Steven
Thanks Steven. Do you know of a good tutorial for learning about Wikipedia APIs? I'm not sure where to get started on this topic. I did some Google searches, but didn't come up with a lot of useful info...not much actually...
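As a possible starting point, here is a minimal sketch of asking the MediaWiki Action API (`https://en.wikipedia.org/w/api.php`) for the links on a page instead of parsing the rendered HTML. It is written for Python 3 and uses only the standard library. The helper names, the User-Agent string, and the `sample` response are assumptions made up for illustration; the sample only imitates the general shape of a `prop=links` reply and is not real data. `fetch_links` is defined but not called here, so the sketch runs without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query(title):
    """Parameters asking the Action API for the links on one page."""
    return {
        "action": "query",
        "prop": "links",
        "titles": title,
        "pllimit": "max",
        "format": "json",
    }


def extract_link_titles(response):
    """Pull the link titles out of a decoded prop=links JSON response."""
    titles = []
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("links", []):
            titles.append(link["title"])
    return titles


def fetch_links(title):
    """Fetch one batch of link titles (needs network access)."""
    url = API_URL + "?" + urlencode(build_query(title))
    # Placeholder User-Agent; replace it with one that identifies your script.
    request = Request(url, headers={"User-Agent": "example-script/0.1"})
    with urlopen(request) as reply:
        return extract_link_titles(json.loads(reply.read().decode("utf-8")))


# Invented response in the general shape the API returns, for illustration only.
sample = {
    "query": {
        "pages": {
            "12345": {
                "title": "Wikipedia:Unusual place names",
                "links": [
                    {"ns": 0, "title": "Example A"},
                    {"ns": 0, "title": "Example B"},
                ],
            }
        }
    }
}
print(extract_link_titles(sample))
```

A single request returns only one batch of links. For a long page the reply also carries a `continue` object whose values have to be sent back in a follow-up request to get the rest, which this sketch does not do.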