Groups > comp.lang.python > #30615 > unrolled thread
| Started by | রুদ্র ব্যাণার্জী <bnrj.rudra@gmail.com> |
|---|---|
| First post | 2012-10-01 17:44 +0100 |
| Last post | 2012-10-01 14:09 -0400 |
| Articles | 5 — 4 participants |
get google scholar using python রুদ্র ব্যাণার্জী <bnrj.rudra@gmail.com> - 2012-10-01 17:44 +0100
RE: get google scholar using python Nick Cash <nick.cash@npcinternational.com> - 2012-10-01 16:51 +0000
Re: get google scholar using python Grant Edwards <invalid@invalid.invalid> - 2012-10-01 17:19 +0000
Re: get google scholar using python রুদ্র ব্যাণার্জী <bnrj.rudra@gmail.com> - 2012-10-01 18:28 +0100
Re: get google scholar using python Jerry Hill <malaclypse2@gmail.com> - 2012-10-01 14:09 -0400
| From | রুদ্র ব্যাণার্জী <bnrj.rudra@gmail.com> |
|---|---|
| Date | 2012-10-01 17:44 +0100 |
| Subject | get google scholar using python |
| Message-ID | <1349109859.27817.7.camel@roddur> |
When I try to access a Google Scholar search result using Python, I
get the following error (403):
$ python
Python 2.7.3 (default, Jul 24 2012, 10:05:38)
[GCC 4.7.0 20120507 (Red Hat 4.7.0-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from HTMLParser import HTMLParser
>>> import urllib2
>>> response = urllib2.urlopen('http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp=')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>
Could you kindly explain how to get rid of this?
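The traceback above is an uncaught HTTP error raised by the urlopen call. As a minimal sketch, assuming Python 3 (where urllib2 was split into urllib.request and urllib.error), the status can be caught and inspected instead of being left to propagate. The helper name `fetch_status` and its `opener` parameter are hypothetical and only serve the illustration:

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status_code, reason) for url without raising on HTTP errors.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code and reason sent by the server,
        # e.g. 403 and "Forbidden" for the request shown above.
        return err.code, err.reason
    with response:
        return response.status, response.reason
```

This only reports the refusal; it does not change how the server treats the request.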
| From | Nick Cash <nick.cash@npcinternational.com> |
|---|---|
| Date | 2012-10-01 16:51 +0000 |
| Message-ID | <mailman.1710.1349110326.27098.python-list@python.org> |
| In reply to | #30615 |
> urllib2.urlopen('http://scholar.google.co.uk/scholar?q=albert
>...
> urllib2.HTTPError: HTTP Error 403: Forbidden
> >>>
>
> Will you kindly explain me the way to get rid of this?
Looks like Google blocks non-browser user agents from retrieving this query. You *could* work around it by setting the User-Agent header to something fake that looks browser-ish, but you're almost certainly breaking Google's TOS if you do so.
Should you really really want to, urllib2 makes it easy:
urllib2.urlopen(urllib2.Request("http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp=", headers={"User-Agent":"Mozilla/5.0 Cheater/1.0"}))
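For readers on Python 3, a rough equivalent of the one-liner above might look like the following sketch. It assumes urllib.request in place of urllib2, and it only builds the request object; nothing is sent to the server. The names `URL`, `default_agent` and `request` are illustrative. The terms-of-service caveat above applies unchanged.

```python
import urllib.request

URL = ("http://scholar.google.co.uk/scholar"
       "?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp=")

# With no explicit header, urllib announces itself as "Python-urllib/<version>",
# which is the kind of non-browser user agent described above.
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]

# Python 3 spelling of the Request from the post; the header is stored on the
# object and would only go out if urllib.request.urlopen(request) were called.
request = urllib.request.Request(
    URL,
    headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"},
)

print(default_agent)
print(request.get_header("User-agent"))
```

Note that Request normalizes header names with str.capitalize(), so the lookup key is "User-agent" rather than "User-Agent".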
-Nick Cash
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2012-10-01 17:19 +0000 |
| Message-ID | <k4cjbb$h4d$1@reader1.panix.com> |
| In reply to | #30616 |
On 2012-10-01, Nick Cash <nick.cash@npcinternational.com> wrote:
>> urllib2.urlopen('http://scholar.google.co.uk/scholar?q=albert
>>...
>> urllib2.HTTPError: HTTP Error 403: Forbidden
>>
>> Will you kindly explain me the way to get rid of this?
>
> Looks like Google blocks non-browser user agents from retrieving this
> query. You *could* work around it by setting the User-Agent header to
> something fake that looks browser-ish, but you're almost certainly
> breaking Google's TOS if you do so.
I don't know about that particular Google service, but Google often
provides an API that's intended for use by non-browser programs.
Those interfaces are usually both easier to use for the programmer and
impose less load on the servers.
--
Grant Edwards               grant.b.edwards        Yow! I am deeply CONCERNED
                                  at               and I want something GOOD
                              gmail.com            for BREAKFAST!
| From | রুদ্র ব্যাণার্জী <bnrj.rudra@gmail.com> |
|---|---|
| Date | 2012-10-01 18:28 +0100 |
| Message-ID | <1349112522.1787.5.camel@roddur> |
| In reply to | #30616 |
I know of one more Python app that does the same thing:
http://www.icir.org/christian/downloads/scholar.py
and a few other apps (such as Mendeley Desktop), for which I found an explanation
(from
http://academia.stackexchange.com/questions/2567/api-eula-and-scraping-for-google-scholar )
that:
"I know how Mendley uses it: they require you to click a button for each
individual search of Google Scholar. If they automatically did the
Google Scholar meta-data search for each paper when you import a
folder-full then they would violate the old Scholar EULA. That is why
they make you click for each query: if each query is accompanied by a
click and not part of some script or loop then it is in compliance with
the old EULA."
So, if I manage to set the User-Agent as you showed, will I still be
violating the Google EULA?
This is my first attempt at scraping HTML, so please help.
On Mon, 2012-10-01 at 16:51 +0000, Nick Cash wrote:
> > urllib2.urlopen('http://scholar.google.co.uk/scholar?q=albert
> >...
> > urllib2.HTTPError: HTTP Error 403: Forbidden
> > >>>
> >
> > Will you kindly explain me the way to get rid of this?
>
> Looks like Google blocks non-browser user agents from retrieving this query. You *could* work around it by setting the User-Agent header to something fake that looks browser-ish, but you're almost certainly breaking Google's TOS if you do so.
>
> Should you really really want to, urllib2 makes it easy:
> urllib2.urlopen(urllib2.Request("http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp=", headers={"User-Agent":"Mozilla/5.0 Cheater/1.0"}))
>
> -Nick Cash
| From | Jerry Hill <malaclypse2@gmail.com> |
|---|---|
| Date | 2012-10-01 14:09 -0400 |
| Message-ID | <mailman.1712.1349114977.27098.python-list@python.org> |
| In reply to | #30619 |
On Mon, Oct 1, 2012 at 1:28 PM, রুদ্র ব্যাণার্জী <bnrj.rudra@gmail.com> wrote:
> So, If I manage to use the User-Agent as shown by you, will I still
> violating the google EULA?
Very likely, yes. The overall Google Terms of Service (http://www.google.com/intl/en/policies/terms/) say "Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide."
The only method that Google appears to allow for accessing Scholar is via the web interface, and they explicitly block web scraping through that interface, as you discovered. It's true that you can get around their block, but I believe that doing so violates the terms of service.
Google does not appear to offer an API to access Scholar programmatically, nor do I see a more specific EULA or TOS for the Scholar service beyond that general TOS document.
That said, I am not a lawyer. If you want legal advice, you'll need to pay a lawyer for that advice.
--
Jerry