Groups > comp.lang.python > #51880 > unrolled thread

Crawl Quora

Started by	Umesh Sharma <usharma01@gmail.com>
First post	2013-08-03 12:01 -0700
Last post	2013-08-07 23:51 -0400
Articles	3 — 3 participants

Back to article view | Back to comp.lang.python

  Crawl Quora Umesh Sharma <usharma01@gmail.com> - 2013-08-03 12:01 -0700
    Re: Crawl Quora Dave Angel <davea@davea.name> - 2013-08-03 21:09 +0000
    Re: Crawl Quora David Hutto <dwightdhutto@gmail.com> - 2013-08-07 23:51 -0400

#51880 — Crawl Quora

From	Umesh Sharma <usharma01@gmail.com>
Date	2013-08-03 12:01 -0700
Subject	Crawl Quora
Message-ID	<154b7f76-9491-4eb5-813b-c1d8c76cf054@googlegroups.com>

Hello,

I am writing a crawler in python, which crawl quora. I can't read the content of quora without login. But google/bing crawls quora. One thing i can do is use browser automation and login in my account and the go links by link and crawl content, but this method is slow. So can any one tell me how should i start in writing this crawler.


Thanks,
Umesh Kumar Sharma

[toc] | [next] | [standalone]

#51890

From	Dave Angel <davea@davea.name>
Date	2013-08-03 21:09 +0000
Message-ID	<mailman.172.1375564169.1251.python-list@python.org>
In reply to	#51880

Umesh Sharma wrote:

> Hello,
>
> I am writing a crawler in python, which crawl quora. I can't read the content of quora without login. But google/bing crawls quora. One thing i can do is use browser automation and login in my account and the go links by link and crawl content, but this method is slow. So can any one tell me how should i start in writing this crawler.
>
>
I had never heard of quora.  And I had to hunt a bit to find a link to
this website.  When you post a question here which refers to a
non-Python site, you really should include a link to it.

You start with reading the page:  http://www.quora.com/about/tos

which you agreed to when you created your account with them.  At one
place it seems pretty clear that unless you make specific arrangements
with Quora, you're limited to using their API.

I suspect that they bend over backwards to get Google and the other big
names to index their stuff.  But that doesn't make it legal for you to
do the same.

In particular, the section labeled "Rules" makes constraints on
automated crawling.  And so do other parts of the TOS.  Crawling is
permissible, but not scraping.  What's that mean?  I dunno.  Perhaps
scraping is what you're describing above as "method is slow."

I'm going to be looking to see what API's they offer, if any.  I'm
creating an account now.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]

#52163

From	David Hutto <dwightdhutto@gmail.com>
Date	2013-08-07 23:51 -0400
Message-ID	<mailman.335.1375933916.1251.python-list@python.org>
In reply to	#51880

[Multipart message — attachments visible in raw view] — view raw

Never tried this, but if it's not data you're after, but a search term type
of app, then ip address crawl, and if keyword/metadata, then crawl, and
parse, just as it seems you are doing, for keywords, and url's associated
with them, then eliminate url's without that specified keyword parameter
into your function.

Then, of course, just as stated above, some sites won't let you have access
in other ways, which you should be able to circumvent some way.



On Sat, Aug 3, 2013 at 5:09 PM, Dave Angel <davea@davea.name> wrote:

> Umesh Sharma wrote:
>
> > Hello,
> >
> > I am writing a crawler in python, which crawl quora. I can't read the
> content of quora without login. But google/bing crawls quora. One thing i
> can do is use browser automation and login in my account and the go links
> by link and crawl content, but this method is slow. So can any one tell me
> how should i start in writing this crawler.
> >
> >
> I had never heard of quora.  And I had to hunt a bit to find a link to
> this website.  When you post a question here which refers to a
> non-Python site, you really should include a link to it.
>
> You start with reading the page:  http://www.quora.com/about/tos
>
> which you agreed to when you created your account with them.  At one
> place it seems pretty clear that unless you make specific arrangements
> with Quora, you're limited to using their API.
>
> I suspect that they bend over backwards to get Google and the other big
> names to index their stuff.  But that doesn't make it legal for you to
> do the same.
>
> In particular, the section labeled "Rules" makes constraints on
> automated crawling.  And so do other parts of the TOS.  Crawling is
> permissible, but not scraping.  What's that mean?  I dunno.  Perhaps
> scraping is what you're describing above as "method is slow."
>
> I'm going to be looking to see what API's they offer, if any.  I'm
> creating an account now.
>
> --
> DaveA
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>



-- 
Best Regards,
David Hutto
*CEO:* *http://www.hitwebdevelopment.com*

[toc] | [prev] | [standalone]

csiph-web

Crawl Quora

Contents

#51880 — Crawl Quora

#51890

#52163