
Re: Crawl Quora

From Dave Angel <davea@davea.name>
Subject Re: Crawl Quora
Date Sat, 3 Aug 2013 21:09:05 +0000 (UTC)
Newsgroups comp.lang.python
Message-ID <mailman.172.1375564169.1251.python-list@python.org>


Umesh Sharma wrote:

> Hello,
>
> I am writing a crawler in Python which crawls Quora. I can't read Quora's content without logging in, but Google and Bing crawl Quora. One thing I can do is use browser automation to log in to my account and then go link by link and crawl the content, but this method is slow. So can anyone tell me how I should start writing this crawler?
>
>
I had never heard of Quora.  And I had to hunt a bit to find a link to
this website.  When you post a question here which refers to a
non-Python site, you really should include a link to it.

You start with reading the page:  http://www.quora.com/about/tos

which you agreed to when you created your account with them.  At one
place it seems pretty clear that unless you make specific arrangements
with Quora, you're limited to using their API.

I suspect that they bend over backwards to get Google and the other big
names to index their stuff.  But that doesn't make it legal for you to
do the same.

In particular, the section labeled "Rules" places constraints on
automated crawling.  And so do other parts of the TOS.  Crawling is
permissible, but not scraping.  What does that mean?  I dunno.  Perhaps
scraping is what you're describing above as "method is slow."
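
As a rough first check before any automated crawling, a site's robots.txt
can be consulted from Python with the standard library.  This is only a
sketch: robots.txt is separate from the TOS and no substitute for reading
it, the sample rules below are made up for illustration (they are not
Quora's), and the helper name is_allowed and the "my-crawler" user agent
are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# It is NOT the real robots.txt of Quora or any other site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/about", "http://example.com/private/data"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", url))
```

For a live site one would normally call set_url() and read() on the parser
instead of parse(), but a pass from robots.txt still says nothing about
what the TOS allows.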

I'm going to be looking to see what APIs they offer, if any.  I'm
creating an account now.

-- 
DaveA



Thread

Crawl Quora Umesh Sharma <usharma01@gmail.com> - 2013-08-03 12:01 -0700
  Re: Crawl Quora Dave Angel <davea@davea.name> - 2013-08-03 21:09 +0000
  Re: Crawl Quora David Hutto <dwightdhutto@gmail.com> - 2013-08-07 23:51 -0400
