From: David Hutto
To: Dave Angel
Cc: python-list
Newsgroups: comp.lang.python
Subject: Re: Crawl Quora
Date: Wed, 7 Aug 2013 23:51:51 -0400
References: <154b7f76-9491-4eb5-813b-c1d8c76cf054@googlegroups.com>

I've never tried this, but if it's not data you're after but a search-term type of app, then do an IP address crawl. If it's keywords/metadata you want, then crawl and parse, just as it seems you are doing, for keywords and the URLs associated with them, then eliminate the URLs that don't match the keyword parameter you pass into your function.
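
A rough sketch of that last filtering step, using only the standard library. It assumes the page has already been fetched by whatever means the site permits; the names LinkCollector, filter_links_by_keyword and SAMPLE_PAGE are made up for illustration, and matching on both the URL and the link text is just one possible reading of "keyword parameter".

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def filter_links_by_keyword(html, keyword):
    """Return hrefs whose URL or anchor text contains keyword (case-insensitive)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    needle = keyword.lower()
    return [href for href, text in parser.links
            if needle in href.lower() or needle in text.lower()]


# Hypothetical page content, used only to exercise the filter.
SAMPLE_PAGE = """
<a href="/topic/python">Python questions</a>
<a href="/topic/cooking">Cooking</a>
<a href="/about">About Python and more</a>
"""

matches = filter_links_by_keyword(SAMPLE_PAGE, "python")
print(matches)
```

The cooking link is dropped because neither its URL nor its text mentions the keyword; the other two are kept.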

Then, of course, just as stated above, some sites won't let you have access in other ways, which you should be able to circumvent some way.



On Sat, Aug 3, 2013 at 5:09 PM, Dave Angel <davea@davea.name> wrote:
> Umesh Sharma wrote:
>
> > Hello,
> >
> > I am writing a crawler in Python which crawls Quora. I can't read the
> > content of Quora without logging in, but Google/Bing crawl Quora. One
> > thing I can do is use browser automation, log in to my account, and then
> > go link by link and crawl the content, but this method is slow. So can
> > anyone tell me how I should start writing this crawler?
> >
>
> I had never heard of quora.  And I had to hunt a bit to find a link to
> this website.  When you post a question here which refers to a
> non-Python site, you really should include a link to it.
>
> You start with reading the page:  http://www.quora.com/about/tos
>
> which you agreed to when you created your account with them.  At one
> place it seems pretty clear that unless you make specific arrangements
> with Quora, you're limited to using their API.
>
> I suspect that they bend over backwards to get Google and the other big
> names to index their stuff.  But that doesn't make it legal for you to
> do the same.
>
> In particular, the section labeled "Rules" places constraints on
> automated crawling.  And so do other parts of the TOS.  Crawling is
> permissible, but not scraping.  What's that mean?  I dunno.  Perhaps
> scraping is what you're describing above as "method is slow."
>
> I'm going to be looking to see what APIs they offer, if any.  I'm
> creating an account now.
>
> --
> DaveA
>
> --
> http://mail.python.org/mailman/listinfo/python-list



--
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com