Date: Sun, 18 Aug 2013 08:40:37 -0400
Subject: crawling/parsing a webpage based on dynamic javascript
From: bruce
To: python-list@python.org
Newsgroups: comp.lang.python

Hi.

I'm looking at using Python/Celery/Twisted to test parsing a test site. I'm also looking at being able to parse a site created using dynamic JavaScript.

I've got test apps to parse a site, but I'm interested in getting a better understanding of using multi-thread/multi-processing approaches to spin out as many fetch processes as possible.

At the same time, I'm interested in understanding a bit better what's used for parsing the JavaScript pages in the py world.

Also, rather than just point me to something like "scrapy", I'm actually interested in finding someone who's done this that I can talk to.

Heck, for the right person, I'll even toss some cash your way!!

Thanks