From: Joel Goldstick
To: Comment Holder
Cc: "python-list@python.org"
Date: Wed, 21 Aug 2013 13:52:18 -0400
Subject: Re: I wonder if I would be able to collect data from such page using Python
Newsgroups: comp.lang.python

On Wed, Aug 21, 2013 at 1:41 PM, Comment Holder wrote:
> Dear Joel,
>
> Many thanks for your help - I think I shall start with this way and see how it goes. My concerns were if the task can be accomplished with Python, and from your posts, I guess it can - so I shall give it a try :).
>
> Again, thanks a lot & all best//

You're welcome. One thought popped into my mind. Since the site seems to be from the Wall Street Journal, you may want to look into whether they have an API for searching and retrieving articles. If they do, this would be simpler and probably safer than parsing web pages. From time to time, websites change their layout, which would probably break your program. APIs, however, are more stable.

Good luck to you.

--
Joel Goldstick
http://joelgoldstick.com
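
A minimal sketch of the point about layout changes, using only the standard library. Everything in it is an assumption made for illustration: the JSON shape, the HTML markup, and the names `HeadlineScraper`, `titles_from_api`, and `titles_from_html` are hypothetical, and nothing here reflects an actual Wall Street Journal API or page. It shows a scraper tied to one tag and class returning nothing once the markup changes, while code reading a structured response is unaffected by the redesign.

```python
import json
from html.parser import HTMLParser

# Hypothetical payloads: neither the JSON shape nor the HTML markup comes from
# a real site; both are invented for this illustration.
API_RESPONSE = '{"articles": [{"title": "Markets rally", "date": "2013-08-21"}]}'
OLD_HTML = '<div class="headline">Markets rally</div>'
NEW_HTML = '<h2 class="story-title">Markets rally</h2>'  # same content after a redesign


class HeadlineScraper(HTMLParser):
    """Collect the text inside every <div class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "div" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.titles.append(data)


def titles_from_api(payload):
    """Read titles from a structured (JSON) response."""
    return [article["title"] for article in json.loads(payload)["articles"]]


def titles_from_html(page):
    """Read titles by scraping markup that may change without notice."""
    scraper = HeadlineScraper()
    scraper.feed(page)
    scraper.close()
    return scraper.titles


print(titles_from_api(API_RESPONSE))  # structured data, independent of page layout
print(titles_from_html(OLD_HTML))     # works while the markup matches
print(titles_from_html(NEW_HTML))     # silently finds nothing after the redesign
```

The scraper does not raise an error when the layout changes; it just returns an empty list, which is why such breakage can go unnoticed for a while.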