From: Arnaud Delobelle
To: Jabba Laci
Cc: Python mailing list
Newsgroups: comp.lang.python
Subject: Re: extract HTML table in a structured format
Date: Wed, 10 Apr 2013 19:11:33 +0100

On 10 April 2013 09:44, Jabba Laci wrote:
> Hi,
>
> I wonder if there is a nice way to extract a whole HTML table and have the
> result in a nice structured format. What I want is to have the lifetime
> table at the bottom of this page:
> http://en.wikipedia.org/wiki/List_of_Ubuntu_releases (then figure out with a
> script until when my Ubuntu release is supported).
>
> I could do it with BeautifulSoup or lxml but is there a better way? There
> should be :)

Instead of parsing HTML, you could just parse the source of the page
(available via action=raw):

------------------------------
import urllib2

url = (
    'http://en.wikipedia.org/w/index.php'
    '?title=List_of_Ubuntu_releases&action=raw'
)

source = urllib2.urlopen(url).read()

# Table rows are separated with the line "|-"
# Then there is a line starting with "|"
potential_rows = source.split("\n|-\n|")

rows = []

for row in potential_rows:
    # Rows in the table start with a link (' [[ ... ]]')
    if row.startswith(" [["):
        row = [item.strip() for item in row.split("\n|")]
        rows.append(row)
------------------------------

>>> import pprint
>>> pprint.pprint(rows)
[['[[Warty Warthog|4.10]]',
  'Warty Warthog',
  '20 October 2004',
  'colspan="2" {{Version |o |30 April 2006}}',
  '2.6.8'],
 ['[[Hoary Hedgehog|5.04]]',
  'Hoary Hedgehog',
  '8 April 2005',
  'colspan="2" {{Version |o |31 October 2006}}',
  '2.6.10'],
 ['[[Breezy Badger|5.10]]',
  'Breezy Badger',
  '13 October 2005',
  'colspan="2" {{Version |o |13 April 2007}}',
  '2.6.12'],
 ['[[Ubuntu 6.06|6.06 LTS]]',
  'Dapper Drake',
  '1 June 2006',
  '{{Version |o | 14 July 2009}}',
  '{{Version |o | 1 June 2011}}',
  '2.6.15'],
 ['[[Ubuntu 6.10|6.10]]',
  'Edgy Eft',
  '26 October 2006',
  'colspan="2" {{Version |o | 25 April 2008}}',
  '2.6.17'],
 [...]
]
>>>

That should give you the info you need (until the wiki page changes too much!)

-- 
Arnaud
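
To get from those rows to actual dates (the "until when is my release supported" part of the question), something like the following might work. It is only a minimal sketch written for Python 3, and it rests on several assumptions: that the dates of interest always sit inside a `{{Version |... |<date>}}` cell as in the sample output above, that they are presumably the end-of-support dates, and that they are always written as day, English month name, year. The names `VERSION_DATE`, `support_end_dates` and `sample_rows` are made up for illustration, and `sample_rows` just copies two rows from the output above so that no network access is needed.

```python
import re
from datetime import datetime

# Assumed cell shape, taken from the sample output:
#   '{{Version |o |30 April 2006}}' or 'colspan="2" {{Version |o | 25 April 2008}}'
# The second field ("o" in every sample row) is skipped; the third is the date.
VERSION_DATE = re.compile(r"\{\{Version\s*\|[^|}]*\|\s*([^}]+?)\s*\}\}")


def support_end_dates(row):
    """Return the dates found in the {{Version ...}} cells of one parsed row."""
    dates = []
    for cell in row:
        match = VERSION_DATE.search(cell)
        if match:
            # "%d %B %Y" expects e.g. "30 April 2006"; %B depends on the
            # locale, so this assumes English month names are recognised.
            dates.append(datetime.strptime(match.group(1), "%d %B %Y").date())
    return dates


# Two rows copied from the pprint output above (hypothetical stand-in for `rows`).
sample_rows = [
    ['[[Warty Warthog|4.10]]', 'Warty Warthog', '20 October 2004',
     'colspan="2" {{Version |o |30 April 2006}}', '2.6.8'],
    ['[[Ubuntu 6.06|6.06 LTS]]', 'Dapper Drake', '1 June 2006',
     '{{Version |o | 14 July 2009}}', '{{Version |o | 1 June 2011}}', '2.6.15'],
]

for row in sample_rows:
    # row[1] is the code name; a row yields one date (colspan="2") or two.
    print(row[1], support_end_dates(row))
```

Rows with `colspan="2"` give a single date and the others give two, so the caller still has to decide which one applies. Cells for releases that are still supported may use a different template form on the wiki page, in which case the pattern would need adjusting.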