Re: extract HTML table in a structured format

Date	Wed, 10 Apr 2013 19:11:33 +0100
Subject	Re: extract HTML table in a structured format
From	Arnaud Delobelle <arnodel@gmail.com>
To	Jabba Laci <jabba.laci@gmail.com>
Content-Type	text/plain; charset=UTF-8
Cc	Python mailing list <python-list@python.org>
Newsgroups	comp.lang.python
Message-ID	<mailman.420.1365617501.3114.python-list@python.org>

On 10 April 2013 09:44, Jabba Laci <jabba.laci@gmail.com> wrote:
> Hi,
>
> I wonder if there is a nice way to extract a whole HTML table and have the
> result in a nice structured format. What I want is to have the lifetime
> table at the bottom of this page:
> http://en.wikipedia.org/wiki/List_of_Ubuntu_releases (then figure out with a
> script until when my Ubuntu release is supported).
>
> I could do it with BeautifulSoup or lxml but is there a better way? There
> should be :)

Instead of parsing HTML, you could just parse the wikitext source of the
page (available via action=raw):

------------------------------
import urllib2

url = (
    'http://en.wikipedia.org/w/index.php'
    '?title=List_of_Ubuntu_releases&action=raw'
)

source = urllib2.urlopen(url).read()

# Table rows are separated with the line "|-"
# Then there is a line starting with "|"
potential_rows = source.split("\n|-\n|")

rows = []

for row in potential_rows:
    # Rows in the table start with a link (' [[ ... ]]')
    if row.startswith(" [["):
        row = [item.strip() for item in row.split("\n|")]
        rows.append(row)
------------------------------
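
On Python 3, urllib2 has become urllib.request and read() returns bytes, so the script above needs small changes. Here is a rough sketch of the same splitting approach; the names fetch_source and parse_rows are made up for illustration, and it still assumes the wikitext layout the page had when this was written:

```python
from urllib.request import urlopen

URL = (
    'http://en.wikipedia.org/w/index.php'
    '?title=List_of_Ubuntu_releases&action=raw'
)


def fetch_source(url=URL):
    """Download the raw wikitext and decode it to str (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8')


def parse_rows(source):
    """Split wikitext into table rows, keeping rows that start with a link."""
    rows = []
    # Table rows are separated with the line "|-",
    # then there is a line starting with "|"
    for row in source.split("\n|-\n|"):
        # Rows in the table start with a link (' [[ ... ]]')
        if row.startswith(" [["):
            rows.append([item.strip() for item in row.split("\n|")])
    return rows
```

Keeping the parsing in its own function means it can be tried on a saved copy of the page source, e.g. rows = parse_rows(fetch_source()).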

>>> import pprint
>>> pprint.pprint(rows)
[['[[Warty Warthog|4.10]]',
  'Warty Warthog',
  '20 October 2004',
  'colspan="2" {{Version |o |30 April 2006}}',
  '2.6.8'],
 ['[[Hoary Hedgehog|5.04]]',
  'Hoary Hedgehog',
  '8 April 2005',
  'colspan="2" {{Version |o |31 October 2006}}',
  '2.6.10'],
 ['[[Breezy Badger|5.10]]',
  'Breezy Badger',
  '13 October 2005',
  'colspan="2" {{Version |o |13 April 2007}}',
  '2.6.12'],
 ['[[Ubuntu 6.06|6.06 LTS]]',
  'Dapper Drake',
  '1 June 2006',
  '{{Version |o | 14 July 2009}}',
  '{{Version |o | 1 June 2011}}',
  '2.6.15'],
 ['[[Ubuntu 6.10|6.10]]',
  'Edgy Eft',
  '26 October 2006',
  'colspan="2" {{Version |o | 25 April 2008}}',
  '2.6.17'],
  [...]
]
>>>

That should give you the info you need (until the wiki page changes too much!)
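
To get from those rows to an actual end-of-support date, one option is to pull the date out of the {{Version ...}} cells. This is only a sketch: the regular expression is guessed from the cells shown above, the function names are made up, and taking the later of two dates in a row is an assumption you should check against the table's column headers. Note that %B parses English month names only under an English or C locale.

```python
import re
from datetime import datetime

# Matches the date inside a cell such as '{{Version |o |30 April 2006}}'.
# The pattern is inferred from the sample output above, not from any spec.
VERSION_DATE = re.compile(r"\{\{Version\s*\|\s*\w+\s*\|\s*([^}]+?)\s*\}\}")


def support_end_dates(row):
    """Return the dates found in the {{Version ...}} cells of one row."""
    dates = []
    for cell in row:
        match = VERSION_DATE.search(cell)
        if match:
            parsed = datetime.strptime(match.group(1), "%d %B %Y")
            dates.append(parsed.date())
    return dates


def supported_until(row):
    """Return the latest date in the row, or None if no date was found."""
    dates = support_end_dates(row)
    return max(dates) if dates else None
```

With the rows list from the script, supported_until(rows[0]) would give the date for the first release in the table.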

-- 
Arnaud
