From: Jon Clements
Newsgroups: comp.lang.python
Subject: Re: Fetching data from a HTML file
Date: Fri, 23 Mar 2012 22:12:46 -0700 (PDT)

On Friday, 23 March 2012 13:52:05 UTC, Sangeet wrote:
> Hi,
>
> I've got to fetch data from the snippet below and have been trying to
> match the digits in this to specific groups. But I can't seem to figure
> out how to go about stripping the tags! :(
>
> Sum2451102561.496 [min]
>
> Actually, I'm working on ROBOT Framework, and haven't been able to figure
> out how to read data from HTML tables. Reading from the source is the
> best (read: rudimentary) way I could come up with. Any suggestions are
> welcome!
>
> Thanks,
> Sangeet

I would personally use lxml - a quick example:

# -*- coding: utf-8 -*-
import lxml.html

text = """
Sum2451102561.496 [min]
"""

# Parse the fragment, then collect (class, text) for every cell in each row.
table = lxml.html.fromstring(text)
for tr in table.xpath('//tr'):
    print([(el.get('class', ''), el.text_content())
           for el in tr.iterfind('td')])

[('', 'Sum'), ('', ''), ('green', '245'), ('red', '11'), ('', '0'),
 ('', '256'), ('', '1.496 [min]')]

It does a reasonable job, but if it doesn't work quite right, then there's
a .fromstring(parser=...) option, and you should be able to pass in
ElementSoup and try your luck from there.

hth,

Jon.
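
For anyone wanting to run the above end to end, here is a minimal sketch. The table markup did not survive in this post, so the HTML below is an assumption, reconstructed only from the printed output (seven cells in one row, with "green" and "red" classes on the third and fourth). The helper name extract_rows is hypothetical, and the .strip() call is an addition to guard against stray whitespace in real pages.

```python
import lxml.html

# ASSUMPTION: this markup is inferred from the printed output in the post,
# not copied from the original page.
HTML = """
<table>
  <tr>
    <td>Sum</td><td></td>
    <td class="green">245</td><td class="red">11</td>
    <td>0</td><td>256</td><td>1.496 [min]</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one list of (class, text) pairs per <tr> (hypothetical helper)."""
    root = lxml.html.fromstring(html)
    rows = []
    for tr in root.xpath('//tr'):
        rows.append([(td.get('class', ''), td.text_content().strip())
                     for td in tr.iterfind('td')])
    return rows


rows = extract_rows(HTML)
print(rows)

# Map the classed cells of the first row to their values, which is one way
# to get the digits into "specific groups".
by_class = {cls: text for cls, text in rows[0] if cls}
print(by_class)
```

Keying on the class attribute rather than on cell position is just one design choice; if the real table has no usable classes, indexing into each row list by position would do the same job.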