Path: csiph.com!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: "Matt" <matt@centralkaos.com>
Newsgroups: comp.lang.python
Subject: RE: web scraping help / better way to do it ?
Date: Tue, 19 Jan 2016 22:19:48 +1100
Lines: 125
Message-ID: <mailman.108.1453202402.15297.python-list@python.org>
References: <000001d1523b$a76cf9a0$f646ece0$@centralkaos.com> <n7l36e$g73$1@ger.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <n7l36e$g73$1@ger.gmane.org>
Thread-Index: AQI6mk3phqu7O57QbJwYWhLKxpk/fQInv5iMnh7j4oA=
Content-Language: en-au
Precedence: list
Xref: csiph.com comp.lang.python:101901



> -----Original Message-----
> From: Python-list [mailto:python-list-
> bounces+matt=centralkaos.com@python.org] On Behalf Of Peter Otten
> Sent: Tuesday, 19 January 2016 9:30 PM
> To: python-list@python.org
> Subject: Re: web scraping help / better way to do it ?
> 
> Matt wrote:
> 
> > Beginner Python user (3.5) trying to scrape this page and get the
> > ladder: www.afl.com.au/ladder . It's dynamic content, so I used
> > lynx -dump to get a txt file and am parsing that.
> >
> > Here is the code
> >
> > # import lynx -dump txt file
> > f = open('c:/temp/afl2.txt','r').read()
> >
> > # Put import txt file into list
> > afl_list = f.split(' ')
> >
> > # here are the things we want to search for
> > search_list = ['FRE', 'WCE', 'HAW', 'SYD', 'RICH', 'WB', 'ADEL', 'NMFC',
> >                'PORT', 'GEEL', 'GWS', 'COLL', 'MELB', 'STK', 'ESS', 'GCFC',
> >                'BL', 'CARL']
> >
> > def build_ladder():
> >     for l in search_list:
> >         output_num = afl_list.index(l)
> >         list_pos = output_num -1
> >         ladder_pos = afl_list[list_pos]
> >         print(ladder_pos + ' ' + '-' + ' ' + l)
> >
> > build_ladder()
> >
> >
> > Which outputs this.
> >
> > 1 - FRE
> > 2 - WCE
> > 3 - HAW
> > 4 - SYD
> > 5 - RICH
> > 6 - WB
> > 7 - ADEL
> > 8 - NMFC
> > 9 - PORT
> > 10 - GEEL
> > * - GWS
> > 12 - COLL
> > 13 - MELB
> > 14 - STK
> > 15 - ESS
> > 16 - GCFC
> > 17 - BL
> > 18 - CARL
> >
> > Notice that number 11 is missing because my script picks up "GWS"
> > which is located earlier in the page.  What is the best way to skip
> > that (and get the "GWS" lower down in the txt file) or am I better off
> > approaching the code in a different way?
> 
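For anyone who wants to stay with the text-dump approach, here is a minimal
sketch of one way to skip the early "GWS": accept an abbreviation only when
the token in front of it is a number. The helper name ladder_position and the
sample string are made up for illustration; the real lynx dump will look
different. It also uses split() with no argument, which splits on any
whitespace (including newlines) rather than on single spaces only.

```python
def ladder_position(tokens, abbr):
    """Return the token before the first `abbr` that follows a number.

    Hypothetical helper: returns None when no such occurrence exists.
    """
    for i, token in enumerate(tokens):
        if i > 0 and token == abbr and tokens[i - 1].isdigit():
            return tokens[i - 1]
    return None


# Made-up stand-in for the lynx dump: "GWS" shows up once in a menu-like
# line and then again in the ladder itself.
sample = "* GWS Giants news 10 GEEL 11 GWS 12 COLL"
tokens = sample.split()  # split on any run of whitespace

for abbr in ["GEEL", "GWS", "COLL"]:
    print(ladder_position(tokens, abbr), "-", abbr)
```

This still depends on how the dump happens to be laid out, so it is a
workaround rather than a robust fix.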
> If you look at the html source you'll see that the desired "GWS" is
> inside a table, together with the other abbreviations. To extract
> (parts of) that table you should use a tool that understands the
> structure of html.
> 
> The most popular library to parse html with Python is BeautifulSoup,
> but my example uses lxml:
> 
> $ cat ladder.py
> import urllib.request
> import io
> import lxml.html
> 
> def first(row, xpath):
>     return row.xpath(xpath)[0].strip()
> 
> html = urllib.request.urlopen("http://www.afl.com.au/ladder").read()
> tree = lxml.html.parse(io.BytesIO(html))
> 
> for row in tree.xpath("//tr")[1:]:
>     print(
>         first(row, ".//td[1]/span/text()"),
>         first(row, ".//abbr/text()"))
> 
> $ python3 ladder.py
> 1 FRE
> 2 WCE
> 3 HAW
> 4 SYD
> 5 RICH
> 6 WB
> 7 ADEL
> 8 NMFC
> 9 PORT
> 10 GEEL
> 11 GWS
> 12 COLL
> 13 MELB
> 14 STK
> 15 ESS
> 16 GCFC
> 17 BL
> 18 CARL
> 
> 
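The same "walk the table rows" idea can also be sketched with only the
standard library, in case installing lxml is not an option. This is an
illustration, not a drop-in replacement for the script above: the class name
LadderParser and the SAMPLE markup are assumptions, and the real page's
markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the ladder table; not the real page source.
SAMPLE = """
<table>
  <tr><th>Pos</th><th>Club</th></tr>
  <tr><td><span> 1 </span></td><td><abbr title="Fremantle"> FRE </abbr></td></tr>
  <tr><td><span> 11 </span></td><td><abbr title="GWS Giants"> GWS </abbr></td></tr>
</table>
"""


class LadderParser(HTMLParser):
    """Collect (position, abbreviation) pairs from table rows."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished (position, abbreviation) pairs
        self._cells = None    # text of each <td> in the current row
        self._in_td = False
        self._in_abbr = False
        self._abbr = ""

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._abbr = [], ""
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")
        elif tag == "abbr":
            self._in_abbr = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "abbr":
            self._in_abbr = False
        elif tag == "tr" and self._cells is not None:
            # Header rows have no <td> cells and no <abbr>, so they are skipped.
            if self._cells and self._abbr.strip():
                self.rows.append((self._cells[0].strip(), self._abbr.strip()))
            self._cells = None

    def handle_data(self, data):
        if self._in_td and self._cells:
            self._cells[-1] += data
        if self._in_abbr:
            self._abbr += data


parser = LadderParser()
parser.feed(SAMPLE)
for pos, abbr in parser.rows:
    print(pos, abbr)
```

It is noticeably more code than the lxml version, which is a fair argument
for using a proper HTML library when one is available.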
> Someone with better knowledge of XPath could probably avoid some of the
> postprocessing I do in Python.
> 
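On the XPath remark, one possible way to move the postprocessing into the
expressions is sketched below. It assumes lxml is installed; the SAMPLE
markup and the function name ladder_rows are made up for illustration, and
the real page may need different paths.

```python
import lxml.html

# Hypothetical stand-in for the ladder table; not the real page source.
SAMPLE = """
<table>
  <tr><th>Pos</th><th>Club</th></tr>
  <tr><td><span> 1 </span></td><td><abbr title="Fremantle"> FRE </abbr></td></tr>
  <tr><td><span> 11 </span></td><td><abbr title="GWS Giants"> GWS </abbr></td></tr>
</table>
"""


def ladder_rows(tree):
    """Return (position, abbreviation) pairs for every data row."""
    rows = []
    # //tr[td] keeps only rows that contain <td> cells, so the header row
    # is skipped without slicing the result list in Python.
    for row in tree.xpath("//tr[td]"):
        # normalize-space() returns a plain string with leading and
        # trailing whitespace removed, so neither [0] nor .strip() is needed.
        pos = row.xpath("normalize-space(td[1]/span)")
        abbr = row.xpath("normalize-space(.//abbr)")
        rows.append((pos, abbr))
    return rows


tree = lxml.html.fromstring(SAMPLE)
for pos, abbr in ladder_rows(tree):
    print(pos, abbr)
```

One trade-off: normalize-space() yields an empty string when nothing
matches, whereas indexing with [0] raises an IndexError, so missing cells
fail silently here.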
> --
Thanks Peter, you opened my eyes to half a dozen things here, just what I
needed.

Much appreciated

Cheers
- Matt