
RE: web scraping help / better way to do it ?

From	"Matt" <matt@centralkaos.com>
Newsgroups	comp.lang.python
Subject	RE: web scraping help / better way to do it ?
Date	Tue, 19 Jan 2016 22:19:48 +1100
Message-ID	<mailman.108.1453202402.15297.python-list@python.org> (permalink)
References	<000001d1523b$a76cf9a0$f646ece0$@centralkaos.com> <n7l36e$g73$1@ger.gmane.org>



> -----Original Message-----
> From: Python-list [mailto:python-list-
> bounces+matt=centralkaos.com@python.org] On Behalf Of Peter Otten
> Sent: Tuesday, 19 January 2016 9:30 PM
> To: python-list@python.org
> Subject: Re: web scraping help / better way to do it ?
> 
> Matt wrote:
> 
> > Beginner Python user (3.5), trying to scrape this page and get the
> > ladder - www.afl.com.au/ladder . It's dynamic content, so I used
> > lynx -dump to get a txt file and am parsing that.
> >
> > Here is the code
> >
> > # import lynx -dump txt file
> > f = open('c:/temp/afl2.txt','r').read()
> >
> > # Put import txt file into list
> > afl_list = f.split(' ')
> >
> > # here are the things we want to search for
> > search_list = ['FRE', 'WCE', 'HAW', 'SYD', 'RICH', 'WB', 'ADEL', 'NMFC',
> >                'PORT', 'GEEL', 'GWS', 'COLL', 'MELB', 'STK', 'ESS',
> >                'GCFC', 'BL', 'CARL']
> >
> > def build_ladder():
> >     for l in search_list:
> >         output_num = afl_list.index(l)
> >         list_pos = output_num -1
> >         ladder_pos = afl_list[list_pos]
> >         print(ladder_pos + ' ' + '-' + ' ' + l)
> >
> > build_ladder()
> >
> >
> > Which outputs this.
> >
> > 1 - FRE
> > 2 - WCE
> > 3 - HAW
> > 4 - SYD
> > 5 - RICH
> > 6 - WB
> > 7 - ADEL
> > 8 - NMFC
> > 9 - PORT
> > 10 - GEEL
> > * - GWS
> > 12 - COLL
> > 13 - MELB
> > 14 - STK
> > 15 - ESS
> > 16 - GCFC
> > 17 - BL
> > 18 - CARL
> >
> > Notice that number 11 is missing because my script picks up "GWS"
> > which is located earlier in the page.  What is the best way to skip
> > that (and get the "GWS" lower down in the txt file) or am I better off
> > approaching the code in a different way?
> 
> If you look at the html source you'll see that the desired "GWS" is inside
> a table, together with the other abbreviations. To extract (parts of) that
> table you should use a tool that understands the structure of html.
> 
> The most popular library to parse html with Python is BeautifulSoup, but
> my example uses lxml:
> 
> $ cat ladder.py
> import urllib.request
> import io
> import lxml.html
> 
> def first(row, xpath):
>     return row.xpath(xpath)[0].strip()
> 
> html = urllib.request.urlopen("http://www.afl.com.au/ladder").read()
> tree = lxml.html.parse(io.BytesIO(html))
> 
> for row in tree.xpath("//tr")[1:]:
>     print(
>         first(row, ".//td[1]/span/text()"),
>         first(row, ".//abbr/text()"))
> 
> $ python3 ladder.py
> 1 FRE
> 2 WCE
> 3 HAW
> 4 SYD
> 5 RICH
> 6 WB
> 7 ADEL
> 8 NMFC
> 9 PORT
> 10 GEEL
> 11 GWS
> 12 COLL
> 13 MELB
> 14 STK
> 15 ESS
> 16 GCFC
> 17 BL
> 18 CARL
> 
> 
> Someone with better knowledge of XPath could probably avoid some of the
> postprocessing I do in Python.
> 
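
For comparison, here is a minimal sketch of the same row-by-row idea using
only the standard library's html.parser, in case lxml or BeautifulSoup is
not installed. The SAMPLE_HTML markup, the LadderParser class and the
parse_ladder helper are all hypothetical; the markup only mimics the layout
the XPath expressions above imply (position in a span in the first cell,
abbreviation in an abbr), and the real page may well differ.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<p>Next match: <abbr>GWS</abbr> v <abbr>SYD</abbr></p>
<table>
  <tr><th>Pos</th><th>Club</th></tr>
  <tr><td><span>10</span></td><td><abbr>GEEL</abbr></td></tr>
  <tr><td><span>11</span></td><td><abbr>GWS</abbr></td></tr>
  <tr><td><span>12</span></td><td><abbr>COLL</abbr></td></tr>
</table>
"""


class LadderParser(HTMLParser):
    """Collect (position, abbreviation) pairs from table rows only."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (position, abbreviation) pairs
        self.in_row = False  # True while inside a <tr>
        self.capture = None  # "pos" or "abbr" while inside the relevant tag
        self.current = {}    # fields collected for the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
            self.current = {}
        elif self.in_row and tag == "span" and "pos" not in self.current:
            self.capture = "pos"
        elif self.in_row and tag == "abbr" and "abbr" not in self.current:
            self.capture = "abbr"

    def handle_data(self, data):
        # Store the first non-blank text seen inside a span or abbr.
        if self.capture and data.strip():
            self.current[self.capture] = data.strip()
            self.capture = None

    def handle_endtag(self, tag):
        if tag in ("span", "abbr"):
            self.capture = None
        elif tag == "tr":
            # Header rows have no span/abbr pair, so they are skipped.
            if "pos" in self.current and "abbr" in self.current:
                self.rows.append((self.current["pos"], self.current["abbr"]))
            self.in_row = False


def parse_ladder(html):
    parser = LadderParser()
    parser.feed(html)
    parser.close()
    return parser.rows


ladder = parse_ladder(SAMPLE_HTML)
for pos, abbr in ladder:
    print(pos, abbr)
```

The "GWS" in the paragraph above the table is ignored because it is not
inside a tr, which is the point of using a parser that understands the
structure rather than searching the flat text.
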
> --
Thanks Peter, you opened my eyes to half a dozen things here. Just what I
needed.

Much appreciated

Cheers
- Matt
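
For anyone who wants to stay with the lynx -dump text file, the original
question (skipping the earlier "GWS") can also be handled without an html
parser. The sketch below is only an illustration under assumptions: the
SAMPLE_TEXT layout is made up, and it assumes the ladder position is always
the token immediately before the abbreviation, as in the original script.
It accepts an abbreviation only when the preceding token is a number, so a
stray match such as "* GWS" is skipped.

```python
# Hypothetical excerpt of a lynx -dump file; the real dump may look different.
SAMPLE_TEXT = """
Next match * GWS v SYD
10 GEEL 22 12 10
11 GWS 22 11 11
12 COLL 22 10 12
"""

search_list = ['GEEL', 'GWS', 'COLL']


def build_ladder(text, teams):
    """Map each team abbreviation to the number that directly precedes it."""
    # split() with no argument splits on any whitespace, including newlines,
    # unlike split(' ') in the original script.
    tokens = text.split()
    ladder = {}
    for prev, tok in zip(tokens, tokens[1:]):
        # Keep only the first occurrence that follows a numeric token.
        if tok in teams and prev.isdigit() and tok not in ladder:
            ladder[tok] = int(prev)
    return ladder


ladder = build_ladder(SAMPLE_TEXT, search_list)
for team, pos in sorted(ladder.items(), key=lambda item: item[1]):
    print(pos, '-', team)
```

This is still more fragile than parsing the table, since any number that
happens to sit in front of an abbreviation elsewhere in the dump would be
picked up as well.
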
