From: "Matt"
Newsgroups: comp.lang.python
Subject: RE: web scraping help / better way to do it ?
Date: Tue, 19 Jan 2016 22:19:48 +1100
References: <000001d1523b$a76cf9a0$f646ece0$@centralkaos.com>

> -----Original Message-----
> From: Python-list [mailto:python-list-bounces+matt=centralkaos.com@python.org]
> On Behalf Of Peter Otten
> Sent: Tuesday, 19 January 2016 9:30 PM
> To: python-list@python.org
> Subject: Re: web scraping help / better way to do it ?
>
> Matt wrote:
>
> > Beginner python user (3.5) and trying to scrape this page and get the
> > ladder - www.afl.com.au/ladder . Its dynamic content so I used
> > lynx -dump to get a txt file and parsing that.
> >
> > Here is the code
> >
> > # import lynx -dump txt file
> > f = open('c:/temp/afl2.txt','r').read()
> >
> > # Put import txt file into list
> > afl_list = f.split(' ')
> >
> > #here are the things we want to search for
> > search_list = ['FRE', 'WCE', 'HAW', 'SYD', 'RICH', 'WB', 'ADEL', 'NMFC',
> >                'PORT', 'GEEL', 'GWS', 'COLL', 'MELB', 'STK', 'ESS', 'GCFC',
> >                'BL', 'CARL']
> >
> > def build_ladder():
> >     for l in search_list:
> >         output_num = afl_list.index(l)
> >         list_pos = output_num -1
> >         ladder_pos = afl_list[list_pos]
> >         print(ladder_pos + ' ' + '-' + ' ' + l)
> >
> > build_ladder()
> >
> >
> > Which outputs this.
> >
> > 1 - FRE
> > 2 - WCE
> > 3 - HAW
> > 4 - SYD
> > 5 - RICH
> > 6 - WB
> > 7 - ADEL
> > 8 - NMFC
> > 9 - PORT
> > 10 - GEEL
> > * - GWS
> > 12 - COLL
> > 13 - MELB
> > 14 - STK
> > 15 - ESS
> > 16 - GCFC
> > 17 - BL
> > 18 - CARL
> >
> > Notice that number 11 is missing because my script picks up "GWS"
> > which is located earlier in the page. What is the best way to skip
> > that (and get the "GWS" lower down in the txt file) or am I better off
> > approaching the code in a different way?
>
> If you look at the html source you'll see that the desired "GWS" is inside a
> table, together with the other abbreviations. To extract (parts of) that table
> you should use a tool that understands the structure of html.
>
> The most popular library to parse html with Python is BeautifulSoup, but my
> example uses lxml:
>
> $ cat ladder.py
> import urllib.request
> import io
> import lxml.html
>
> def first(row, xpath):
>     return row.xpath(xpath)[0].strip()
>
> html = urllib.request.urlopen("http://www.afl.com.au/ladder").read()
> tree = lxml.html.parse(io.BytesIO(html))
>
> for row in tree.xpath("//tr")[1:]:
>     print(
>         first(row, ".//td[1]/span/text()"),
>         first(row, ".//abbr/text()"))
>
> $ python3 ladder.py
> 1 FRE
> 2 WCE
> 3 HAW
> 4 SYD
> 5 RICH
> 6 WB
> 7 ADEL
> 8 NMFC
> 9 PORT
> 10 GEEL
> 11 GWS
> 12 COLL
> 13 MELB
> 14 STK
> 15 ESS
> 16 GCFC
> 17 BL
> 18 CARL
>
>
> Someone with better knowledge of XPath could probably avoid some of the
> postprocessing I do in Python.
>
> --

Thanks Peter, you opened my eyes to a half dozen things here, just what I
needed.

Much appreciated

Cheers
- Matt
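
For the record, the lynx -dump approach from the original question can probably also be made to skip the stray "GWS": accept an abbreviation only when the token just before it is a number. Below is a minimal sketch of that idea, not a replacement for the lxml version. The name build_ladder is reused from the original script, but its parameters, the sample string, and the assumption that each ladder row in the dump reads "<position> <abbreviation>" are illustrative guesses, not taken from the real page.

```python
def build_ladder(tokens, teams):
    """Map each team abbreviation to the ladder position printed before it."""
    ladder = {}
    # Walk over adjacent token pairs: (previous token, current token).
    for prev, cur in zip(tokens, tokens[1:]):
        # Accept an abbreviation only when the token before it is a number,
        # and keep the first such match for each team.
        if cur in teams and prev.isdigit() and cur not in ladder:
            ladder[cur] = int(prev)
    return ladder


# Hypothetical sample standing in for the lynx -dump text; the early "GWS"
# is preceded by "*", as in the output shown in the original question.
sample = "* GWS Giants news 10 GEEL 11 GWS 12 COLL"
teams = {"GEEL", "GWS", "COLL"}

# split() with no argument splits on any whitespace, including newlines.
ladder = build_ladder(sample.split(), teams)

for team, pos in sorted(ladder.items(), key=lambda item: item[1]):
    print(pos, "-", team)
```

This still depends on the layout of the text dump, so it would break if a number happens to precede an abbreviation elsewhere on the page. Parsing the HTML table, as Peter shows, is the more robust route.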