Date: Sun, 15 Jan 2012 15:03:18 -0800
Subject: Re: why i can get nothing?
From: Robert Helmer
To: contro opinion
Cc: python-list
Newsgroups: comp.lang.python

On Sat, Jan 14, 2012 at 7:54 PM, contro opinion wrote:
> here is my code :
> import urllib
> import lxml.html
> down='http://download.v.163.com/dl/open/00DL0QDR0QDS0QHH.html'
> file=urllib.urlopen(down).read()
> root=lxml.html.document_fromstring(file)
> tnodes = root.xpath("//a/@href[contains(string(),'mp4')]")
> for i,add in enumerate(tnodes):
>     print i,add
>
> why i can get nothing?

The problem is the document. The links you are trying to match on are inside the script tags in the document; here's a simplified version:

"""
"""

So the anchor elements are not part of the DOM as far as lxml is concerned. lxml does not know how to parse JavaScript (and even if it did, it would have to execute the JS, and the JS would have to modify the DOM, before you could get these links via xpath).

You could have lxml return just the script nodes that contain the text you care about:

tnodes = root.xpath("//script[contains(.,'mp4')]")

Then you will need a different tool for the rest of this. A regex is not perfect but should be good enough. It is probably not worth the effort to use a real JavaScript parser if you're just trying to scrape the mp4 links out of this, but it's an option.
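
As a rough sketch of the regex step: this assumes the links show up in the script text as quoted absolute URLs ending in .mp4, which may not be exactly how the real page writes them. The extract_mp4_links name and the sample string are made up for illustration, and the print call uses Python 3 syntax.

```python
import re

# Assumption: each link appears in the script text as a quoted absolute
# URL ending in ".mp4". Adjust the pattern if the real page differs.
MP4_URL = re.compile(r"""["'](https?://[^"'\s]+?\.mp4)["']""")


def extract_mp4_links(script_text):
    """Return the unique .mp4 URLs found in script_text, in first-seen order."""
    seen = []
    for url in MP4_URL.findall(script_text):
        if url not in seen:
            seen.append(url)
    return seen


# Hypothetical script content, invented for this example only.
sample = '''
document.write("<a href='http://example.com/video/part1.mp4'>part 1</a>");
document.write("<a href='http://example.com/video/part2.mp4'>part 2</a>");
document.write("<a href='http://example.com/video/part1.mp4'>again</a>");
'''

links = extract_mp4_links(sample)
for i, link in enumerate(links):
    print(i, link)
```

With lxml you would call this on the text of each script node returned by the xpath above (for example node.text_content()) instead of on the sample string.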