Re: why i can get nothing?

Date Sun, 15 Jan 2012 15:03:18 -0800
Subject Re: why i can get nothing?
From Robert Helmer <robert@roberthelmer.com>



On Sat, Jan 14, 2012 at 7:54 PM, contro opinion <contropinion@gmail.com> wrote:
> here is my code :
> import urllib
> import lxml.html
> down='http://download.v.163.com/dl/open/00DL0QDR0QDS0QHH.html'
> file=urllib.urlopen(down).read()
> root=lxml.html.document_fromstring(file)
> tnodes = root.xpath("//a/@href[contains(string(),'mp4')]")
> for i,add in enumerate(tnodes):
>     print  i,add
>
> why i can get nothing?


The problem is the document itself. The links you are trying to match
are inside script tags in the document. Here's a simplified version:

"""
<script>
  obj="<a href='blah.mp4'>";
</script>
"""

So the anchor elements are not part of the DOM as far as lxml is
concerned. lxml does not know how to parse JavaScript, and even if it
did, it would have to execute the JS, and the JS would have to modify
the DOM, before you could get these links via XPath.
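
To see the effect without fetching the page, here is a minimal sketch
that uses the standard library's html.parser as a stand-in for lxml
(an assumption: the two parsers differ internally, but both treat the
contents of a script element as raw text rather than markup). The
TagCollector class and start_tags helper are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup):
    """Return the start tags found in markup, in document order."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# The anchor below lives inside a JavaScript string, so the parser
# hands it over as script text and never reports an <a> element.
inside_script = """<script>
  obj="<a href='blah.mp4'>";
</script>"""

# The same anchor written as real markup is reported as an element.
plain_anchor = "<a href='blah.mp4'>video</a>"

print(start_tags(inside_script))
print(start_tags(plain_anchor))
```

Only the second call reports an anchor, which is why an XPath that
starts with //a finds nothing in the original page.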

You could have lxml return just the script nodes that contain the text
you care about:
tnodes = root.xpath("//script[contains(.,'mp4')]")

Then you will need a different tool for the rest of this. A regex is
not perfect but should be good enough. It is probably not worth the
effort to use a real JavaScript parser if you're just trying to scrape
the mp4 links out of this, but it's an option.
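
A rough sketch of the regex step, assuming you already have the text of
each matching script node as a plain string (with lxml you could get it
from each node returned by the //script query, for example via
text_content()). The pattern and the extract_mp4_links name are
illustrative guesses, not something tuned against the real page, which
may quote or escape its href values differently.

```python
import re

# Capture the value of any quoted href attribute that mentions "mp4".
# Assumes plain single or double quotes with no backslash escaping.
MP4_HREF = re.compile(r"""href=['"]([^'"]*mp4[^'"]*)['"]""")


def extract_mp4_links(script_texts):
    """Collect mp4 href values from an iterable of script source strings."""
    links = []
    for text in script_texts:
        links.extend(MP4_HREF.findall(text))
    return links


# Script text taken from the simplified example in this message.
sample = ["""obj="<a href='blah.mp4'>";"""]
print(extract_mp4_links(sample))
```

If the page builds its URLs by string concatenation in the JS, a
pattern like this will miss them, and that is where a real JavaScript
parser would start to pay off.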
