| Path | csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!1.eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!post.news.xs4all.nl!not-for-mail |
|---|---|
| To | python-list@python.org |
| From | dieter <dieter@handshake.de> |
| Subject | Re: Javascript website scraping using WebKit and Selenium tools |
| Date | Thu, 02 Jul 2015 07:48:20 +0200 |
| References | <mn231m$n5h$1@dont-email.me> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.235.1435816113.3674.python-list@python.org> (permalink) |
Veek M <vek.m1234@gmail.com> writes:

> I tried scraping a javascript website using two tools, both didn't work. The
> website link is: http://xdguo.taobao.com/category-499399872.htm The relevant
> text I'm trying to extract is 'GY-68...':
>
> <div class="item3line1">
>
> <dl class="item " data-id="38952795780">
> <dt class="photo">
> <a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
> <img src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 ?? BOSCH?? ??????? ?? BMP085"></img>
> </a>
> </dt>
> ...

When I try to access the link above, I am redirected to a login page, which, of course, may look very different from what you expect. You may need to pass authentication information along with your request in order to get the page you are expecting.

Note also that today's sites often rely heavily on JavaScript, which means that a page only gets its final look once the JavaScript has been executed.

Once the problems of getting the "final" HTML code are solved, I would use "lxml" and its "xpath" support to locate the relevant HTML information.
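To make the last suggestion concrete, here is a minimal sketch of the "lxml" plus "xpath" step, assuming the final HTML is already available as a string (for example from a browser-driving tool after login and after the JavaScript has run; that part is not shown). The fragment is the one quoted above; the names `SNIPPET` and `extract_items` are made up for this illustration, and the XPath expressions are only one guess at which elements matter on the real page.

```python
from lxml import html  # third-party package: pip install lxml

# HTML fragment copied from the quoted post. On the real site this would be
# the fully rendered page source, which this sketch does not fetch.
SNIPPET = """
<div class="item3line1">
<dl class="item " data-id="38952795780">
<dt class="photo">
<a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-c.w4002-6778075404.11.54MDOI&id=38952795780">
<img src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-item_pic.jpg_240x240.jpg"
     alt="GY-68 BMP180 ?? BOSCH?? ??????? ?? BMP085"></img>
</a>
</dt>
</dl>
</div>
"""


def extract_items(page_source):
    """Return a list of (data-id, image alt text) pairs, one per <dl class="item">."""
    tree = html.fromstring(page_source)
    items = []
    # Match "item" as a whole class token; the attribute value in the quoted
    # fragment is "item " with a trailing space, so an exact comparison fails.
    item_xpath = '//dl[contains(concat(" ", normalize-space(@class), " "), " item ")]'
    for dl in tree.xpath(item_xpath):
        alts = dl.xpath('.//dt[@class="photo"]//img/@alt')
        items.append((dl.get("data-id"), alts[0] if alts else None))
    return items


if __name__ == "__main__":
    for data_id, alt in extract_items(SNIPPET):
        print(data_id, alt)
```

If the alt text turns out not to be the right source for the product name, only the inner XPath expression needs to change; the loop over the `dl` elements stays the same.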
Javascript website scraping using WebKit and Selenium tools Veek M <vek.m1234@gmail.com> - 2015-07-02 06:41 +0530
Re: Javascript website scraping using WebKit and Selenium tools dieter <dieter@handshake.de> - 2015-07-02 07:48 +0200
Re: Javascript website scraping using WebKit and Selenium tools Veek M <vek.m1234@gmail.com> - 2015-07-02 12:01 +0530