Re: Javascript website scraping using WebKit and Selenium tools

From dieter <dieter@handshake.de>
Subject Re: Javascript website scraping using WebKit and Selenium tools
Date Thu, 02 Jul 2015 07:48:20 +0200
Newsgroups comp.lang.python
Message-ID <mailman.235.1435816113.3674.python-list@python.org>


Veek M <vek.m1234@gmail.com> writes:

> I tried scraping a javascript website using two tools, both didn't work. The 
> website link is: http://xdguo.taobao.com/category-499399872.htm The relevant 
> text I'm trying to extract is 'GY-68...':
>
> <div class="item3line1">
>
>     <dl class="item " data-id="38952795780">
>         <dt class="photo">
>             <a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-
> c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-
> id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
>                 <img 
> src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-
> item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 ?? BOSCH?? ??????? ??
> BMP085"></img>
>             </a>
>         </dt>

> ...

When I try to access the link above, I am redirected to a
login page - which, of course, may look very different from what you expect.
You may need to pass authentication information along with
your request in order to get the page you are expecting.
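
A minimal sketch of how one might check for such a redirect and send
session cookies, using only the standard library. The helper names
and the idea of copying a Cookie header from a logged-in browser
session are assumptions for illustration; the site may well require
more than that.

```python
import urllib.request
from urllib.parse import urlparse


def was_redirected_elsewhere(requested_url, final_url):
    """Return True if the final URL has a different host or path."""
    requested, final = urlparse(requested_url), urlparse(final_url)
    return (requested.netloc, requested.path) != (final.netloc, final.path)


def fetch(url, cookie_header=None):
    """Fetch url, optionally sending a Cookie header.

    Returns (final_url, body_bytes). cookie_header is assumed to be a
    string copied from an authenticated browser session.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    if cookie_header:
        request.add_header("Cookie", cookie_header)
    with urllib.request.urlopen(request, timeout=30) as response:
        # geturl() reports the URL actually served, after any redirects
        return response.geturl(), response.read()
```

Comparing the URL passed to fetch() with the final URL it returns
shows whether the server sent you somewhere else, such as a login
page, instead of the page you asked for.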

Note also that today's sites often use Javascript heavily - which
means that a page only gets its final look once the Javascript
has been executed.
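
As a rough sketch of getting the HTML after the Javascript has run,
the function below drives a browser object and returns its rendered
source. It assumes a Selenium-style driver (get, execute_script,
page_source); the function name and the polling approach are mine.
Note that document.readyState only says the initial load finished;
content fetched later by scripts may need an additional wait.

```python
import time


def rendered_html(driver, url, timeout=30.0, poll=0.5):
    """Load url in the given browser driver and return the rendered HTML.

    driver is assumed to behave like a Selenium WebDriver.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Ask the browser itself whether the document has finished loading
        if driver.execute_script("return document.readyState") == "complete":
            break
        time.sleep(poll)
    return driver.page_source


# Possible usage, assuming Selenium and Firefox are installed:
#
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   try:
#       page = rendered_html(driver, "http://xdguo.taobao.com/category-499399872.htm")
#   finally:
#       driver.quit()
```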


Once the problems of getting the "final" HTML code are solved,
I would use "lxml" and its "xpath" support to locate any
relevant HTML information.
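
For illustration, a small lxml/xpath sketch against a trimmed copy of
the markup quoted above. The href, src and alt values are shortened
placeholders, and the XPath expression is only one plausible way to
reach the "alt" text; the real page may be structured differently.

```python
from lxml import html

# Trimmed from the quoted snippet; attribute values are shortened placeholders.
SAMPLE = """
<div class="item3line1">
  <dl class="item " data-id="38952795780">
    <dt class="photo">
      <a target="_blank" href="//item.taobao.com/item.htm?id=38952795780">
        <img src="//img.alicdn.com/example.jpg" alt="GY-68 BMP180 BMP085">
      </a>
    </dt>
  </dl>
</div>
"""


def item_titles(page_source):
    """Return the alt text of every product photo in the given HTML."""
    tree = html.fromstring(page_source)
    # contains() is used because the class attribute is "item " with a trailing space
    return tree.xpath('//dl[contains(@class, "item")]/dt[@class="photo"]//img/@alt')


print(item_titles(SAMPLE))
```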
