


Javascript website scraping using WebKit and Selenium tools

Started by: Veek M <vek.m1234@gmail.com>
First post: 2015-07-02 06:41 +0530
Last post: 2015-07-02 12:01 +0530
Articles: 3, 2 participants



Contents

  Javascript website scraping using WebKit and Selenium tools Veek M <vek.m1234@gmail.com> - 2015-07-02 06:41 +0530
    Re: Javascript website scraping using WebKit and Selenium tools dieter <dieter@handshake.de> - 2015-07-02 07:48 +0200
      Re: Javascript website scraping using WebKit and Selenium tools Veek M <vek.m1234@gmail.com> - 2015-07-02 12:01 +0530

#93398 — Javascript website scraping using WebKit and Selenium tools

From: Veek M <vek.m1234@gmail.com>
Date: 2015-07-02 06:41 +0530
Subject: Javascript website scraping using WebKit and Selenium tools
Message-ID: <mn231m$n5h$1@dont-email.me>

I tried scraping a JavaScript website using two tools; neither worked. The
website link is: http://xdguo.taobao.com/category-499399872.htm The relevant
text I'm trying to extract is 'GY-68...':

<div class="item3line1">

    <dl class="item " data-id="38952795780">
        <dt class="photo">
            <a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
                <img src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 ?? BOSCH?? ??????? ?? BMP085"></img>
            </a>
        </dt>

I'm trying to match the class="item " bit as a preliminary venture:

from pyvirtualdisplay import Display
from selenium import webdriver
import time

display = Display(visible=0, size=(800, 600))
display.start()

browser = webdriver.Firefox()
browser.get('http://xdguo.taobao.com/category-499399872.htm')
print browser.title

time.sleep(120)    
content = browser.find_element_by_class_name('item ')
print content
browser.quit()

display.stop()


I get:
    selenium.common.exceptions.NoSuchElementException: Message: Unable to 
locate element: {"method":"class name","selector":"item "}

I also tried using WebKit. I know the site renders okay in WebKit because I
tested with rekonq. Here, I get the page (in Chinese), but the actual/relevant
data is not there. WebKit is supposed to run the JavaScript and give me the
final results, but I don't think that's happening.

import sys
from io import StringIO
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
from lxml import etree

# Take this class for granted; just use the result of rendering.
class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://xdguo.taobao.com/category-499399872.htm'
r = Render(url) #returns a Render object
result = r.frame.toHtml() #returns a QString
result_utf8 = result.toUtf8() #returns a QByteArray of utf8 data

#QByteArray->str->unicode
#contents = StringIO(unicode(result_utf8.data(), "utf-8"))
data = result_utf8.data() #returns byte string
print(data)

element = html.fromstring(data)
print(element.tag)

for img in element.xpath('//dl[@class="item "]/dt[@class="photo"]/a/img'):
    print(img.get('alt'))

#archive_links = html.fromstring(str(result.toAscii()))
#print archive_links.xpath("/html/body/div[2]/div[3]/div[2]/div[2]/div[1]/div/div/div/div/div/div[2]/div[2]/dl[1]/dt/a/img")

Basically I want a list of parts the seller has to offer that I can grep, 
sort, uniq. I also tried elinks and lynx with ECMAScript but that was too 
basic and didn't work.



#93412

From: dieter <dieter@handshake.de>
Date: 2015-07-02 07:48 +0200
Message-ID: <mailman.235.1435816113.3674.python-list@python.org>
In reply to: #93398

Veek M <vek.m1234@gmail.com> writes:

> I tried scraping a JavaScript website using two tools; neither worked. The
> website link is: http://xdguo.taobao.com/category-499399872.htm The relevant
> text I'm trying to extract is 'GY-68...':
>
> <div class="item3line1">
>
>     <dl class="item " data-id="38952795780">
>         <dt class="photo">
>             <a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
>                 <img src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 ?? BOSCH?? ??????? ?? BMP085"></img>
>             </a>
>         </dt>

> ...

When I try to access the link above, I am redirected to a
login page - which, of course, may look very different from what you expect.
You may need to pass on authentication information along with
your request in order to get the page you are expecting.

Note also that today's sites often rely heavily on JavaScript, which
means that a page only gets its final look once the JavaScript
has been executed.


Once the problem of getting the "final" HTML code is solved,
I would use "lxml" and its "xpath" support to locate any
relevant HTML information.
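
As a rough illustration of that last step, here is a minimal sketch that runs
the path expression from the original post against a simplified copy of the
quoted markup. It uses the standard library's xml.etree.ElementTree as a
stand-in for lxml (it supports only a subset of XPath and needs well-formed
input, so lxml.html would be the better fit for real pages). The SAMPLE
string, its shortened href/src values, and the image_alts helper are
assumptions made for the example, not part of the thread.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the markup quoted in the thread.
# The href and src values are shortened placeholders, not the real URLs.
SAMPLE = """
<div class="item3line1">
  <dl class="item " data-id="38952795780">
    <dt class="photo">
      <a target="_blank" href="//item.taobao.com/item.htm">
        <img src="placeholder.jpg" alt="GY-68 BMP180"></img>
      </a>
    </dt>
  </dl>
</div>
"""


def image_alts(markup):
    """Return the alt text of every product image in the fragment."""
    root = ET.fromstring(markup)
    # Same path as in the original post, made relative to the root element.
    # Note the trailing space in "item ": the predicate compares the whole
    # attribute value, so it has to match the markup exactly.
    path = './/dl[@class="item "]/dt[@class="photo"]/a/img'
    return [img.get("alt") for img in root.findall(path)]


alts = image_alts(SAMPLE)
print(alts)
```

With lxml the equivalent call would be element.xpath(...) on the parsed
document, as in the original post; the path only matches once the product
markup is actually present in the HTML being parsed.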



#93414

From: Veek M <vek.m1234@gmail.com>
Date: 2015-07-02 12:01 +0530
Message-ID: <mn2lpu$7fg$1@dont-email.me>
In reply to: #93412

dieter wrote:

> Once the problem of getting the "final" HTML code is solved,
> I would use "lxml" and its "xpath" support to locate any
> relevant HTML information.

Hello Dieter, yes, you are correct (though I don't think any auth is needed
to browse; nice that you actually tried). The site is using JSONP and
updating its HTML. I decided to mangle it manually.

I used urllib to download, re to nuke the jsonp(".........stuff i want......")
wrapper, and then lxml. It works and I got the text. Now I need to translate
it. Many thanks.
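
For anyone landing here later, the "nuke the jsonp(...)" step might look
roughly like the sketch below. It assumes the response body is a single
callback call whose argument is valid JSON (here a quoted string of HTML, as
described above). The callback name jsonp128, the sample payload, and the
strip_jsonp helper are made up for the example; the real response from the
site is not shown in this thread. Downloading is left out so the snippet runs
offline; in practice the text would come from urllib.

```python
import json
import re

# Matches callback( ... ) with an optional trailing semicolon and captures
# everything between the outermost parentheses.
JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def strip_jsonp(text):
    """Remove the callback wrapper and decode the JSON argument."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Hypothetical payload: a JSON string literal containing an HTML fragment.
sample = (
    r'jsonp128("<dl class=\"item \"><dt class=\"photo\">'
    r'<a><img alt=\"GY-68 BMP180\"></a></dt></dl>");'
)

html_fragment = strip_jsonp(sample)
print(html_fragment)
```

The decoded html_fragment is ordinary HTML, so it can be handed to
lxml.html.fromstring() and queried with the same xpath as in the first post.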

I should have checked first, using the HTTP headers, to see what the page was
downloading. I'm an ass. Oh well, solved :)
