| Started by | Veek M <vek.m1234@gmail.com> |
|---|---|
| First post | 2015-07-02 06:41 +0530 |
| Last post | 2015-07-02 12:01 +0530 |
| Articles | 3 — 2 participants |
Javascript website scraping using WebKit and Selenium tools Veek M <vek.m1234@gmail.com> - 2015-07-02 06:41 +0530
Re: Javascript website scraping using WebKit and Selenium tools dieter <dieter@handshake.de> - 2015-07-02 07:48 +0200
Re: Javascript website scraping using WebKit and Selenium tools Veek M <vek.m1234@gmail.com> - 2015-07-02 12:01 +0530
| From | Veek M <vek.m1234@gmail.com> |
|---|---|
| Date | 2015-07-02 06:41 +0530 |
| Subject | Javascript website scraping using WebKit and Selenium tools |
| Message-ID | <mn231m$n5h$1@dont-email.me> |
I tried scraping a JavaScript website using two tools; neither worked. The
website link is: http://xdguo.taobao.com/category-499399872.htm The relevant
text I'm trying to extract is 'GY-68...':
<div class="item3line1">
  <dl class="item " data-id="38952795780">
    <dt class="photo">
      <a target="_blank"
         href="//item.taobao.com/item.htm?spm=a1z10.5-c.w4002-6778075404.11.54MDOI&id=38952795780"
         data-spm-wangpu-module-id="4002-6778075404"
         data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
        <img src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-item_pic.jpg_240x240.jpg"
             alt="GY-68 BMP180 ?? BOSCH?? ??????? ?? BMP085"></img>
      </a>
    </dt>
I'm trying to match the class="item " bit as a preliminary venture:
from pyvirtualdisplay import Display
from selenium import webdriver
import time
display = Display(visible=0, size=(800, 600))
display.start()
browser = webdriver.Firefox()
browser.get('http://xdguo.taobao.com/category-499399872.htm')
print browser.title
time.sleep(120)
content = browser.find_element_by_class_name('item ')
print content
browser.quit()
display.stop()
I get:
selenium.common.exceptions.NoSuchElementException: Message: Unable to
locate element: {"method":"class name","selector":"item "}
I also tried using WebKit. I know the site renders okay in WebKit because I
tested with rekonq. Here, I get the page (in Chinese) but the actual/relevant
data is not there. WebKit's supposed to run the JavaScript and give me the
final results but I don't think that's happening.
import sys
from io import StringIO
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
from lxml import etree

#Take this class for granted.Just use result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://xdguo.taobao.com/category-499399872.htm'
r = Render(url) #returns a Render object
result = r.frame.toHtml() #returns a QString
result_utf8 = result.toUtf8() #returns a QByteArray of utf8 data
#QByteArray->str->unicode
#contents = StringIO(unicode(result_utf8.data(), "utf-8"))
data = result_utf8.data() #returns byte string
print(data)
element = html.fromstring(data)
print(element.tag)
for img in element.xpath('//dl[@class="item "]/dt[@class="photo"]/a/img'):
    print(img.get('alt'))
#archive_links = html.fromstring(str(result.toAscii()))
#print archive_links.xpath("/html/body/div[2]/div[3]/div[2]/div[2]/div[1]/div/div/div/div/div/div[2]/div[2]/dl[1]/dt/a/img")
Basically I want a list of parts the seller has to offer that I can grep,
sort, uniq. I also tried elinks and lynx with ECMAScript but that was too
basic and didn't work.
| From | dieter <dieter@handshake.de> |
|---|---|
| Date | 2015-07-02 07:48 +0200 |
| Message-ID | <mailman.235.1435816113.3674.python-list@python.org> |
| In reply to | #93398 |
Veek M <vek.m1234@gmail.com> writes:

> I tried scraping a javascript website using two tools, both didn't work. The
> website link is: http://xdguo.taobao.com/category-499399872.htm The relevant
> text I'm trying to extract is 'GY-68...':
>
> <div class="item3line1">
>
> <dl class="item " data-id="38952795780">
> <dt class="photo">
> <a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-
> c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-
> id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
> <img
> src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-
> item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 ?? BOSCH?? ??????? ??
> BMP085"></img>
> </a>
> </dt>
> ...

When I try to access the link above, I am redirected to a login page -
which, of course, may look very different from what you expect.
You may need to pass on authentication information along with your
request in order to get the page you are expecting.

Note also, that today's sites often heavily use Javascript - which means
that a page only gets the final look when the Javascript has been executed.

Once the problems to get the "final" HTML code solved,
I would use "lxml" and its "xpath" support to locate any
relevant HTML information.
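
For reference, the lxml/XPath step suggested here might look roughly like the sketch below. It is only an illustration, not code from the thread: the markup is a trimmed copy of the snippet quoted above (the shortened `href`, `src` and `alt` values are placeholders), and it is written for Python 3 rather than the Python 2 used in the original post. It matches the class by token, so the trailing space in `class="item "` does not matter.

```python
from lxml import html

# Trimmed copy of the markup quoted in the thread. The href, src and alt
# values are shortened placeholders, not the real ones.
SNIPPET = '''
<div class="item3line1">
  <dl class="item " data-id="38952795780">
    <dt class="photo">
      <a target="_blank" href="//item.taobao.com/item.htm?id=38952795780">
        <img src="//img.alicdn.com/example.jpg" alt="GY-68 BMP180 BMP085">
      </a>
    </dt>
  </dl>
</div>
'''

tree = html.fromstring(SNIPPET)

# Token-based class test: matches class="item", class="item " and
# class="item last", unlike an exact comparison on the raw attribute string.
ITEM_IMG = ('//dl[contains(concat(" ", normalize-space(@class), " "), " item ")]'
            '/dt[@class="photo"]/a/img')

# Collect the alt text of every product image found under an "item" <dl>.
alts = [img.get('alt') for img in tree.xpath(ITEM_IMG)]
print(alts)
```

The exact comparison `@class="item "` used in the original post also works on this markup, but it would stop matching if the site dropped the trailing space or added a second class.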
| From | Veek M <vek.m1234@gmail.com> |
|---|---|
| Date | 2015-07-02 12:01 +0530 |
| Message-ID | <mn2lpu$7fg$1@dont-email.me> |
| In reply to | #93412 |
dieter wrote:
> Once the problems to get the "final" HTML code solved,
> I would use "lxml" and its "xpath" support to locate any
> relevant HTML information.
Hello Dieter, yes - you are correct. (though I don't think there's any auth
to browse - nice that you actually tried) He's using JSONP and updating his
HTML. I decided to manually mangle it.
urllib to download, re to nuke the jsonp(".........stuff i want......") and
then lxml. It works and I got the text. Now I need to translate - many
thanks.
I should have checked first using HTTP Headers to see what he was
downloading - I'm an ass. Oh well, solved :)
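
The "nuke the jsonp(...)" step described above could be sketched as follows. This is a minimal illustration, not the code actually used: the callback name `jsonp123` and the payload are hypothetical, since the real response is not shown in the thread, and the download step (urllib) is left out so the example runs offline. It is written for Python 3.

```python
import json
import re

# Hypothetical JSONP response: the callback name and the payload are made up
# for illustration; the real response from the site is not shown in the thread.
SAMPLE_JSONP = ('jsonp123("<dl class=\\"item \\"><dt class=\\"photo\\">'
                '<img alt=\\"GY-68 BMP180\\"></dt></dl>")')

# Match callback( ... ) with an optional trailing semicolon.
JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def strip_jsonp(text):
    """Return the decoded argument of a JSONP call such as cb(<json>)."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    # The argument is JSON; here it is a string literal holding HTML.
    return json.loads(match.group(1))


html_fragment = strip_jsonp(SAMPLE_JSONP)
print(html_fragment)
```

The decoded string can then be handed to `lxml.html.fromstring` for the XPath step.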