Groups > comp.lang.python > #107942 > unrolled thread

Python3 html scraper that supports javascript

Started by: zljubisic@gmail.com
First post: 2016-05-01 07:19 -0700
Last post: 2016-05-02 13:11 -0700
Articles: 6 (4 participants)



Contents

  Python3 html scraper that supports javascript zljubisic@gmail.com - 2016-05-01 07:19 -0700
    Re: Python3 html scraper that supports javascript Bob Gailer <bgailer@gmail.com> - 2016-05-01 13:01 -0400
      Re: Python3 html scraper that supports javascript zljubisic@gmail.com - 2016-05-02 08:33 -0700
        Re: Python3 html scraper that supports javascript DFS <nospam@dfs.com> - 2016-05-02 12:39 -0400
        Re: Python3 html scraper that supports javascript Stephen Hansen <me+python@ixokai.io> - 2016-05-02 11:00 -0700
          Re: Python3 html scraper that supports javascript zljubisic@gmail.com - 2016-05-02 13:11 -0700

#107942 — Python3 html scraper that supports javascript

From: zljubisic@gmail.com
Date: 2016-05-01 07:19 -0700
Subject: Python3 html scraper that supports javascript
Message-ID: <2a0c92ed-352d-455c-832d-c9a9438f318b@googlegroups.com>

Hi,

can you please recommend to me a python3 library that I can use for scraping JS that works on windows as well as linux?

Regards.



#107946

From: Bob Gailer <bgailer@gmail.com>
Date: 2016-05-01 13:01 -0400
Message-ID: <mailman.285.1462122077.32212.python-list@python.org>
In reply to: #107942

On May 1, 2016 10:20 AM, <zljubisic@gmail.com> wrote:
>
> Hi,
>
> can you please recommend to me a python3 library that I can use for
> scraping JS

I'm not sure what you mean by that. The tool I use is Splinter. Install it
using pip.

> that works on windows as well as linux?
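
For readers who have not used Splinter, here is a minimal sketch of how it could be used to get HTML after the page's JavaScript has run. It assumes Splinter's documented `visit`, `is_element_present_by_css` and `html` members; the function name `rendered_html` and the example URL are made up for illustration and are not from this thread.

```python
def rendered_html(browser, url, css, wait_time=10):
    """Load `url`, wait for an element matching `css`, return the rendered HTML.

    `browser` is expected to be a splinter.Browser instance (an assumption;
    any object with the same three members works).
    """
    browser.visit(url)
    # Splinter polls for up to `wait_time` seconds before giving up.
    if not browser.is_element_present_by_css(css, wait_time=wait_time):
        raise RuntimeError("%r did not appear within %s seconds" % (css, wait_time))
    return browser.html


# Usage (requires `pip install splinter` plus a browser and its driver):
#   from splinter import Browser
#   with Browser("firefox") as browser:
#       print(rendered_html(browser, "https://example.invalid/", "video"))
```

Passing the browser in as a parameter keeps the helper independent of which browser backend is installed.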



#108016

From: zljubisic@gmail.com
Date: 2016-05-02 08:33 -0700
Message-ID: <d8db7fec-0083-44ef-8f5b-73d097789b9b@googlegroups.com>
In reply to: #107946

I tried to use the following code:

from bs4 import BeautifulSoup
from selenium import webdriver

PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'

url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'

browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get(url)

soup = BeautifulSoup(browser.page_source, "html.parser")

x = soup.prettify()

print(x)


When I print the variable x, I would expect to see something like this:
<video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar">
</video>

but I can't get to that point.

Regards.
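
For reference, once the rendered HTML does contain markup like the `<video>` element above, extracting the playlist URL is the easy part. A minimal sketch using only the standard library's `html.parser` (BeautifulSoup would work equally well); the names `SourceCollector` and `find_video_sources` and the `example.invalid` URL are illustrative and not part of the original code.

```python
from html.parser import HTMLParser


class SourceCollector(HTMLParser):
    """Collect (type, src) pairs from <source> tags nested inside <video>."""

    def __init__(self):
        super().__init__()
        self.in_video = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self.in_video = True
        elif tag == "source" and self.in_video:
            self.sources.append((attrs.get("type"), attrs.get("src")))

    def handle_endtag(self, tag):
        if tag == "video":
            self.in_video = False


def find_video_sources(html):
    """Return a list of (type, src) tuples for every <video><source> found."""
    parser = SourceCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Placeholder markup shaped like the expected output; the URL is invented.
sample = (
    '<video id="vjs_video_3_html5_api" class="vjs-tech" preload="none">'
    '<source type="application/x-mpegURL" '
    'src="https://example.invalid/playlist.m3u8?foo=bar">'
    '</video>'
)
print(find_video_sources(sample))
```

If the list comes back empty for `browser.page_source`, the player markup was not in the DOM at the moment the page source was read.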



#108019

From: DFS <nospam@dfs.com>
Date: 2016-05-02 12:39 -0400
Message-ID: <ng7vln$n2c$1@dont-email.me>
In reply to: #108016

On 5/2/2016 11:33 AM, zljubisic@gmail.com wrote:
>
>
> I tried to use the following code:
>
> from bs4 import BeautifulSoup
> from selenium import webdriver
>
> PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
>
> url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'
>
> browser = webdriver.PhantomJS(PHANTOMJS_PATH)
> browser.get(url)
>
> soup = BeautifulSoup(browser.page_source, "html.parser")
>
> x = soup.prettify()
>
> print(x)
>
>
> When I print x variable, I would expect to see something like this:
> <video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar">
> </video>
>
> but I can't come to that point.
>
> Regards.


I was doing something similar recently.  Try this:

from bs4 import BeautifulSoup

# Parse a saved HTML file and pretty-print it (Python 3 with bs4).
with open(somefilename) as f:
    soup = BeautifulSoup(f, "html.parser")
print(soup.prettify())



#108024

From: Stephen Hansen <me+python@ixokai.io>
Date: 2016-05-02 11:00 -0700
Message-ID: <mailman.327.1462212027.32212.python-list@python.org>
In reply to: #108016

On Mon, May 2, 2016, at 08:33 AM, zljubisic@gmail.com wrote:
> I tried to use the following code:
> 
> from bs4 import BeautifulSoup
> from selenium import webdriver
> 
> PHANTOMJS_PATH =
> 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
> 
> url =
> 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'
> 
> browser = webdriver.PhantomJS(PHANTOMJS_PATH)
> browser.get(url)
> 
> soup = BeautifulSoup(browser.page_source, "html.parser")
> 
> x = soup.prettify()
> 
> print(x)
> 
> When I print x variable, I would expect to see something like this:
> <video
> src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58"
> id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source
> type="application/x-mpegURL"
> src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar">
> </video>
> 
> but I can't come to that point.

Why? As important as it is to show code, you need to show what actually
happens and what error message is produced.

-- 
Stephen Hansen
  m e @ i x o k a i . i o



#108030

From: zljubisic@gmail.com
Date: 2016-05-02 13:11 -0700
Message-ID: <3d4a30a6-44bc-42bb-a3ef-4c993240558a@googlegroups.com>
In reply to: #108024

> Why? As important as it is to show code, you need to show what actually
> happens and what error message is produced.

If you run the code, you will see that the HTML I get doesn't contain a link to the flash video. I probably need to do something first (press the play button, maybe) in order to get HTML that references the video file on this page.

Regards
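
One way to approach this with Selenium is to click the player's start control and then poll until the `<source>` element exists before reading `page_source`. A rough sketch under stated assumptions: `play_css` is a hypothetical selector that would have to be found by inspecting the page (video.js players commonly use `.vjs-big-play-button`, but whether this site does is a guess), and the helper names are made up. The code relies only on the driver's `get`, `find_elements` and `page_source` members and passes the locator strategy as the plain string `"css selector"`, which is the value of Selenium's `By.CSS_SELECTOR`.

```python
import time


def wait_for_element(browser, css, timeout=15.0, interval=0.5):
    """Poll until an element matching `css` exists; return it or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        found = browser.find_elements("css selector", css)
        if found:
            return found[0]
        if time.monotonic() >= deadline:
            raise TimeoutError("no element matched %r within %s s" % (css, timeout))
        time.sleep(interval)


def html_after_play(browser, url, play_css, video_css="video source", timeout=15.0):
    """Open `url`, click the play control, wait for the video source, return the HTML."""
    browser.get(url)
    # The play control itself may be created by JavaScript, so wait for it too.
    wait_for_element(browser, play_css, timeout).click()
    wait_for_element(browser, video_css, timeout)
    return browser.page_source


# Usage (needs Selenium and a browser driver; the selector is an assumption):
#   from selenium import webdriver
#   browser = webdriver.Firefox()
#   try:
#       html = html_after_play(browser, url, ".vjs-big-play-button")
#   finally:
#       browser.quit()
```

Selenium's own `WebDriverWait` with `expected_conditions` does the same polling; the hand-rolled loop is used here only to keep the sketch free of extra imports.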
