| Started by | zljubisic@gmail.com |
|---|---|
| First post | 2016-05-01 07:19 -0700 |
| Last post | 2016-05-02 13:11 -0700 |
| Articles | 6 — 4 participants |
Python3 html scraper that supports javascript zljubisic@gmail.com - 2016-05-01 07:19 -0700
Re: Python3 html scraper that supports javascript Bob Gailer <bgailer@gmail.com> - 2016-05-01 13:01 -0400
Re: Python3 html scraper that supports javascript zljubisic@gmail.com - 2016-05-02 08:33 -0700
Re: Python3 html scraper that supports javascript DFS <nospam@dfs.com> - 2016-05-02 12:39 -0400
Re: Python3 html scraper that supports javascript Stephen Hansen <me+python@ixokai.io> - 2016-05-02 11:00 -0700
Re: Python3 html scraper that supports javascript zljubisic@gmail.com - 2016-05-02 13:11 -0700
| From | zljubisic@gmail.com |
|---|---|
| Date | 2016-05-01 07:19 -0700 |
| Subject | Python3 html scraper that supports javascript |
| Message-ID | <2a0c92ed-352d-455c-832d-c9a9438f318b@googlegroups.com> |
Hi, can you please recommend a Python 3 library that I can use for scraping JS and that works on Windows as well as Linux? Regards.
| From | Bob Gailer <bgailer@gmail.com> |
|---|---|
| Date | 2016-05-01 13:01 -0400 |
| Message-ID | <mailman.285.1462122077.32212.python-list@python.org> |
| In reply to | #107942 |
On May 1, 2016 10:20 AM, <zljubisic@gmail.com> wrote:
> Hi,
>
> can you please recommend to me a python3 library that I can use for scrapping JS

I'm not sure what you mean by that. The tool I use is Splinter. Install it using pip.

> that works on windows as well as linux?
| From | zljubisic@gmail.com |
|---|---|
| Date | 2016-05-02 08:33 -0700 |
| Message-ID | <d8db7fec-0083-44ef-8f5b-73d097789b9b@googlegroups.com> |
| In reply to | #107946 |
I tried to use the following code:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'

    url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'

    browser = webdriver.PhantomJS(PHANTOMJS_PATH)
    browser.get(url)

    soup = BeautifulSoup(browser.page_source, "html.parser")

    x = soup.prettify()

    print(x)

When I print the variable x, I would expect to see something like this:

    <video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar">
    </video>

but I can't get to that point.

Regards.
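One way to check what the fetched page actually contains is to scan it for video-related tags instead of reading the prettified output by eye. The following is only a minimal sketch using the standard library's html.parser; the names VideoSourceFinder and find_video_sources are made up for illustration and are not part of BeautifulSoup or Selenium.

```python
from html.parser import HTMLParser


class VideoSourceFinder(HTMLParser):
    """Collect the src attribute of every <video> and <source> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append((tag, src))


def find_video_sources(html):
    """Return a list of (tag, src) pairs found in the given HTML string."""
    finder = VideoSourceFinder()
    finder.feed(html)
    finder.close()
    return finder.sources
```

Calling find_video_sources(browser.page_source) on the page from the post would give an empty list if the player markup has not been rendered yet, which is a quick way to confirm that the problem is in the rendering step and not in the parsing step.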
| From | DFS <nospam@dfs.com> |
|---|---|
| Date | 2016-05-02 12:39 -0400 |
| Message-ID | <ng7vln$n2c$1@dont-email.me> |
| In reply to | #108016 |
On 5/2/2016 11:33 AM, zljubisic@gmail.com wrote:
> I tried to use the following code:
>
> from bs4 import BeautifulSoup
> from selenium import webdriver
>
> PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
>
> url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'
>
> browser = webdriver.PhantomJS(PHANTOMJS_PATH)
> browser.get(url)
>
> soup = BeautifulSoup(browser.page_source, "html.parser")
>
> x = soup.prettify()
>
> print(x)
>
> When I print x variable, I would expect to see something like this:
> <video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar">
> </video>
>
> but I can't come to that point.
>
> Regards.

I was doing something similar recently. Try this:

    from bs4 import BeautifulSoup

    # somefilename: path to a saved copy of the page
    with open(somefilename) as f:
        soup = BeautifulSoup(f, "html.parser")
    print(soup.prettify())
| From | Stephen Hansen <me+python@ixokai.io> |
|---|---|
| Date | 2016-05-02 11:00 -0700 |
| Message-ID | <mailman.327.1462212027.32212.python-list@python.org> |
| In reply to | #108016 |
On Mon, May 2, 2016, at 08:33 AM, zljubisic@gmail.com wrote:
> I tried to use the following code:
>
> from bs4 import BeautifulSoup
> from selenium import webdriver
>
> PHANTOMJS_PATH = 'C:\\Users\\Zoran\\Downloads\\Obrisi\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe'
>
> url = 'https://hrti.hrt.hr/#/video/show/2203605/trebizat-prica-o-jednoj-vodi-i-jednom-narodu-dokumentarni-film'
>
> browser = webdriver.PhantomJS(PHANTOMJS_PATH)
> browser.get(url)
>
> soup = BeautifulSoup(browser.page_source, "html.parser")
>
> x = soup.prettify()
>
> print(x)
>
> When I print x variable, I would expect to see something like this:
> <video src="mediasource:https://hrti.hrt.hr/2e9e9c45-aa23-4d08-9055-cd2d7f2c4d58" id="vjs_video_3_html5_api" class="vjs-tech" preload="none"><source type="application/x-mpegURL" src="https://prd-hrt.spectar.tv/player/get_smil/id/2203605/video_id/2203605/token/Cny6ga5VEQSJ2uZaD2G8pg/token_expiration/1462043309/asset_type/Movie/playlist_template/nginx/channel_name/trebiat__pria_o_jednoj_vodi_i_jednom_narodu_dokumentarni_film/playlist.m3u8?foo=bar">
> </video>
>
> but I can't come to that point.

Why? As important as it is to show code, you need to show what actually happens and what error message is produced.

--
Stephen Hansen
m e @ i x o k a i . i o
| From | zljubisic@gmail.com |
|---|---|
| Date | 2016-05-02 13:11 -0700 |
| Message-ID | <3d4a30a6-44bc-42bb-a3ef-4c993240558a@googlegroups.com> |
| In reply to | #108024 |
> Why? As important as it is to show code, you need to show what actually
> happens and what error message is produced.

If you run the code, you will see that the HTML I get doesn't contain a link to the Flash video. I probably need to do something first (maybe press the play button) in order to get HTML that references the video file on this page.

Regards
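The "do something first" step described above can be sketched with Selenium's explicit waits: load the page, optionally click a play control, then wait until a video element exists before reading page_source. This is only a sketch under several assumptions. It uses headless Firefox rather than PhantomJS, which is no longer maintained. The function name fetch_rendered_html and the play_selector parameter are hypothetical, and the thread does not say which selector (if any) matches the play button on this site, or whether a click is what makes the video markup appear.

```python
def fetch_rendered_html(url, timeout=20, play_selector=None):
    """Return page_source after a <video> element has appeared.

    play_selector is a hypothetical CSS selector for a play button; it is
    not taken from the site and has to be found by inspecting the page.
    """
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")  # assumption: Firefox and geckodriver are installed
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        wait = WebDriverWait(browser, timeout)
        if play_selector is not None:
            # Wait until the play control is clickable, then click it.
            wait.until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, play_selector))
            ).click()
        # Block until the page's JavaScript has inserted a <video> element.
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "video")))
        return browser.page_source
    finally:
        browser.quit()
```

If the wait times out, Selenium raises TimeoutException, which is the kind of concrete "what actually happens" detail asked for earlier in the thread.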