Groups > comp.lang.python > #101204 > unrolled thread
| Started by | jonafleuraime@gmail.com |
|---|---|
| First post | 2016-01-03 03:03 -0800 |
| Last post | 2016-01-04 02:42 +1100 |
| Articles | 2 — 2 participants |
Ajax Request + Write to Json Extremely Slow (Webpage Crawler) jonafleuraime@gmail.com - 2016-01-03 03:03 -0800
Re: Ajax Request + Write to Json Extremely Slow (Webpage Crawler) Steven D'Aprano <steve@pearwood.info> - 2016-01-04 02:42 +1100
| From | jonafleuraime@gmail.com |
|---|---|
| Date | 2016-01-03 03:03 -0800 |
| Subject | Ajax Request + Write to Json Extremely Slow (Webpage Crawler) |
| Message-ID | <43ddcfac-c810-4f85-9b6b-806503ea2b3d@googlegroups.com> |
I'm editing a simple scraper that crawls a Youtube video's comment page. The crawler uses Ajax to page through comments on the page (infinite scroll) and then saves them to a json file. Even with small number of comments (< 5), it still takes 3+ min for the comments to be added to the json file.

I've tried including requests-cache and using ujson instead of json to see if there are any benefits but there's no noticeable difference.

You can view the code here:
http://stackoverflow.com/questions/34575586/how-to-speed-up-ajax-requests-python-youtube-scraper

I'm new to Python so I'm not sure where the bottlenecks are. The finished script will be used to parse through 100,000+ comments so performance is a large factor.

- Would using multithreading solve the issue? And if so how would I refactor this to benefit from it?
- Or is this strictly a network issue?

Thanks!
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-01-04 02:42 +1100 |
| Message-ID | <5689414a$0$1616$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #101204 |
On Sun, 3 Jan 2016 10:03 pm, jonafleuraime@gmail.com wrote:
> I'm editing a simple scraper that crawls a Youtube video's comment page.
> The crawler uses Ajax to page through comments on the page (infinite
> scroll) and then saves them to a json file. Even with small number of
> comments (< 5), it still takes 3+ min for the comments to be added to the
> json file.
>
> I've tried including requests-cache and using ujson instead of json to see
> if there are any benefits but there's no noticeable difference.
Before making random changes to the code to see if it speeds it up, try
running it under the profiler and see what it says.
https://pymotw.com/2/profile/index.html#module-profile
https://docs.python.org/2/library/profile.html
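As a rough sketch of what running code under the profiler could look like: the `fetch_page` and `crawl` functions below are hypothetical stand-ins (a sleep instead of a real request), not the code from the Stack Overflow question, and the sketch assumes Python 3. Only `cProfile` and `pstats` are the real standard-library modules.

```python
import cProfile
import io
import pstats
import time


def fetch_page(delay=0.01):
    """Hypothetical stand-in for one Ajax request; sleeps instead of using the network."""
    time.sleep(delay)
    return ["comment"] * 3


def crawl(pages=5):
    """Hypothetical stand-in for the scraper's main loop."""
    comments = []
    for _ in range(pages):
        comments.extend(fetch_page())
    return comments


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chains are listed first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    comments, report = profile_call(crawl)
    print(report)
```

If nearly all of the cumulative time lands in the function that does the download (here, the sleep inside `fetch_page`), the script is probably waiting on the network rather than on list handling or JSON writing.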
> You can view the code here:
>
> http://stackoverflow.com/questions/34575586/how-to-speed-up-ajax-requests-python-youtube-scraper
I see that you already have an answer that you should try using threads
since the process is I/O bound. (The time taken is supposedly dominated by
the time it takes to download data from the internet.) That may be true,
but I also see something which *may* be a warning sign:
    while page_token:
        [...]
        page_token, html = response
        reply_cids += extract_reply_cids(html)
`reply_cids` is a list, and repeatedly calling += on a list *may* be slow.
If += is implemented the naive way, as addition and assignment, it probably
will be slow. This may entirely be a red herring, but if it were my code,
I'd try replacing that last line with:
    reply_cids.extend(extract_reply_cids(html))
and see if it makes any difference. If it doesn't, you can keep the new
version or revert to the version using +=; it's entirely up to you.
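For what it's worth, in CPython `+=` on a list is handled in place by `list.__iadd__`, so it normally behaves like `extend`; the pattern that does get slow on long lists is `result = result + chunk`, which copies the whole list on every pass. A small sketch to check this on your own machine follows. The function names and the made-up data are assumptions for illustration only, not part of the scraper.

```python
import timeit


def build_with_iadd(chunks):
    """Accumulate with +=, which CPython lists handle in place."""
    result = []
    for chunk in chunks:
        result += chunk
    return result


def build_with_extend(chunks):
    """Accumulate with list.extend."""
    result = []
    for chunk in chunks:
        result.extend(chunk)
    return result


def build_with_concat(chunks):
    """Accumulate with result = result + chunk, which copies the list every pass."""
    result = []
    for chunk in chunks:
        result = result + chunk
    return result


def iadd_keeps_identity():
    """Return True if += mutated the original list object instead of rebinding the name."""
    original = [1, 2]
    alias = original
    original += [3]
    return alias is original and alias == [1, 2, 3]


if __name__ == "__main__":
    # Made-up data: 2000 small chunks standing in for extracted reply ids.
    chunks = [list(range(20)) for _ in range(2000)]
    for func in (build_with_iadd, build_with_extend, build_with_concat):
        seconds = timeit.timeit(lambda: func(chunks), number=5)
        print("%-20s %.4f s" % (func.__name__, seconds))
    print("+= mutates in place:", iadd_keeps_identity())
```

All three functions return the same list; only the timings should differ, and the gap between the first two is expected to be small.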
--
Steven