


Ajax Request + Write to Json Extremely Slow (Webpage Crawler)

Started by: jonafleuraime@gmail.com
First post: 2016-01-03 03:03 -0800
Last post: 2016-01-04 02:42 +1100
Articles: 2 (2 participants)




#101204 — Ajax Request + Write to Json Extremely Slow (Webpage Crawler)

From: jonafleuraime@gmail.com
Date: 2016-01-03 03:03 -0800
Subject: Ajax Request + Write to Json Extremely Slow (Webpage Crawler)
Message-ID: <43ddcfac-c810-4f85-9b6b-806503ea2b3d@googlegroups.com>

I'm editing a simple scraper that crawls a YouTube video's comment page. The crawler uses Ajax to page through the comments (infinite scroll) and then saves them to a JSON file. Even with a small number of comments (< 5), it still takes 3+ minutes for the comments to be added to the JSON file.

I've tried including requests-cache and using ujson instead of json to see if there are any benefits but there's no noticeable difference.

You can view the code here: http://stackoverflow.com/questions/34575586/how-to-speed-up-ajax-requests-python-youtube-scraper

I'm new to Python so I'm not sure where the bottlenecks are. The finished script will be used to parse through 100,000+ comments so performance is a large factor.

- Would using multithreading solve the issue? If so, how would I refactor this to benefit from it?
- Or is this strictly a network issue?

Thanks!



#101210

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-01-04 02:42 +1100
Message-ID: <5689414a$0$1616$c3e8da3$5496439d@news.astraweb.com>
In reply to: #101204

On Sun, 3 Jan 2016 10:03 pm, jonafleuraime@gmail.com wrote:

> I'm editing a simple scraper that crawls a Youtube video's comment page.
> The crawler uses Ajax to page through comments on the page (infinite
> scroll) and then saves them to a json file. Even with small number of
> comments (< 5), it still takes 3+ min for the comments to be added to the
> json file.
> 
> I've tried including requests-cache and using ujson instead of json to see
> if there are any benefits but there's no noticeable difference.

Before making random changes to the code to see if it speeds it up, try
running it under the profiler and see what it says.

https://pymotw.com/2/profile/index.html#module-profile

https://docs.python.org/2/library/profile.html
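
As a rough sketch of what that looks like, the snippet below profiles a stand-in function with `cProfile` and reports the most expensive calls by cumulative time. The names `fetch_page`, `crawl` and `profile_crawl` are hypothetical placeholders, not functions from the linked scraper, and `time.sleep` merely simulates the network wait. It is written for Python 3.

```python
import cProfile
import io
import pstats
import time


def fetch_page(delay=0.05):
    # Hypothetical stand-in for one Ajax request; sleep simulates network wait.
    time.sleep(delay)
    return "<html></html>"


def crawl(pages=3):
    # Hypothetical stand-in for the scraper's main loop.
    return [fetch_page() for _ in range(pages)]


def profile_crawl():
    # Collect profiling data only while crawl() is running.
    profiler = cProfile.Profile()
    profiler.enable()
    crawl()
    profiler.disable()

    # Sort by cumulative time so the slowest call chains come first.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_crawl())
```

If nearly all of the cumulative time lands in the network call, the program is I/O bound and swapping the JSON library will not help. To profile a whole script without editing it, `python -m cProfile -s cumulative` followed by the script name produces a similar report.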



> You can view the code here:
> http://stackoverflow.com/questions/34575586/how-to-speed-up-ajax-requests-python-youtube-scraper



I see that you already have an answer suggesting that you try threads,
since the process is I/O bound. (The time taken is supposedly dominated by
the time it takes to download data from the internet.) That may be true,
but I also see something which *may* be a warning sign:


    while page_token:
        [...]
        page_token, html = response
        reply_cids += extract_reply_cids(html)


`reply_cids` is a list, and repeatedly calling += on a list *may* be slow.
If += is implemented the naive way, as addition and assignment, it probably
will be slow. This may entirely be a red herring, but if it were my code,
I'd try replacing that last line with:

        reply_cids.extend(extract_reply_cids(html))


and see if it makes any difference. If it doesn't, you can keep the new
version or revert to the version using +=, entirely up to you.
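
For reference, in CPython `+=` on a list is not implemented as addition and assignment: it calls `list.__iadd__`, which extends the existing list in place, so the two spellings should perform about the same. The sketch below is one way to check that on your own machine. `extract_reply_cids` here is a hypothetical stand-in that returns a few fake IDs, not the scraper's real function, and the other function names are made up for illustration.

```python
import timeit


def extract_reply_cids(html):
    # Hypothetical stand-in: the real function parses comment IDs out of HTML.
    return ["cid-1", "cid-2", "cid-3"]


def collect_with_iadd(pages):
    reply_cids = []
    for _ in range(pages):
        reply_cids += extract_reply_cids("")
    return reply_cids


def collect_with_extend(pages):
    reply_cids = []
    for _ in range(pages):
        reply_cids.extend(extract_reply_cids(""))
    return reply_cids


def iadd_is_in_place():
    # If += built a new list, the alias would still point at the old one.
    items = [1, 2]
    alias = items
    items += [3]
    return alias is items and alias == [1, 2, 3]


if __name__ == "__main__":
    print("in place:", iadd_is_in_place())
    for func in (collect_with_iadd, collect_with_extend):
        seconds = timeit.timeit(lambda: func(10000), number=20)
        print(func.__name__, round(seconds, 3))
```

The slow pattern to watch for is `reply_cids = reply_cids + new_items`, which copies the whole list on every iteration.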



-- 
Steven


