Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #107998

Re: Fastest way to retrieve and write html contents to file

From DFS <nospam@dfs.com>
Newsgroups comp.lang.python
Subject Re: Fastest way to retrieve and write html contents to file
Date 2016-05-02 03:37 -0400
Organization A noiseless patient Spider
Message-ID <ng6vsl$1jc$1@dont-email.me> (permalink)
References (4 earlier) <1462166136.1167243.595273897.291B0865@webmail.messagingengine.com> <mailman.303.1462166138.32212.python-list@python.org> <ng6q67$gnm$1@dont-email.me> <1462170452.1180117.595306673.68B64F02@webmail.messagingengine.com> <mailman.308.1462170455.32212.python-list@python.org>

Show all headers | View raw


On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
>> startTime = time.clock()
>> for i in range(loops):
>> 	r = urllib2.urlopen(webpage)
>> 	f = open(webfile,"w")
>> 	f.write(r.read())
>> 	f.close
>> endTime = time.clock()
>> print "Finished urllib2 in %.2g seconds" %(endTime-startTime)
>
> Yeah on my system I get 1.8 out of this, amounting to 0.18s.

You get 1.8 seconds total for the 10 loops?  That's less than half as 
fast as my results.  Surprising.


> I'm again going back to the point of: its fast enough. When comparing
> two small numbers, "twice as slow" is meaningless.

Speed is always meaningful.

I know python is relatively slow, but it's a cool, concise, powerful 
language.  I'm extremely impressed by how tight the code can get.


> You have an assumption you haven't answered, that downloading a 10 meg
> file will be twice as slow as downloading this tiny file. You haven't
> proven that at all.

True.  And it has been my assumption - tho not with 10MB file.


> I suspect you have a constant overhead of X, and in this toy example,
> that makes it seem twice as slow. But when downloading a file of size,
> you'll have the same constant factor, at which point the difference is
> irrelevant.

Good point.  Test below.


> If you believe otherwise, demonstrate it.

http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854 byte file when saved to disk (smaller file was 3546 bytes), 
so this is 16.6x larger.  So I would expect python to linearly run in 
16.6 * 0.88 = 14.6 seconds.

10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

It's a little more than 1/3 of my estimate - so good news.

(when I was doing these tests, some of the python results were 0.75 
seconds - way too fast, so I checked and no data was written to file, 
and I couldn't even open the webpage with a browser.  Looks like I had 
been temporarily blocked from the site.  After a couple minutes, I was 
able to access it again).

I noticed urllib and curl returned the html as is, but urllib2 and 
requests added enhancements that should make the data easier to parse. 
Based on speed and functionality and documentation, I believe I'll be 
using the requests HTTP library (I will actually be doing a small amount 
of web scraping).


VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So python matches or beats VBScript at this much larger file.  Kewl.

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 00:06 -0400
  Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 21:34 -0700
  Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-02 14:40 +1000
    Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 00:50 -0400
      Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 22:00 -0700
        Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 01:04 -0400
          Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-02 15:12 +1000
          Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 22:17 -0700
          Re: Fastest way to retrieve and write html contents to file Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-02 15:57 +1000
  Re: Fastest way to retrieve and write html contents to file Ben Finney <ben+python@benfinney.id.au> - 2016-05-02 14:49 +1000
    Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 01:00 -0400
      Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 22:15 -0700
        Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 01:59 -0400
          Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 23:27 -0700
            Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 03:37 -0400
              Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-02 00:58 -0700
              Re: Fastest way to retrieve and write html contents to file Michael Torrie <torriem@gmail.com> - 2016-05-02 22:06 -0600
                Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-03 00:24 -0400
                Re: Fastest way to retrieve and write html contents to file Tim Chase <python.list@tim.thechases.com> - 2016-05-03 10:28 -0500
                Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-03 13:00 -0400
                Re: Fastest way to retrieve and write html contents to file Tim Chase <python.list@tim.thechases.com> - 2016-05-03 13:41 -0500
                Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-04 02:10 -0400
      Re: Fastest way to retrieve and write html contents to file Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-02 16:05 +1000
        Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 02:47 -0400
          Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-02 17:19 +1000
            Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 21:51 -0400
              Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-03 12:00 +1000
                Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 22:01 -0400
          Re: Fastest way to retrieve and write html contents to file Peter Otten <__peter__@web.de> - 2016-05-02 10:42 +0200
            Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 21:52 -0400
  Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-02 14:53 +1000
  Re: Fastest way to retrieve and write html contents to file Tim Chase <python.list@tim.thechases.com> - 2016-05-02 07:38 -0500

csiph-web