
Fastest way to retrieve and write html contents to file

Started by	DFS <nospam@dfs.com>
First post	2016-05-02 00:06 -0400
Last post	2016-05-02 07:38 -0500
Articles	20 on this page of 32 — 8 participants


  Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 00:06 -0400
    Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 21:34 -0700
    Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-02 14:40 +1000
      Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 00:50 -0400
        Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 22:00 -0700
          Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 01:04 -0400
            Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-02 15:12 +1000
            Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 22:17 -0700
            Re: Fastest way to retrieve and write html contents to file Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-02 15:57 +1000
    Re: Fastest way to retrieve and write html contents to file Ben Finney <ben+python@benfinney.id.au> - 2016-05-02 14:49 +1000
      Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 01:00 -0400
        Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 22:15 -0700
          Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 01:59 -0400
            Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-01 23:27 -0700
              Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 03:37 -0400
                Re: Fastest way to retrieve and write html contents to file Stephen Hansen <me+python@ixokai.io> - 2016-05-02 00:58 -0700
                Re: Fastest way to retrieve and write html contents to file Michael Torrie <torriem@gmail.com> - 2016-05-02 22:06 -0600
                  Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-03 00:24 -0400
                    Re: Fastest way to retrieve and write html contents to file Tim Chase <python.list@tim.thechases.com> - 2016-05-03 10:28 -0500
                      Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-03 13:00 -0400
                        Re: Fastest way to retrieve and write html contents to file Tim Chase <python.list@tim.thechases.com> - 2016-05-03 13:41 -0500
                          Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-04 02:10 -0400
        Re: Fastest way to retrieve and write html contents to file Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-05-02 16:05 +1000
          Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 02:47 -0400
            Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-02 17:19 +1000
              Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 21:51 -0400
                Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-03 12:00 +1000
                  Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 22:01 -0400
            Re: Fastest way to retrieve and write html contents to file Peter Otten <__peter__@web.de> - 2016-05-02 10:42 +0200
              Re: Fastest way to retrieve and write html contents to file DFS <nospam@dfs.com> - 2016-05-02 21:52 -0400
    Re: Fastest way to retrieve and write html contents to file Chris Angelico <rosuav@gmail.com> - 2016-05-02 14:53 +1000
    Re: Fastest way to retrieve and write html contents to file Tim Chase <python.list@tim.thechases.com> - 2016-05-02 07:38 -0500


#107967 — Fastest way to retrieve and write html contents to file

From	DFS <nospam@dfs.com>
Date	2016-05-02 00:06 -0400
Subject	Fastest way to retrieve and write html contents to file
Message-ID	<ng6jie$1ap$1@dont-email.me>

I posted a little while ago about how short the python code was:

-------------------------------------
import urllib
urllib.urlretrieve(webpage, filename)
-------------------------------------

Which is very sweet compared to the VBScript version:

------------------------------------------------------
Option Explicit
Dim xmlHTTP, fso, fOut
Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
xmlHTTP.Open "GET", webpage
xmlHTTP.Send
Set fso = CreateObject("Scripting.FileSystemObject")
Set fOut = fso.CreateTextFile(filename, True)
fOut.WriteLine xmlHTTP.ResponseText
fOut.Close
Set fOut = Nothing
Set fso  = Nothing
Set xmlHTTP = Nothing
------------------------------------------------------

Then I tested them in loops - the VBScript is MUCH faster: 0.44 seconds for 10 
iterations, vs 0.88 seconds for python.

webpage = 'http://econpy.pythonanywhere.com/ex/001.html'


So I tried:
---------------------------
import urllib2
r = urllib2.urlopen(webpage)
f = open(filename,"w")
f.write(r.read())
f.close()
---------------------------
and
---------------------------
import requests
r = requests.get(webpage)
f = open(filename,"w")
f.write(r.text)
f.close()
---------------------------
and
---------------------------------
import pycurl
with open(filename, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, webpage)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()
---------------------------------

urllib2 and requests were about the same speed as urllib.urlretrieve, 
while pycurl was significantly slower (1.2 seconds).
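
For comparison, here is a minimal Python 3 sketch of the urllib2 variant above, using context managers so that the response and the output file are always closed. It is only an illustration: `save_url` is a made-up helper name, the code in this thread targets Python 2.7, and nothing here is claimed to be faster than the snippets above.

```python
import shutil
import urllib.request


def save_url(url, filename):
    """Stream the response body for `url` into `filename` as raw bytes."""
    # Both context managers close their resource even if an error occurs.
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        # Copy in chunks rather than reading the whole body into memory.
        shutil.copyfileobj(response, out)
```

Opening the file in binary mode writes the bytes exactly as the server sent them, which avoids newline translation on Windows.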

I'm running Win 8.1.  python 2.7.11 32-bit.

I know it's asking a lot, but is there a really fast AND really short 
python solution for this simple thing?


Thanks!


#107969

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-01 21:34 -0700
Message-ID	<mailman.296.1462163673.32212.python-list@python.org>
In reply to	#107967

On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10 
> iterations, vs 0.88 for python.
...
> I know it's asking a lot, but is there a really fast AND really short 
> python solution for this simple thing?

0.88 is not fast enough for you? That's less than a second.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


#107970

From	Chris Angelico <rosuav@gmail.com>
Date	2016-05-02 14:40 +1000
Message-ID	<mailman.297.1462164034.32212.python-list@python.org>
In reply to	#107967

On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen <me+python@ixokai.io> wrote:
> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
>> iterations, vs 0.88 for python.
> ...
>> I know it's asking a lot, but is there a really fast AND really short
>> python solution for this simple thing?
>
> 0.88 is not fast enough for you? That's less than a second.

Also, this is timings of network and disk operations. Unless something
pathological is happening, the language used won't make any
difference.
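
One way to check this is to time the network phase and the disk phase separately. The following is a rough Python 3 sketch under stated assumptions: `time_phases` is a hypothetical helper, and `fetch` stands for any zero-argument callable that returns the page as bytes.

```python
import time


def time_phases(fetch, filename):
    """Return (fetch_seconds, write_seconds) for one download-and-save."""
    start = time.perf_counter()
    body = fetch()                      # network phase: must return bytes
    fetched = time.perf_counter()
    # Disk phase: includes open and close; the OS may still cache the write.
    with open(filename, "wb") as out:
        out.write(body)
    written = time.perf_counter()
    return fetched - start, written - fetched
```

It could be called as `time_phases(lambda: urllib.request.urlopen(url).read(), "page.html")`. If the first number dwarfs the second, the interpreter is unlikely to be the bottleneck.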

ChrisA


#107972

From	DFS <nospam@dfs.com>
Date	2016-05-02 00:50 -0400
Message-ID	<ng6m4q$743$1@dont-email.me>
In reply to	#107970

On 5/2/2016 12:40 AM, Chris Angelico wrote:
> On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen <me+python@ixokai.io> wrote:
>> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
>>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
>>> iterations, vs 0.88 for python.
>> ...
>>> I know it's asking a lot, but is there a really fast AND really short
>>> python solution for this simple thing?
>>
>> 0.88 is not fast enough for you? That's less than a second.
>
> Also, this is timings of network and disk operations. Unless something
> pathological is happening, the language used won't make any
> difference.
>
> ChrisA


Unfortunately, the VBScript is twice as fast as any python method.


#107975

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-01 22:00 -0700
Message-ID	<mailman.300.1462165235.32212.python-list@python.org>
In reply to	#107972

On Sun, May 1, 2016, at 09:50 PM, DFS wrote:
> On 5/2/2016 12:40 AM, Chris Angelico wrote:
> > On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen <me+python@ixokai.io> wrote:
> >> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
> >>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
> >>> iterations, vs 0.88 for python.
> >> ...
> >>> I know it's asking a lot, but is there a really fast AND really short
> >>> python solution for this simple thing?
> >>
> >> 0.88 is not fast enough for you? That's less than a second.
> >
> > Also, this is timings of network and disk operations. Unless something
> > pathological is happening, the language used won't make any
> > difference.
> >
> > ChrisA
> 
> 
> Unfortunately, the VBScript is twice as fast as any python method.

And 0.2 is twice as fast as 0.1. When you have two small numbers, 'twice
as fast' isn't particularly meaningful as a metric. 

-- 
Stephen Hansen
  m e @ i x o k a i . i o


#107978

From	DFS <nospam@dfs.com>
Date	2016-05-02 01:04 -0400
Message-ID	<ng6mu7$97m$1@dont-email.me>
In reply to	#107975

On 5/2/2016 1:00 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 09:50 PM, DFS wrote:
>> On 5/2/2016 12:40 AM, Chris Angelico wrote:
>>> On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen <me+python@ixokai.io> wrote:
>>>> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
>>>>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
>>>>> iterations, vs 0.88 for python.
>>>> ...
>>>>> I know it's asking a lot, but is there a really fast AND really short
>>>>> python solution for this simple thing?
>>>>
>>>> 0.88 is not fast enough for you? That's less than a second.
>>>
>>> Also, this is timings of network and disk operations. Unless something
>>> pathological is happening, the language used won't make any
>>> difference.
>>>
>>> ChrisA
>>
>>
>> Unfortunately, the VBScript is twice as fast as any python method.
>
> And 0.2 is twice as fast as 0.1. When you have two small numbers, 'twice
> as fast' isn't particularly meaningful as a metric.

0.2 is half as fast as 0.1, here.

And two small numbers turn into bigger numbers when the webpage is big, 
and soon the download time differences are measured in minutes, not half 
a second.

So, any ideas?


#107980

From	Chris Angelico <rosuav@gmail.com>
Date	2016-05-02 15:12 +1000
Message-ID	<mailman.302.1462165945.32212.python-list@python.org>
In reply to	#107978

On Mon, May 2, 2016 at 3:04 PM, DFS <nospam@dfs.com> wrote:
> And two small numbers turn into bigger numbers when the webpage is big, and
> soon the download time differences are measured in minutes, not half a
> second.
>
> So, any ideas?

So, measure with bigger web pages, and find out whether it's really a
2:1 ratio or a half-second difference. When download times are
measured in minutes, a half second difference is insignificant.

Extrapolating is dangerous.
https://xkcd.com/605/

ChrisA


#107982

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-01 22:17 -0700
Message-ID	<mailman.304.1462166263.32212.python-list@python.org>
In reply to	#107978

On Sun, May 1, 2016, at 10:04 PM, DFS wrote:
> And two small numbers turn into bigger numbers when the webpage is big, 
> and soon the download time differences are measured in minutes, not half 
> a second.

Are you sure of that? Have you determined whether the time is a
constant overhead or directly proportional to the size of the page?
If so, how have you determined that?

You aren't showing how you're testing. A 0.4s difference is meaningless to
me if it's a constant overhead. If it's twice as slow for a 1 meg file,
then you might have an issue. Maybe. You haven't shown that.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


#107988

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2016-05-02 15:57 +1000
Message-ID	<5726ec2f$0$2905$c3e8da3$76491128@news.astraweb.com>
In reply to	#107978

On Monday 02 May 2016 15:04, DFS wrote:

> 0.2 is half as fast as 0.1, here.
> 
> And two small numbers turn into bigger numbers when the webpage is big,
> and soon the download time differences are measured in minutes, not half
> a second.

It takes twice as long to screw a screw into timber than to hammer a nail 
into the same timber.

Therefore if builders change from nails to screws, they can finish building 
the house in half the time.

-- 
Steve


#107971

From	Ben Finney <ben+python@benfinney.id.au>
Date	2016-05-02 14:49 +1000
Message-ID	<mailman.298.1462164614.32212.python-list@python.org>
In reply to	#107967

DFS <nospam@dfs.com> writes:

> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
> iterations, vs 0.88 for python.
>
> […]
>
> urllib2 and requests were about the same speed as urllib.urlretrieve,
> while pycurl was significantly slower (1.2 seconds).

Network access is notoriously erratic in its timing. The program, and
the machine on which it runs, is subject to a great many external
effects once the request is sent — effects which will significantly
alter the delay before a response is completed.

How have you controlled for the wide variability in the duration, for
even a given request by the *same code on the same machine*, at
different points in time?

One simple way to do that: Run the exact same test many times (say,
10 000 or so) on the same machine, and then compute the average of all
the durations.

Do the same for each different program, and then you may have more
meaningfully comparable measurements.
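
The procedure described above can be sketched roughly as follows in Python 3. The helper names `sample_durations` and `summarize` are invented for illustration, and `task` stands for whatever download-and-save callable is being measured.

```python
import statistics
import time


def sample_durations(task, runs):
    """Call `task()` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the mean, median and sample standard deviation of the durations."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        # stdev needs at least two samples.
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two programs is larger than the run-to-run noise.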

-- 
 \     “We are no more free to believe whatever we want about God than |
  `\         we are free to adopt unjustified beliefs about science or |
_o__)              history […].” —Sam Harris, _The End of Faith_, 2004 |
Ben Finney


#107976

From	DFS <nospam@dfs.com>
Date	2016-05-02 01:00 -0400
Message-ID	<ng6mn7$8nv$1@dont-email.me>
In reply to	#107971

On 5/2/2016 12:49 AM, Ben Finney wrote:
> DFS <nospam@dfs.com> writes:
>
>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
>> iterations, vs 0.88 for python.
>>
>> […]
>>
>> urllib2 and requests were about the same speed as urllib.urlretrieve,
>> while pycurl was significantly slower (1.2 seconds).
>
> Network access is notoriously erratic in its timing. The program, and
> the machine on which it runs, is subject to a great many external
> effects once the request is sent — effects which will significantly
> alter the delay before a response is completed.
>
> How have you controlled for the wide variability in the duration, for
> even a given request by the *same code on the same machine*, at
> different points in time?
>
> One simple way to do that: Run the exact same test many times (say,
> 10 000 or so) on the same machine, and then compute the average of all
> the durations.
>
> Do the same for each different program, and then you may have more
> meaningfully comparable measurements.


I tried the 10-loop test several times with all versions.

The results were 100% consistent: VBSCript xmlHTTP was always 2x faster 
than any python method.


#107981

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-01 22:15 -0700
Message-ID	<mailman.303.1462166138.32212.python-list@python.org>
In reply to	#107976

On Sun, May 1, 2016, at 10:00 PM, DFS wrote:
> I tried the 10-loop test several times with all versions.

Also how, _exactly_, are you testing this?

C:\Python27>python -m timeit "filename='C:\\test.txt';
webpage='http://econpy.pythonanywhere.com/ex/001.html'; import urllib2;
r = urllib2.urlopen(webpage); f = open(filename, 'w');
f.write(r.read()); f.close();"
10 loops, best of 3: 175 msec per loop

That's a whole lot less than 0.88 secs.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


#107989

From	DFS <nospam@dfs.com>
Date	2016-05-02 01:59 -0400
Message-ID	<ng6q67$gnm$1@dont-email.me>
In reply to	#107981

On 5/2/2016 1:15 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 10:00 PM, DFS wrote:
>> I tried the 10-loop test several times with all versions.
>
> Also how, _exactly_, are you testing this?
>
> C:\Python27>python -m timeit "filename='C:\\test.txt';
> webpage='http://econpy.pythonanywhere.com/ex/001.html'; import urllib2;
> r = urllib2.urlopen(webpage); f = open(filename, 'w');
> f.write(r.read()); f.close();"
> 10 loops, best of 3: 175 msec per loop
>
> That's a whole lot less than 0.88 secs.

Indeed.


---------------------------------------------------------------------
import requests, urllib, urllib2, pycurl
import time

webpage = "http://econpy.pythonanywhere.com/ex/001.html"
webfile = "D:\\econpy001.html"
loops   = 10

startTime = time.clock()	
for i in range(loops):
	urllib.urlretrieve(webpage,webfile)
endTime = time.clock()		
print "Finished urllib in %.2g seconds" %(endTime-startTime)

startTime = time.clock()	
for i in range(loops):
	r = urllib2.urlopen(webpage)
	f = open(webfile,"w")
	f.write(r.read())
	f.close()
endTime = time.clock()		
print "Finished urllib2 in %.2g seconds" %(endTime-startTime)

startTime = time.clock()	
for i in range(loops):
	r = requests.get(webpage)
	f = open(webfile,"w")
	f.write(r.text)
	f.close()
endTime = time.clock()		
print "Finished requests in %.2g seconds" %(endTime-startTime)

startTime = time.clock()	
for i in range(loops):
	with open(webfile + str(i) + ".txt", 'wb') as f:
		c = pycurl.Curl()
		c.setopt(c.URL, webpage)
		c.setopt(c.WRITEDATA, f)
		c.perform()
		c.close()
endTime = time.clock()		
print "Finished pycurl in %.2g seconds" %(endTime-startTime)
---------------------------------------------------------------------

$ python getHTML.py
Finished urllib in 0.88 seconds
Finished urllib2 in 0.83 seconds
Finished requests in 0.89 seconds
Finished pycurl in 1.1 seconds

Those results are consistent.  They go up or down a little, but never 
below 0.82 seconds (for urllib2), or above 1.2 seconds (for pycurl).

VBScript is consistently 0.44 to 0.48 seconds.


#107992

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-01 23:27 -0700
Message-ID	<mailman.308.1462170455.32212.python-list@python.org>
In reply to	#107989

On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
> startTime = time.clock()        
> for i in range(loops):
> 	r = urllib2.urlopen(webpage)
> 	f = open(webfile,"w")
> 	f.write(r.read())
> 	f.close()
> endTime = time.clock()          
> print "Finished urllib2 in %.2g seconds" %(endTime-startTime)

Yeah on my system I get 1.8 out of this, amounting to 0.18s. 

I'm again going back to the point of: it's fast enough. When comparing
two small numbers, "twice as slow" is meaningless.

You have an assumption you haven't answered, that downloading a 10 meg
file will be twice as slow as downloading this tiny file. You haven't
proven that at all. 

I suspect you have a constant overhead of X, and in this toy example,
that makes it seem twice as slow. But when downloading a file of size,
you'll have the same constant factor, at which point the difference is
irrelevant. 

If you believe otherwise, demonstrate it.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


#107998

From	DFS <nospam@dfs.com>
Date	2016-05-02 03:37 -0400
Message-ID	<ng6vsl$1jc$1@dont-email.me>
In reply to	#107992

On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
>> startTime = time.clock()
>> for i in range(loops):
>> 	r = urllib2.urlopen(webpage)
>> 	f = open(webfile,"w")
>> 	f.write(r.read())
>> 	f.close()
>> endTime = time.clock()
>> print "Finished urllib2 in %.2g seconds" %(endTime-startTime)
>
> Yeah on my system I get 1.8 out of this, amounting to 0.18s.

You get 1.8 seconds total for the 10 loops?  That's less than half as 
fast as my results.  Surprising.

> I'm again going back to the point of: it's fast enough. When comparing
> two small numbers, "twice as slow" is meaningless.

Speed is always meaningful.

I know python is relatively slow, but it's a cool, concise, powerful 
language.  I'm extremely impressed by how tight the code can get.

> You have an assumption you haven't answered, that downloading a 10 meg
> file will be twice as slow as downloading this tiny file. You haven't
> proven that at all.

True.  And it has been my assumption - tho not with 10MB file.

> I suspect you have a constant overhead of X, and in this toy example,
> that makes it seem twice as slow. But when downloading a file of size,
> you'll have the same constant factor, at which point the difference is
> irrelevant.

Good point.  Test below.

> If you believe otherwise, demonstrate it.

http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854 byte file when saved to disk (smaller file was 3546 bytes), 
so this is 16.6x larger.  If time scaled linearly with size, I would expect 
python to run in 16.6 * 0.88 = 14.6 seconds.

10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

It's a little more than 1/3 of my estimate - so good news.
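
The estimate above assumes time is purely proportional to size. The alternative raised earlier in the thread, a fixed overhead plus a per-byte cost, can be sketched by fitting a straight line through two of the measurements quoted in this post. This is a rough illustration only: `fit_overhead_model` is a made-up helper, and the two pages come from different servers, so the fitted numbers should not be read as a real measurement of Python's overhead.

```python
def fit_overhead_model(size_a, time_a, size_b, time_b):
    """Fit time = overhead + per_byte * size through two (size, time) points."""
    per_byte = (time_b - time_a) / (size_b - size_a)
    overhead = time_a - per_byte * size_a
    return overhead, per_byte


# Per-request figures taken from this thread: 10 loops took about 0.88 s for
# the 3546-byte page and about 5.6 s for the 58854-byte page.
overhead, per_byte = fit_overhead_model(3546, 0.88 / 10, 58854, 5.6 / 10)
print("fixed overhead per request: %.3f s" % overhead)
print("cost per KiB: %.4f s" % (per_byte * 1024))
```

A nonzero fixed overhead would explain why the larger page took less time than a pure 16.6x scaling predicts.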

(when I was doing these tests, some of the python results were 0.75 
seconds - way too fast, so I checked and no data was written to file, 
and I couldn't even open the webpage with a browser.  Looks like I had 
been temporarily blocked from the site.  After a couple minutes, I was 
able to access it again).

I noticed urllib and curl returned the html as is, but urllib2 and 
requests added enhancements that should make the data easier to parse. 
Based on speed and functionality and documentation, I believe I'll be 
using the requests HTTP library (I will actually be doing a small amount 
of web scraping).

VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So python matches or beats VBScript at this much larger file.  Kewl.


#108000

From	Stephen Hansen <me+python@ixokai.io>
Date	2016-05-02 00:58 -0700
Message-ID	<mailman.312.1462175886.32212.python-list@python.org>
In reply to	#107998

On Mon, May 2, 2016, at 12:37 AM, DFS wrote:
> On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> > I'm again going back to the point of: it's fast enough. When comparing
> > two small numbers, "twice as slow" is meaningless.
> 
> Speed is always meaningful.
> 
> I know python is relatively slow, but it's a cool, concise, powerful 
> language.  I'm extremely impressed by how tight the code can get.

I'm sorry, but no. Speed is not always meaningful. 

It's not even usually meaningful, because you can't quantify what "speed
is". In context, you're claiming this is twice as slow (even though my
tests show dramatically better performance), but what details are
different?

You're ignoring the fact that Python might have a constant overhead --
meaning, for a 1k download, it might have X speed cost. For a 1meg
download, it might still have the exact same X cost.

Looking narrowly, that overhead looks like "twice as slow", but that's
not meaningful at all. Looking larger, that overhead is a pittance.

You aren't measuring that.

> > You have an assumption you haven't answered, that downloading a 10 meg
> > file will be twice as slow as downloading this tiny file. You haven't
> > proven that at all.
> 
> True.  And it has been my assumption - tho not with 10MB file.

And that assumption is completely invalid.

> I noticed urllib and curl returned the html as is, but urllib2 and 
> requests added enhancements that should make the data easier to parse. 
> Based on speed and functionality and documentation, I believe I'll be 
> using the requests HTTP library (I will actually be doing a small amount 
> of web scraping).

The requests library's added-value is ease-of-use, and its overhead is
likely tiny: so using it means you spend less effort making a thing
happen. I recommend you embrace this. 

> VBScript
> 1st run: 7.70 seconds
> 2nd run: 5.38
> 3rd run: 7.71
> 
> So python matches or beats VBScript at this much larger file.  Kewl.

This is what I'm talking about: Python might have a constant overhead,
but looking at larger operations, it's probably comparable. Not fast,
mind you. Python isn't the fastest language out there. But in real world
work, it's usually fast enough.

-- 
Stephen Hansen
  m e @ i x o k a i . i o


#108047

From	Michael Torrie <torriem@gmail.com>
Date	2016-05-02 22:06 -0600
Message-ID	<mailman.335.1462248430.32212.python-list@python.org>
In reply to	#107998

On 05/02/2016 01:37 AM, DFS wrote:
> So python matches or beats VBScript at this much larger file.  Kewl.

If you download something large enough to be meaningful, you'll find the
runtime speeds should all converge to something showing your internet
connection speed.  Try downloading a 4 GB file, for example.  You're
trying to benchmark an io-bound operation.  After you move past the very
small and meaningless examples that simply benchmark the overhead of the
connection building, you'll find that all languages, even compiled
languages like C, should run at the same speed on average.  Neither VBS
nor Python will be faster than the other.

Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.
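
The convergence argument can be illustrated with a toy model in which each download costs a fixed setup overhead plus the transfer time. All numbers below are hypothetical (88 ms vs 44 ms of overhead on a 1 MB/s link), and the function names are invented for this sketch.

```python
def download_time(size_bytes, overhead_s, bandwidth_bytes_per_s):
    """Model one download as fixed setup overhead plus transfer time."""
    return overhead_s + size_bytes / bandwidth_bytes_per_s


def slowdown_ratio(size_bytes, overhead_slow, overhead_fast, bandwidth):
    """Ratio of total times for two clients that differ only in overhead."""
    slow = download_time(size_bytes, overhead_slow, bandwidth)
    fast = download_time(size_bytes, overhead_fast, bandwidth)
    return slow / fast


# Hypothetical numbers: 88 ms vs 44 ms overhead on a 1 MB/s link.
for size in (4 * 10**3, 4 * 10**6, 4 * 10**9):
    print(size, round(slowdown_ratio(size, 0.088, 0.044, 10**6), 4))
```

Under this model the ratio approaches 1 as the file grows, because the transfer term swamps the difference in overhead.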


#108048

From	DFS <nospam@dfs.com>
Date	2016-05-03 00:24 -0400
Message-ID	<ng98um$hjh$1@dont-email.me>
In reply to	#108047

On 5/3/2016 12:06 AM, Michael Torrie wrote:

> Now if you want to talk about processing the data once you have it,
> there we can talk about speeds and optimization.

Be glad to.  Helps me learn python, so bring whatever challenge you want 
and I'll try to keep up.

One small comparison I was able to make was VBA vs python/pyodbc to 
summarize an Access database.  Not quite a fair test, but interesting 
nonetheless.

---------------------------------------------------

Access 2003 file
Access 2003 VBA code

2,099,101 rows
114 tables (max row = 600288)
971 columns
   text:      503
   boolean:   4
   numeric:   351
   date-time: 108
   binary:    5
309 indexes (25 foreign keys)
333,549,568 bytes on disk
Time: 0.18 seconds

---------------------------------------------------

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6

2,099,101 rows
114 tables (max row = 600288)
971 columns
   text:      503
   numeric:   351
   date-time: 108
   binary:    5
   boolean:   4
309 indexes (foreign keys n/a via ODBC*)
333,549,568 bytes on disk
Time: 0.49 seconds

* the Access ODBC driver doesn't support
   the SQLForeignKeys function

---------------------------------------------------


#108084

From	Tim Chase <python.list@tim.thechases.com>
Date	2016-05-03 10:28 -0500
Message-ID	<mailman.352.1462294245.32212.python-list@python.org>
In reply to	#108048

On 2016-05-03 00:24, DFS wrote:
> One small comparison I was able to make was VBA vs python/pyodbc to 
> summarize an Access database.  Not quite a fair test, but
> interesting nonetheless.
> 
> Access 2003 file
> Access 2003 VBA code
> Time: 0.18 seconds
>
> same Access 2003 file
> 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
> Time: 0.49 seconds

Curious whether you're forcing Access VBA to talk over ODBC or
whether Access is using native access/file-handling (and thus
bypassing the ODBC overhead)?

-tkc


#108087

From	DFS <nospam@dfs.com>
Date	2016-05-03 13:00 -0400
Message-ID	<ngal91$8c4$2@dont-email.me>
In reply to	#108084

On 5/3/2016 11:28 AM, Tim Chase wrote:
> On 2016-05-03 00:24, DFS wrote:
>> One small comparison I was able to make was VBA vs python/pyodbc to
>> summarize an Access database.  Not quite a fair test, but
>> interesting nonetheless.
>>
>> Access 2003 file
>> Access 2003 VBA code
>> Time: 0.18 seconds
>>
>> same Access 2003 file
>> 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
>> Time: 0.49 seconds
>
> Curious whether you're forcing Access VBA to talk over ODBC or
> whether Access is using native access/file-handling (and thus
> bypassing the ODBC overhead)?


The latter, which is why I said "not quite a fair test".

