From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Fastest way to retrieve and write html contents to file
Date: Mon, 2 May 2016 17:19:01 +1000
References: <85vb2xgj2i.fsf@benfinney.id.au> <5726ee33$0$1617$c3e8da3$5496439d@news.astraweb.com>

On Mon, May 2, 2016 at 4:47 PM, DFS wrote:
> I'm not specifying a local web cache with either (wouldn't know how or where
> to look). If you have Windows, you can try it.
> -------------------------------------------------------------------
> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>   Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>   xmlHTTP.Open "GET", webpage
>   xmlHTTP.Send
>   Set fso = CreateObject("Scripting.FileSystemObject")
>   Set fOut = fso.CreateTextFile(webfile, True)
>   fOut.WriteLine xmlHTTP.ResponseText
>   fOut.Close
>   Set fOut = Nothing
>   Set fso = Nothing
>   Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime,3) & " seconds"
> -------------------------------------------------------------------

There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times.
If your numbers are perfectly fair, they should be perfectly linear in
the iteration count; e.g. a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (e.g. 17-19
seconds, but not 2 seconds).

Then the next thing to test would be to create a deliberately slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:

--------
import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type","text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2) # the deliberate delay, partway through the body
        self.wfile.write(b"world!")

# Listen on all interfaces, port 1234
server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()
-------

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then point your test scripts at
that URL. (Be sure to set them back to low iteration counts first!) If
the times are true and fair, they should all come out pretty much the
same - ten iterations, twenty seconds. And since all that's changed is
the server, this will be an accurate demonstration of what happens in
the real world: network requests aren't always fast.

Incidentally, you can also watch the server's log to see if it's
getting the appropriate number of requests.

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 304 Not Modified responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA
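
For the Python side of the comparison, here is a minimal sketch of a
timing loop that could be used for the linearity and slow-server checks
described above, plus a rough way to see how long a name lookup takes.
It assumes Python 3; the helper names (time_downloads,
dns_lookup_seconds) and the output path are made up for illustration
and are not taken from anyone's actual test scripts. Bypassing any
configured proxy is an extra assumption on my part, to take one more
possible cache out of the picture.

```python
import socket
import time
import urllib.request

# Hypothetical helpers for illustration only.

# Ignore any configured proxy so a caching proxy cannot skew the timings.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def time_downloads(url, path, iterations):
    """Fetch url and write the body to path, repeated `iterations` times.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for _ in range(iterations):
        with _opener.open(url) as response:
            body = response.read()
        with open(path, "wb") as out:
            out.write(body)
    return time.perf_counter() - start


def dns_lookup_seconds(hostname):
    """Time a single name lookup, as a rough gauge of what DNS might cost."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, 80)
    return time.perf_counter() - start
```

Calling time_downloads(url, path, 10) and then again with 100 against
the original URL should give times roughly a factor of ten apart if
nothing is being cached. Pointed at http://127.0.0.1:1234/ with the
slow server running, ten iterations should take about twenty seconds.
Note that a second call to dns_lookup_seconds for the same name may
come back much faster than the first if the operating system caches
lookups, which is the effect described above.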