From: Chris Angelico
Date: Wed, 16 Sep 2015 19:40:14 +1000
Subject: Re: Reading in large logfiles, and processing lines in batches - maximising throughput?
Newsgroups: comp.lang.python

On Wed, Sep 16, 2015 at 7:27 PM, Victor Hooi wrote:
> Also, I originally used grouper because I thought it better to process lines in batches, rather than individually. However, is there actually any throughput advantage from doing it this way in Python? Or is there a better way of getting better throughput?

I very much doubt it'll improve throughput; what you're doing there is reading individual lines, then batching them up into blocks of 1000, and then stepping through the batches. In terms of disk read performance, you're already covered, because the file object should be buffered; if you're not doing much actual work in Python, that's probably where your bottleneck is.
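To make the comparison concrete, here is a minimal sketch of the two shapes of loop. It assumes the grouper in question is the well-known itertools recipe (the original code isn't shown here), and process() is a hypothetical stand-in for whatever the real per-line work is. Both versions pull lines from the same buffered file object one at a time; the batched one just adds a layer of tuple-building on top.

```python
import io
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length chunks (assumed: the itertools recipe)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def process(line):
    """Hypothetical stand-in for the real per-line work."""
    return len(line)


def total_batched(lines, batch_size=1000):
    """Batch the lines into tuples, then step through each batch."""
    total = 0
    for batch in grouper(lines, batch_size):
        for line in batch:
            if line is not None:  # skip the padding in the final batch
                total += process(line)
    return total


def total_plain(lines):
    """Iterate over the lines directly; the file object does the buffering."""
    return sum(process(line) for line in lines)


# In-memory sample standing in for a log file.
sample = "".join(f"line {i}\n" for i in range(2500))
print(total_batched(io.StringIO(sample)) == total_plain(io.StringIO(sample)))
```

With a real file, the plain version is just "with open(filename) as f: total_plain(f)". The two give the same answer; the plain one is simply less code to read and maintain.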
But keep in mind the basic rules of performance optimization:

1) Don't.
2) For experts only: Don't yet.
3) Measure first.

If you remember only the first rule, you're going to be correct most of the time. Write your code to be idiomatic and clean, and *don't worry* about performance.

The second rule comes into play once you have a fully working program, and you find that it's running too slowly. (For example, you run "cat filename >/dev/null" and it takes half a second, but you run your program on the same input file and it takes half a day.) Okay, so you know your program needs some work. But which parts of it are actually taking the time? If you just stare at your code and make a guess, *you will be wrong*.

So you follow the third rule: Add a boatload of timing marks to the code. They'll slow it down, of course, but you'll usually find that large slabs of the code are so fast you can't even measure the time they're taking, so there's no point optimizing them in any way.

Only once you've proven (a) that your program is "too slow" (for some measure of "slow"), and (b) that it's _this part_ that's taking the bulk of the time, *then* you can start improving performance.

So get rid of the grouper; it's violating all three rules. Give the program a try without it, and see if you actually have a problem at all. Maybe you don't!

ChrisA
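
As a footnote to the third rule, here is one rough sketch of what "timing marks" might look like. Everything named here (timed, timings, parse, handle, run) is made up for illustration; the idea is only to accumulate wall-clock time per labelled section with time.perf_counter() and then see which label dominates.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated seconds per label (hypothetical helper, not from any library).
timings = defaultdict(float)


@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


def parse(line):
    """Hypothetical parsing step."""
    return line.split()


def handle(fields):
    """Hypothetical processing step."""
    return len(fields)


def run(lines):
    """Process every line, timing each stage separately."""
    count = 0
    for line in lines:
        with timed("parse"):
            fields = parse(line)
        with timed("handle"):
            count += handle(fields)
    return count


run(["a b c\n"] * 1000)
# Report the slowest section first.
for label, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{label:10s} {seconds:.6f}s")
```

The marks themselves cost time, so treat the numbers as relative: they tell you which section is worth looking at, not exactly how long it takes.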