Path: csiph.com!goblin2!goblin.stu.neva.ru!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!nzpost1.xs4all.net!not-for-mail
MIME-Version: 1.0
In-Reply-To: <c18cdeb3-58f7-4dc4-82e7-b45b34f1c813@googlegroups.com>
References: <c18cdeb3-58f7-4dc4-82e7-b45b34f1c813@googlegroups.com>
Date: Wed, 16 Sep 2015 19:40:14 +1000
Subject: Re: Reading in large logfiles, and processing lines in batches - maximising throughput?
From: Chris Angelico <rosuav@gmail.com>
Cc: "python-list@python.org" <python-list@python.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.627.1442396416.8327.python-list@python.org>
Lines: 41
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:96669

On Wed, Sep 16, 2015 at 7:27 PM, Victor Hooi <victorhooi@gmail.com> wrote:
> Also, I originally used grouper because I thought it better to process
> lines in batches, rather than individually. However, is there actually
> any throughput advantage from doing it this way in Python? Or is there a
> better way of getting better throughput?
>

I very much doubt it'll improve throughput; what you're doing there is
reading individual lines, then batching them up into blocks of 1000,
and then stepping through the batches. In terms of disk read
performance, you're already covered, because the file object should be
buffered; and if you're not doing much actual work in Python, the disk
read is probably where your bottleneck is anyway. But keep in mind the
basic rules of performance optimization:

1) Don't.
2) For experts only: Don't yet.
3) Measure first.

If you remember only the first rule, you're going to be correct most
of the time. Write your code to be idiomatic and clean, and *don't
worry* about performance. The second rule comes into play once you
have a fully working program, and you find that it's running too
slowly. (For example, you run "cat filename >/dev/null" and it takes
half a second, but you run your program on the same input file and it
takes half a day.) Okay, so you know your program needs some work. But
which parts of it are actually taking the time? If you just stare at
your code and make a guess, *you will be wrong*. So you follow the
third rule: Add a boatload of timing marks to the code. They'll slow
it down, of course, but you'll usually find that large slabs of the
code are so fast you can't even measure the time they're taking, so
there's no point optimizing them in any way. Only once you've proven
(a) that your program is "too slow" (for some measure of "slow"), and
(b) that it's _this part_ that's taking the bulk of the time, *then*
you can start improving performance.
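
As a rough sketch of what I mean by timing marks (the stage names and
parse_line() here are made up for illustration, they're not from your
code), you could accumulate time.perf_counter() deltas per stage and
print the totals at the end:

```python
import time
from collections import defaultdict

# Hypothetical accumulator: stage name -> total seconds spent there.
timings = defaultdict(float)

def parse_line(line):
    # Hypothetical stand-in for whatever per-line work the real program does.
    return line.split()

def process_lines(lines):
    """Process lines one at a time, adding up the time spent in each stage."""
    count = 0
    for line in lines:
        t0 = time.perf_counter()
        fields = parse_line(line)
        t1 = time.perf_counter()
        if fields:  # count non-blank lines
            count += 1
        t2 = time.perf_counter()
        timings["parse"] += t1 - t0
        timings["count"] += t2 - t1
    return count

if __name__ == "__main__":
    sample = ["a b c\n", "\n", "d e\n"]
    print(process_lines(sample))
    for stage, seconds in sorted(timings.items()):
        print(f"{stage}: {seconds:.6f}s")
```

Whichever stage dominates the totals is the only one worth looking at;
the standard library's cProfile module will give you similar numbers
per function without any hand-placed marks.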

So get rid of the grouper; it's violating all three rules. Give the
program a try without it, and see if you actually have a problem at
all. Maybe you don't!
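
Something like this is what I'd start with, assuming your per-line
work boils down to a simple test (count_matching, count_in_file and
the "needle" check are placeholders of mine, not your actual logic):

```python
def count_matching(lines, needle):
    """Count the lines containing needle, one line at a time (no batching)."""
    return sum(1 for line in lines if needle in line)

def count_in_file(path, needle):
    # Iterating over the file object yields one line at a time; the
    # underlying reads are already buffered, so no grouper is needed.
    with open(path, encoding="utf-8", errors="replace") as f:
        return count_matching(f, needle)
```

Time that against your real logfile before adding anything cleverer.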

ChrisA