
Reading in large logfiles, and processing lines in batches - maximising throughput?

Started by: Victor Hooi <victorhooi@gmail.com>
First post: 2015-09-16 02:27 -0700
Last post: 2015-09-16 19:40 +1000
Articles: 2 (2 participants)


#96667 — Reading in large logfiles, and processing lines in batches - maximising throughput?

From: Victor Hooi <victorhooi@gmail.com>
Date: 2015-09-16 02:27 -0700
Subject: Reading in large logfiles, and processing lines in batches - maximising throughput?
Message-ID: <c18cdeb3-58f7-4dc4-82e7-b45b34f1c813@googlegroups.com>
I'm using Python to parse metrics out of logfiles.

The logfiles are fairly large (multiple GBs), so I'm keen to do this in a reasonably performant way.

The metrics are being sent to an InfluxDB database, so it's better if I can send multiple metrics in one batch, rather than sending them individually.

Currently, I'm using the grouper() recipe from the itertools documentation to process multiple lines in "chunks" - I then send the collected points to the database:

    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(fillvalue=fillvalue, *args)

    with open(args.input_file, 'r') as f:
        line_counter = 0
        for chunk in grouper(f, args.batch_size):
            json_points = []
            for line in chunk:
                if line is None:
                    break  # the last chunk is padded with the fill value
                line_counter += 1
                # Do some processing
                json_points.append(some_metrics)
            if json_points:
                write_points(logger, client, json_points, line_counter)

However, not every line will produce metrics - so I'm batching on the number of input lines, rather than on the items I send to the database.

My question is: would it make sense to simply have a json_points list that accumulates metrics, check its size on each iteration, and send the points off when it reaches a certain size? E.g.:

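    BATCH_SIZE = 1000

    with open(args.input_file, 'r') as f:
        json_points = []
        line_number = 0
        for line_number, line in enumerate(f, start=1):
            # Do some processing
            json_points.append(some_metrics)
            if len(json_points) >= BATCH_SIZE:
                write_points(logger, client, json_points, line_number)
                json_points = []
        if json_points:
            # Send whatever is left over after the last full batch
            write_points(logger, client, json_points, line_number)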
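
A rough sketch of the same idea as a generator, which batches on the metrics produced rather than on the lines read and also yields the final partial batch. The names parse_line and batch_metrics are hypothetical stand-ins for the real processing step, and the "name value" log format is assumed purely for illustration:

```python
import io

def parse_line(line):
    """Hypothetical parser: return a metric dict, or None if the line has no metric."""
    parts = line.split()
    if len(parts) != 2 or not parts[1].isdigit():
        return None
    return {"name": parts[0], "value": int(parts[1])}

def batch_metrics(lines, batch_size):
    """Yield lists of at most batch_size metrics, counting metrics rather than lines."""
    batch = []
    for line in lines:
        metric = parse_line(line)
        if metric is None:
            continue  # not every line produces a metric
        batch.append(metric)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Usage with an in-memory file; a real run would pass the open logfile instead.
sample = io.StringIO("cpu 10\nnoise line here\nmem 20\ncpu 30\n\ndisk 40\nmem 50\n")
batches = list(batch_metrics(sample, batch_size=2))
```

Each yielded batch could then be handed to write_points; keeping the parsing and the batching in separate functions also makes each one easy to time on its own.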

Also, I originally used grouper because I thought it better to process lines in batches, rather than individually. However, is there actually any throughput advantage from doing it this way in Python? Or is there a better way of getting better throughput?

We can assume for now that the CPU load of the processing is fairly light (mainly string splitting, and date parsing).



#96669

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-09-16 19:40 +1000
Message-ID: <mailman.627.1442396416.8327.python-list@python.org>
In reply to: #96667
On Wed, Sep 16, 2015 at 7:27 PM, Victor Hooi <victorhooi@gmail.com> wrote:
> Also, I originally used grouper because I thought it better to process lines in batches, rather than individually. However, is there actually any throughput advantage from doing it this way in Python? Or is there a better way of getting better throughput?
>

I very much doubt it'll improve throughput; what you're doing there is
reading individual lines, then batching them up into blocks of 1000,
and then stepping through the batches. In terms of disk read
performance, you're already covered, because the file object should be
buffered; if you're not doing much actual work in Python, that's
probably where your bottleneck is. But keep in mind the basic rules of
performance optimization:

1) Don't.
2) For experts only: Don't yet.
3) Measure first.

If you remember only the first rule, you're going to be correct most
of the time. Write your code to be idiomatic and clean, and *don't
worry* about performance. The second rule comes into play once you
have a fully working program, and you find that it's running too
slowly. (For example, you run "cat filename >/dev/null" and it takes
half a second, but you run your program on the same input file and it
takes half a day.) Okay, so you know your program needs some work. But
which parts of it are actually taking the time? If you just stare at
your code and make a guess, *you will be wrong*. So you follow the
third rule: Add a boatload of timing marks to the code. They'll slow
it down, of course, but you'll usually find that large slabs of the
code are so fast you can't even measure the time they're taking, so
there's no point optimizing them in any way. Only once you've proven
(a) that your program is "too slow" (for some measure of "slow"), and
(b) that it's _this part_ that's taking the bulk of the time, *then*
you can start improving performance.
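
As a minimal sketch of what such timing marks might look like, the snippet below accumulates wall-clock time per stage with time.perf_counter. The parse and send functions, the stage names, and the in-memory log are all hypothetical placeholders for the real processing and database write; time spent reading lines from the file is not captured by these two marks.

```python
import io
import time
from collections import defaultdict

timings = defaultdict(float)  # seconds accumulated per named stage

def timed(stage, func, *args):
    """Call func(*args) and add its elapsed wall-clock time to timings[stage]."""
    start = time.perf_counter()
    result = func(*args)
    timings[stage] += time.perf_counter() - start
    return result

def parse(line):
    # Hypothetical stand-in for the real string splitting and date parsing.
    return line.split()

def send(points):
    # Hypothetical stand-in for the database write; returns how many points it got.
    return len(points)

log = io.StringIO("a 1\nb 2\nc 3\n" * 1000)  # 3000 fake log lines
points = []
sent = 0
for line in log:
    points.append(timed("parse", parse, line))
    if len(points) >= 500:
        sent += timed("send", send, points)
        points = []
if points:
    sent += timed("send", send, points)

# Report the slowest stage first.
for stage, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage}: {seconds:.6f}s")
```

Whichever stage dominates the report is the one worth looking at; the rest can be left alone.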

So get rid of the grouper; it's violating all three rules. Give the
program a try without it, and see if you actually have a problem at
all. Maybe you don't!
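
One way to check this on your own machine, sketched with timeit over an in-memory list of made-up lines (so it measures only the looping overhead, not disk reads or database writes). The function names and the sample data are assumptions for illustration; no particular result is implied.

```python
import timeit
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Same recipe as in the original post.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Made-up sample data standing in for logfile lines.
lines = [f"host{i} {i}\n" for i in range(100_000)]

def with_grouper():
    count = 0
    for chunk in grouper(lines, 1000):
        for line in chunk:
            if line is not None:  # skip padding in the last chunk
                count += len(line.split())
    return count

def without_grouper():
    count = 0
    for line in lines:
        count += len(line.split())
    return count

t_grouped = timeit.timeit(with_grouper, number=5)
t_plain = timeit.timeit(without_grouper, number=5)
print(f"grouper: {t_grouped:.3f}s  plain loop: {t_plain:.3f}s")
```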

ChrisA
