| Started by | Victor Hooi <victorhooi@gmail.com> |
|---|---|
| First post | 2015-09-16 02:27 -0700 |
| Last post | 2015-09-16 19:40 +1000 |
| Articles | 2 — 2 participants |
Reading in large logfiles, and processing lines in batches - maximising throughput? Victor Hooi <victorhooi@gmail.com> - 2015-09-16 02:27 -0700
Re: Reading in large logfiles, and processing lines in batches - maximising throughput? Chris Angelico <rosuav@gmail.com> - 2015-09-16 19:40 +1000
| From | Victor Hooi <victorhooi@gmail.com> |
|---|---|
| Date | 2015-09-16 02:27 -0700 |
| Subject | Reading in large logfiles, and processing lines in batches - maximising throughput? |
| Message-ID | <c18cdeb3-58f7-4dc4-82e7-b45b34f1c813@googlegroups.com> |
I'm using Python to parse metrics out of logfiles.
The logfiles are fairly large (multiple GBs), so I'm keen to do this in a reasonably performant way.
The metrics are being sent to an InfluxDB database - so it's better if I can batch multiple metrics into a batch, rather than sending them individually.
Currently, I'm using the grouper() recipe from the itertools documentation to process multiple lines in "chunks" - I then send the collected points to the database:
    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)

    with open(args.input_file, 'r') as f:
        line_counter = 0
        for chunk in grouper(f, args.batch_size):
            json_points = []
            for line in chunk:
                if line is None:
                    # Padding added by grouper() to fill the last chunk
                    continue
                line_counter += 1
                # Do some processing
                json_points.append(some_metrics)
            if json_points:
                write_points(logger, client, json_points, line_counter)
However, not every line will produce metrics - so I'm batching on the number of input lines, rather than on the items I send to the database.
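To make the effect concrete, here is a minimal sketch of how batching on input lines gives uneven database batches. The `parse` function and the sample lines are hypothetical stand-ins for the real log processing; only `grouper()` comes from the post.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def parse(line):
    # Hypothetical parser: only lines starting with "metric" yield a point.
    return {"raw": line} if line.startswith("metric") else None


lines = ["metric a", "metric b", "noise", "noise", "metric c"]

batch_sizes = []
for chunk in grouper(lines, 2):
    # Skip the None padding in the final chunk, then drop lines with no metric.
    parsed = (parse(line) for line in chunk if line is not None)
    points = [p for p in parsed if p is not None]
    batch_sizes.append(len(points))

print(batch_sizes)  # [2, 0, 1]
```

Each chunk holds two input lines, yet the number of points per chunk varies between zero and two, so the size of each write depends on how dense the metrics are in that stretch of the file.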
My question is, would it make sense to simply have a json_points list that accumulates metrics, check its size on each iteration, and send the points off when it reaches a certain size? E.g.:
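    BATCH_SIZE = 1000
    with open(args.input_file, 'r') as f:
        json_points = []
        # Count lines from 1, matching line_counter in the first version
        for line_number, line in enumerate(f, start=1):
            # Do some processing
            json_points.append(some_metrics)
            if len(json_points) >= BATCH_SIZE:
                write_points(logger, client, json_points, line_number)
                json_points = []
        if json_points:
            # Send whatever is left after the last full batch
            write_points(logger, client, json_points, line_number)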
Also, I originally used grouper because I thought it better to process lines in batches, rather than individually. However, is there actually any throughput advantage from doing it this way in Python? Or is there a better way of getting better throughput?
We can assume for now that the CPU load of the processing is fairly light (mainly string splitting, and date parsing).
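One possible way to package the accumulate-and-flush idea is a small generator that yields batches of points rather than batches of lines. This is only a sketch: `batched_points` and `parse_line` are hypothetical names, and the real code would call `write_points` on each yielded batch.

```python
def batched_points(lines, parse, batch_size):
    """Yield lists of up to batch_size points, skipping lines with no metric."""
    batch = []
    for line in lines:
        point = parse(line)
        if point is None:
            continue  # this line produced no metric
        batch.append(point)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Final partial batch, so no points are lost at end of file.
        yield batch


def parse_line(line):
    # Hypothetical parser: "metric N" yields N, anything else yields nothing.
    return int(line.split()[1]) if line.startswith("metric") else None


lines = ["metric 1", "noise", "metric 2", "metric 3",
         "noise", "metric 4", "metric 5"]
batches = list(batched_points(lines, parse_line, 2))
print(batches)  # [[1, 2], [3, 4], [5]]
```

Because a file object is itself an iterable of lines, the same generator could be fed `f` directly inside the `with open(...)` block.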
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-09-16 19:40 +1000 |
| Message-ID | <mailman.627.1442396416.8327.python-list@python.org> |
| In reply to | #96667 |
On Wed, Sep 16, 2015 at 7:27 PM, Victor Hooi <victorhooi@gmail.com> wrote:
> Also, I originally used grouper because I thought it better to process lines in batches, rather than individually. However, is there actually any throughput advantage from doing it this way in Python? Or is there a better way of getting better throughput?

I very much doubt it'll improve throughput; what you're doing there is reading individual lines, then batching them up into blocks of 1000, and then stepping through the batches. In terms of disk read performance, you're already covered, because the file object should be buffered; if you're not doing much actual work in Python, that's probably where your bottleneck is.

But keep in mind the basic rules of performance optimization:

1) Don't.
2) For experts only: Don't yet.
3) Measure first.

If you remember only the first rule, you're going to be correct most of the time. Write your code to be idiomatic and clean, and *don't worry* about performance. The second rule comes into play once you have a fully working program, and you find that it's running too slowly. (For example, you run "cat filename >/dev/null" and it takes half a second, but you run your program on the same input file and it takes half a day.) Okay, so you know your program needs some work. But which parts of it are actually taking the time? If you just stare at your code and make a guess, *you will be wrong*. So you follow the third rule: Add a boatload of timing marks to the code. They'll slow it down, of course, but you'll usually find that large slabs of the code are so fast you can't even measure the time they're taking, so there's no point optimizing them in any way.

Only once you've proven (a) that your program is "too slow" (for some measure of "slow"), and (b) that it's _this part_ that's taking the bulk of the time, *then* you can start improving performance.

So get rid of the grouper; it's violating all three rules. Give the program a try without it, and see if you actually have a problem at all. Maybe you don't!

ChrisA
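As a rough illustration of the "timing marks" idea, the sketch below accumulates wall-clock time per stage with `time.perf_counter()`. The `timed` helper, `parse_line`, `send_batch`, and the generated sample lines are all hypothetical; `send_batch` only records batch sizes instead of talking to a database.

```python
import time
from collections import defaultdict

timings = defaultdict(float)  # stage label -> total seconds spent


def timed(label, func, *args):
    """Call func(*args) and add its elapsed time to timings[label]."""
    start = time.perf_counter()
    result = func(*args)
    timings[label] += time.perf_counter() - start
    return result


def parse_line(line):
    # Hypothetical processing: split a "DATE metric=N" line.
    date_part, metric = line.split(" ", 1)
    return {"date": date_part, "value": int(metric.split("=")[1])}


sent = []


def send_batch(points):
    # Hypothetical stand-in for the database write.
    sent.append(len(points))


lines = ["2015-09-16 metric=%d" % i for i in range(1000)]
batch = []
for line in lines:
    batch.append(timed("parse", parse_line, line))
    if len(batch) >= 100:
        timed("send", send_batch, batch)
        batch = []
if batch:
    timed("send", send_batch, batch)

# Report the slowest stage first.
for label, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{label:>6}: {seconds:.6f}s")
```

For a per-function breakdown without hand-placed marks, the standard library's `cProfile` module is another option.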