
Parsing logfile with multi-line loglines, separated by timestamp?

Started by	Victor Hooi <victorhooi@gmail.com>
First post	2015-06-30 08:24 -0700
Last post	2015-07-01 15:03 +1000
Articles	4 — 2 participants



#93330 — Parsing logfile with multi-line loglines, separated by timestamp?

From	Victor Hooi <victorhooi@gmail.com>
Date	2015-06-30 08:24 -0700
Subject	Parsing logfile with multi-line loglines, separated by timestamp?
Message-ID	<b8916490-20f3-4070-86dc-821adadd895b@googlegroups.com>

Hi,

I'm trying to parse iostat -xt output using Python. The quirk with iostat is that the output for each second runs over multiple lines. For example:

06/30/2015 03:09:17 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.03    0.00    0.00   99.94

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00     0.04    0.02    0.07     0.30     3.28    81.37     0.00   29.83    2.74   38.30   0.47   0.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00    11.62     0.00    0.23    0.19    2.13   0.16   0.00
xvdf              0.00     0.00    0.00    0.00     0.00     0.00    10.29     0.00    0.41    0.41    0.73   0.38   0.00
xvdg              0.00     0.00    0.00    0.00     0.00     0.00     9.12     0.00    0.36    0.35    1.20   0.34   0.00
xvdh              0.00     0.00    0.00    0.00     0.00     0.00    33.35     0.00    1.39    0.41    8.91   0.39   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00    11.66     0.00    0.46    0.46    0.00   0.37   0.00

06/30/2015 03:09:18 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50    0.00    0.00   99.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdf              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdg              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdh              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

06/30/2015 03:09:19 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.50    0.00    0.00   99.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdf              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdg              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdh              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Essentially I need to parse the output in "chunks", where each chunk is separated by a timestamp.

I was looking at itertools.groupby(), but that doesn't seem to quite do what I want here - it seems better suited to grouping lines that share a common key, or some property that a key function can compute for each line.

Another thought was something like:

    for line in f:
        if line.count("/") == 2 and line.count(":") == 2:
            current_time = datetime.strptime(line.strip(), '%m/%d/%y %H:%M:%S')
        while line.count("/") != 2 and line.count(":") != 2:
            print(line)
            continue

But that didn't quite seem to work.

Is there a Pythonic way of parsing the above iostat output and breaking it into chunks split by the timestamp?

Cheers,
Victor


#93334

From	Chris Angelico <rosuav@gmail.com>
Date	2015-07-01 02:02 +1000
Message-ID	<mailman.192.1435680170.3674.python-list@python.org>
In reply to	#93330

On Wed, Jul 1, 2015 at 1:47 AM, Skip Montanaro <skip.montanaro@gmail.com> wrote:
> Maybe define a class which wraps a file-like object. Its next() method (or
> is it __next__() method?) can just buffer up lines starting with one which
> successfully parses as a timestamp, accumulates all the rest, until a blank
> line or EOF is seen, then return that, either as a list of strings, one
> massive string, or some higher level representation (presumably an instance
> of another class) which represents one "paragraph" of iostat output.

next() in Py2, __next__() in Py3. But I'd do it, instead, as a
generator - that takes care of all the details, and you can simply
yield useful information whenever you have it. Something like this
(untested):

import datetime

def parse_iostat(lines):
    """Parse lines of iostat information, yielding one list per timestamp

    lines should be an iterable yielding separate lines of output
    """
    block = None
    for line in lines:
        line = line.strip()
        try:
            tm = datetime.datetime.strptime(line, "%m/%d/%Y %I:%M:%S %p")
        except ValueError:
            # It's not a new timestamp, so add it to the existing block
            # (anything before the first timestamp is skipped)
            if block is not None: block.append(line)
        else:
            # New timestamp: emit the previous block and start a fresh one
            if block: yield block
            block = [tm]
    if block: yield block

This is a fairly classic line-parsing generator. You can pass it a
file-like object, a list of strings, or anything else that it can
iterate over; it'll yield some sort of aggregate object representing
each time's block. In this case, all it does is append strings to a
list, so this will result in a series of lists, each one starting with
a datetime and followed by the stripped lines for that timestamp; you
can parse the other lines in any way you like and aggregate useful
data. Usage would be something like this:

with open("logfile") as f:
    for block in parse_iostat(f):
        pass  # do stuff with block

This will work quite happily with an ongoing stream, too, so if you're
working with a pipe from a currently-running process, it'll pick stuff
up just fine. (However, since it uses the timestamp as its signature,
it won't yield anything till it gets the *next* timestamp. If the
blank line is sufficient to denote the end of a block, you could
change the loop to look for that instead.)
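As a rough sketch of that variation (not from the thread, and not checked against real iostat output): the sample shows a blank line inside each block too, between the avg-cpu values and the device table, so a blank line alone is ambiguous. The version below assumes the block ends at the first blank line after a line starting with "Device:". The name parse_iostat_eager and the in_devices flag are made up for illustration.

```python
import datetime

TS_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def parse_iostat_eager(lines):
    """Yield each block once the blank line after its device table is seen."""
    block = None
    in_devices = False  # True once the "Device:" header has been read
    for line in lines:
        line = line.strip()
        try:
            tm = datetime.datetime.strptime(line, TS_FORMAT)
        except ValueError:
            pass
        else:
            if block:
                yield block  # previous block never saw its closing blank line
            block, in_devices = [tm], False
            continue
        if block is None:
            continue  # nothing to attach to yet (e.g. a banner line)
        if not line:
            if in_devices:
                # Assumption: a blank line after the device table ends the block
                yield block
                block, in_devices = None, False
            continue  # blank lines are not stored
        if line.startswith("Device:"):
            in_devices = True
        block.append(line)
    if block:
        yield block  # input ended without a trailing blank line
```

Unlike the timestamp-delimited loop, this hands over a block without waiting for the next timestamp, which matters mostly when reading from a live pipe.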

Hope that helps!

ChrisA


#93356

From	Victor Hooi <victorhooi@gmail.com>
Date	2015-06-30 21:06 -0700
Message-ID	<51f65e41-76e9-48c4-8f79-ba4ac060bbe3@googlegroups.com>
In reply to	#93334

Aha, cool, that's a good idea =) - it seems I should spend some time getting to know generators/iterators.

Also, sorry if this is basic, but once I have the "block" list itself, what is the best way to parse each relevant line?

In this case, the first line is a timestamp, the next two lines are system stats, and then a newline, and then one line for each block device.

I could just hardcode in the lines, but that seems ugly:

  for block in parse_iostat(f):
      for i, line in enumerate(block):
          if i == 0:
              print("timestamp is {}".format(line))
          elif i == 1 or i == 2:
              print("system stats: {}".format(line))
          elif i >= 4:
              print("disk stats: {}".format(line))

Is there a prettier or more Pythonic way of doing this?

Thanks,
Victor



#93357

From	Chris Angelico <rosuav@gmail.com>
Date	2015-07-01 15:03 +1000
Message-ID	<mailman.205.1435726989.3674.python-list@python.org>
In reply to	#93356

On Wed, Jul 1, 2015 at 2:06 PM, Victor Hooi <victorhooi@gmail.com> wrote:
> Aha, cool, that's a good idea =) - it seems I should spend some time getting to know generators/iterators.
>
> Also, sorry if this is basic, but once I have the "block" list itself, what is the best way to parse each relevant line?
>
> In this case, the first line is a timestamp, the next two lines are system stats, and then a newline, and then one line for each block device.
>
> I could just hardcode in the lines, but that seems ugly:
>
>   for block in parse_iostat(f):
>       for i, line in enumerate(block):
>           if i == 0:
>               print("timestamp is {}".format(line))
>           elif i == 1 or i == 2:
>               print("system stats: {}".format(line))
>           elif i >= 4:
>               print("disk stats: {}".format(line))
>
> Is there a prettier or more Pythonic way of doing this?

This is where you get into the nitty-gritty of writing a text parser.
Most of the work is in figuring out exactly what pieces of information
matter to you. I recommend putting most of the work into the
parse_iostat() function, and then yielding some really nice tidy
package that can be interpreted conveniently.
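One way that "tidy package" might look, as a hedged sketch rather than anything posted in the thread: the generator below yields one dict per timestamp, with the avg-cpu values and the per-device values keyed by the column names taken from the header lines. The function name parse_iostat_records and the dict layout are assumptions for illustration, and it relies on the header lines starting with "avg-cpu:" and "Device:" as in the sample above (other iostat versions may format these differently).

```python
import datetime

def parse_iostat_records(lines):
    """Yield one dict per timestamp: {"time": ..., "cpu": {...}, "devices": {...}}."""
    record = None
    headers = None   # column names of the table currently being read
    section = None   # "cpu", "devices", or None
    for line in lines:
        line = line.strip()
        try:
            tm = datetime.datetime.strptime(line, "%m/%d/%Y %I:%M:%S %p")
        except ValueError:
            pass
        else:
            if record:
                yield record
            record = {"time": tm, "cpu": {}, "devices": {}}
            section = None
            continue
        if record is None or not line:
            continue  # skip blank lines and anything before the first timestamp
        fields = line.split()
        if fields[0] == "avg-cpu:":
            section, headers = "cpu", fields[1:]
        elif fields[0] == "Device:":
            section, headers = "devices", fields[1:]
        elif section == "cpu":
            record["cpu"] = dict(zip(headers, map(float, fields)))
        elif section == "devices":
            # First field is the device name, the rest line up with the headers
            record["devices"][fields[0]] = dict(zip(headers, map(float, fields[1:])))
    if record:
        yield record
```

The caller can then write something like record["devices"]["xvdb"]["%util"] instead of counting line positions within a block.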

ChrisA

