Groups > comp.lang.python > #5510 > unrolled thread

Re: Convert AWK regex to Python

Started by	J <jnr.gonzalez@googlemail.com>
First post	2011-05-16 03:57 -0700
Last post	2011-05-16 07:01 -0700
Articles	4 — 3 participants


  Re: Convert AWK regex to Python J <jnr.gonzalez@googlemail.com> - 2011-05-16 03:57 -0700
    Re: Convert AWK regex to Python Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-16 11:29 +0000
    Re: Convert AWK regex to Python Peter Otten <__peter__@web.de> - 2011-05-16 13:36 +0200
      Re: Convert AWK regex to Python J <jnr.gonzalez@googlemail.com> - 2011-05-16 07:01 -0700

#5510 — Re: Convert AWK regex to Python

From	J <jnr.gonzalez@googlemail.com>
Date	2011-05-16 03:57 -0700
Subject	Re: Convert AWK regex to Python
Message-ID	<e3284e39-1d82-4016-b170-eeda342f3701@glegroupsg2000goo.googlegroups.com>

Hello Peter, Angelico,

OK, let's see. My aim is to filter out several fields from a log file and write them to a new log file.  The current log file, as I mentioned previously, has thousands of lines like this:
2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >

All the lines in the log file are similar and they all have the same number of fields.  Most of the fields are separated by spaces, except for a couple of them which I am processing with AWK (removing "<G_" from the string, for example).  So in essence what I want to do is evaluate each line in the log file and break it down into fields which I can call individually and write to a new log file (for example, selecting only fields 1, 2 and 3).

I hope this is clearer now.

Regards,

Junior


#5512

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-05-16 11:29 +0000
Message-ID	<4dd10a8a$0$29996$c3e8da3$5496439d@news.astraweb.com>
In reply to	#5510

On Mon, 16 May 2011 03:57:49 -0700, J wrote:

> Most of the fields are separated by
> spaces, except for a couple of them which I am processing with AWK
> (removing "<G_" from the string, for example).  So in essence what I want
> to do is evaluate each line in the log file and break it down into
> fields which I can call individually and write to a new log file
> (for example, selecting only fields 1, 2 and 3).

fields = line.split(' ')
output.write(fields[1] + ' ')
output.write(fields[2] + ' ')
output.write(fields[3] + '\n')
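
A self-contained sketch of the same idea, assuming Python 3. The helper names select_fields and strip_prefix and the shortened sample line are made up for illustration. Note that AWK numbers fields from 1 ($1 is the first field) while Python lists start at 0, so the sketch shifts each AWK-style number down by one.

```python
def select_fields(line, awk_numbers):
    """Return the requested whitespace-separated fields of one log line.

    awk_numbers uses AWK-style 1-based numbering ($1 is the first field),
    so each number is shifted down by one for Python's 0-based lists.
    """
    fields = line.split()  # like AWK's default: split on runs of whitespace
    return [fields[n - 1] for n in awk_numbers]


def strip_prefix(field, prefix="<G_"):
    """Drop a leading prefix such as "<G_" if it is present."""
    return field[len(prefix):] if field.startswith(prefix) else field


# Shortened, illustrative version of the log line from the original post.
sample = ("2011-05-16 09:46:22,361 [Thread-4847133] PDU D "
          "<G_CC_SMS_SERVICE_51408_656.O_ rest of the line")

print(" ".join(select_fields(sample, [1, 2, 3])))
print(strip_prefix(select_fields(sample, [6])[0]))
```

To process a whole file, the same two helpers could be called for each line of the input while writing the joined result to the output file.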



-- 
Steven


#5513

From	Peter Otten <__peter__@web.de>
Date	2011-05-16 13:36 +0200
Message-ID	<iqr24q$78k$1@solani.org>
In reply to	#5510

J wrote:

> Hello Peter, Angelico,
> 
> OK, let's see. My aim is to filter out several fields from a log file and
> write them to a new log file.  The current log file, as I mentioned
> previously, has thousands of lines like this: 2011-05-16 09:46:22,361
> [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_
> CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX
> - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004
> Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >
> 
> All the lines in the log file are similar and they all have the same
> number of fields.  Most of the fields are separated by spaces, except for
> a couple of them which I am processing with AWK (removing "<G_" from the
> string, for example).  So in essence what I want to do is evaluate each
> line in the log file and break it down into fields which I can call
> individually and write to a new log file (for example, selecting only
> fields 1, 2 and 3).
> 
> I hope this is clearer now

Not much :( 

It doesn't really matter whether there are 100, 1000, or a million lines in 
the file; the important information is the structure of the file. You may be 
able to get away with a quick and dirty script consisting of just a few 
regular expressions, e.g.:

import re

filename = ...

def get_service(line):
    return re.compile(r"[(](\w+)").search(line).group(1)

def get_command(line):
    return re.compile(r"<G_(\w+)").search(line).group(1)

def get_status(line):
    return re.compile(r"Status:\s+(\d+)").search(line).group(1)

with open(filename) as infile:
    for line in infile:
        print(get_service(line), get_command(line), get_status(line))

but there is no guarantee that there isn't data in your file that breaks the 
implied assumptions. Also, from the shell hackery it looks like your 
ultimate goal is a kind of frequency table, which could be built 
along these lines:

freq = {}
with open(filename) as infile:
    for line in infile:
        service = get_service(line)
        command = get_command(line)
        status = get_status(line)
        key = command, service, status
        freq[key] = freq.get(key, 0) + 1

for key, occurrences in sorted(freq.items()):
    print("Service: {}, Command: {}, Status: {}, Occurrences: {}".format(
        *key + (occurrences,)))
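
Since a line that breaks the implied assumptions makes search() return None, and the .group(1) call then raises AttributeError, a slightly more defensive variant may be worth having. The following is only a sketch, assuming Python 3: the patterns are copied from the functions above, while parse_line, count_lines and the two sample lines are hypothetical names and data chosen for illustration. It uses collections.Counter from the standard library in place of the plain dict.

```python
import re
from collections import Counter

# Patterns copied from the functions above, compiled once up front.
SERVICE = re.compile(r"[(](\w+)")
COMMAND = re.compile(r"<G_(\w+)")
STATUS = re.compile(r"Status:\s+(\d+)")


def parse_line(line):
    """Return (command, service, status), or None if any pattern fails."""
    matches = [p.search(line) for p in (COMMAND, SERVICE, STATUS)]
    if not all(matches):  # search() returns None when nothing matches
        return None
    return tuple(m.group(1) for m in matches)


def count_lines(lines):
    """Tally parsed keys and count the lines that did not match."""
    freq = Counter()
    skipped = 0
    for line in lines:
        key = parse_line(line)
        if key is None:
            skipped += 1
        else:
            freq[key] += 1
    return freq, skipped


# Illustrative input: one shortened log line and one malformed line.
sample = [
    "PDU D <G_CC_SMS_SERVICE_51408_656.O_ - OUT - "
    "(submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) (opt: ) ) >",
    "a line that does not follow the expected layout",
]

freq, skipped = count_lines(sample)
for key, occurrences in sorted(freq.items()):
    print(key, occurrences)
print("skipped:", skipped)
```

Counting the skipped lines instead of silently dropping them gives a quick check on how well the regular expressions fit the real file.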


#5518

From	J <jnr.gonzalez@googlemail.com>
Date	2011-05-16 07:01 -0700
Message-ID	<ecf27f48-26d2-44d1-851e-ed3cde2590b4@e17g2000prj.googlegroups.com>
In reply to	#5513

Thanks for the suggestions, Peter. I will give them a try.

