
Python re to extract useful information from each line

Started by	Kashif Rana <kashifrana84@gmail.com>
First post	2015-04-29 13:42 -0700
Last post	2015-08-19 12:53 -0700
Articles	8 — 7 participants


  Python re to extract useful information from each line Kashif Rana <kashifrana84@gmail.com> - 2015-04-29 13:42 -0700
    Re: Python re to extract useful information from each line Kashif Rana <kashifrana84@gmail.com> - 2015-04-29 13:49 -0700
      Re: Python re to extract useful information from each line Emile van Sebille <emile@fenx.com> - 2015-04-29 14:22 -0700
      Re: Python re to extract useful information from each line MRAB <python@mrabarnett.plus.com> - 2015-04-29 22:28 +0100
      Re: Python re to extract useful information from each line Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-04-29 22:30 +0100
      Re: Python re to extract useful information from each line Tim Chase <python.list@tim.thechases.com> - 2015-04-29 17:38 -0500
    Re: Python re to extract useful information from each line sohcahtoa82@gmail.com - 2015-04-29 16:29 -0700
    Re: Python re to extract useful information from each line Paul McGuire <ptmcg@austin.rr.com> - 2015-08-19 12:53 -0700

#89573 — Python re to extract useful information from each line

From	Kashif Rana <kashifrana84@gmail.com>
Date	2015-04-29 13:42 -0700
Subject	Python re to extract useful information from each line
Message-ID	<e5473ccc-4f7d-431d-93a7-1aeeededcbf0@googlegroups.com>

Hello Experts

I have the lines below, with some variations.

1- set policy id 1000 from "Untrust" to "Trust" "Any" "1.1.1.1" "HTTP" nat dst ip 10.10.10.10 port 8000 permit log

2- set policy id 5000 from "Trust" to "Untrust" "Any" "microsoft.com" "HTTP" nat src permit schedule "14August2014" log

3- set policy id 7000 from "Trust" to "Untrust" "Users" "Any" "ANY" nat src dip-id 4 permit log

4- set policy id 7000 from "Trust" to "Untrust" "servers" "Any" "ANY" deny

Please help me write a regular expression to extract the information shown in parentheses below, where it exists, from each line. Note that some items, such as nat or log, may or may not be present.

set policy id (id) from (from) to (to) (source) (destination) (service) nat (src or dst) (dip-id 4) or (ip 10.10.10.10) port (dst-port) (action) schedule (schedule) (log)


#89574

From	Kashif Rana <kashifrana84@gmail.com>
Date	2015-04-29 13:49 -0700
Message-ID	<220dafbc-25f0-48a7-b37a-c8a77a6f2ffa@googlegroups.com>
In reply to	#89573

On Thursday, April 30, 2015 at 12:42:18 AM UTC+4, Kashif Rana wrote:
> Hello Experts
> 
> I have the lines below, with some variations.
> 
> 1- set policy id 1000 from "Untrust" to "Trust" "Any" "1.1.1.1" "HTTP" nat dst ip 10.10.10.10 port 8000 permit log
> 
> 2- set policy id 5000 from "Trust" to "Untrust" "Any" "microsoft.com" "HTTP" nat src permit schedule "14August2014" log
> 
> 3- set policy id 7000 from "Trust" to "Untrust" "Users" "Any" "ANY" nat src dip-id 4 permit log
> 
> 4- set policy id 7000 from "Trust" to "Untrust" "servers" "Any" "ANY" deny
> 
> Please help me write a regular expression to extract the information shown in parentheses below, where it exists, from each line. Note that some items, such as nat or log, may or may not be present.
> 
> set policy id (id) from (from) to (to) (source) (destination) (service) nat (src or dst) (dip-id 4) or (ip 10.10.10.10) port (dst-port) (action) schedule (schedule) (log)

I tried the re below and it's not working.

id\s(?P<p_id>.+?)(?:\sname\s(?P<p_name>.+?))?\sfrom\s(?P<p_from>.+?)\sto\s(?P<p_to>.+?)\s{2}(?P<p_src>[^\s]+?)\s(?P<p_dst>[^\s]+?)\s(?P<p_port>[^\s]+?)(?:\s(?P<p_nat_status>nat)\s(?P<p_nat_type>\w+)(\s?P<p_nat_src_ip>dip-id\s\d+)?(\sip\s(?P<p_nat_dst_ip>[\d\.]+)\sport(?P<dst_nat_port>\d+))?)?\s(?P<p_action>[^\s]+?)(?:\sschedule\s(?P<p_schedule>[^\s]+?))?(?P<p_log_status>\slog)?$

If I ignore line 1, the re below works and gives me all the info.

pol_elements = re.compile('id\s(?P<p_id>.+?)(?:\sname\s(?P<p_name>.+?))?\sfrom\s(?P<p_from>.+?)\sto\s(?P<p_to>.+?)\s{2}(?P<p_src>[^\s]+?)\s(?P<p_dst>[^\s]+?)\s(?P<p_port>[^\s]+?)(?:(?P<p_nat_status>\snat)\s(?P<p_nat_type>[^\s]+?)(?P<p_nat_ip>\sdip-id\s[^\s]+?)?)?\s(?P<p_action>[^\s]+?)(?:\sschedule\s(?P<p_schedule>[^\s]+?))?(?P<p_log_status>\slog)?$'
)


#89578

From	Emile van Sebille <emile@fenx.com>
Date	2015-04-29 14:22 -0700
Message-ID	<mailman.98.1430342578.3680.python-list@python.org>
In reply to	#89574

On 4/29/2015 1:49 PM, Kashif Rana wrote:
> pol_elements = re.compile('id\s(?P<p_id>.+?)(?:\sname\s(?P<p_name>.+?))?\sfrom\s(?P<p_from>.+?)\sto\s(?P<p_to>.+?)\s{2}(?P<p_src>[^\s]+?)\s(?P<p_dst>[^\s]+?)\s(?P<p_port>[^\s]+?)(?:(?P<p_nat_status>\snat)\s(?P<p_nat_type>[^\s]+?)(?P<p_nat_ip>\sdip-id\s[^\s]+?)?)?\s(?P<p_action>[^\s]+?)(?:\sschedule\s(?P<p_schedule>[^\s]+?))?(?P<p_log_status>\slog)?$'
> )


... and that's why we avoid regular expressions... it makes my head hurt 
just looking at that line noise.

Emile


#89579

From	MRAB <python@mrabarnett.plus.com>
Date	2015-04-29 22:28 +0100
Message-ID	<mailman.99.1430342891.3680.python-list@python.org>
In reply to	#89574

On 2015-04-29 22:22, Emile van Sebille wrote:
> On 4/29/2015 1:49 PM, Kashif Rana wrote:
>> pol_elements = re.compile('id\s(?P<p_id>.+?)(?:\sname\s(?P<p_name>.+?))?\sfrom\s(?P<p_from>.+?)\sto\s(?P<p_to>.+?)\s{2}(?P<p_src>[^\s]+?)\s(?P<p_dst>[^\s]+?)\s(?P<p_port>[^\s]+?)(?:(?P<p_nat_status>\snat)\s(?P<p_nat_type>[^\s]+?)(?P<p_nat_ip>\sdip-id\s[^\s]+?)?)?\s(?P<p_action>[^\s]+?)(?:\sschedule\s(?P<p_schedule>[^\s]+?))?(?P<p_log_status>\slog)?$'
>> )
>
>
> ... and that's why we avoid regular expressions... it makes my head hurt
> just looking at that line noise.
>
It might just be easier to split it into a list of fields and then pick
out the ones you want:

fields = re.findall(r'"[^"]+"|\S+', line)
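A rough sketch of how the "pick out the ones you want" step might look. The function name parse_policy and the dictionary keys are made up for illustration, and the code assumes the first six values always appear in the fixed order shown in the sample lines. Stripping the quotes also means a quoted value that happens to equal a keyword (say, a schedule named "log") would be misread, so treat it as a starting point only.

```python
import re

def parse_policy(line):
    """Split a policy line into fields, then pick values by position/keyword.

    Hypothetical helper: assumes 'id N from Z to Z src dst service' always
    comes first, followed by optional keyword-driven pieces.
    """
    # Quoted strings stay together; everything else splits on whitespace.
    fields = [f.strip('"') for f in re.findall(r'"[^"]+"|\S+', line)]
    start = fields.index("id")          # skips any "1-" prefix and "set policy"
    info = {
        "id": fields[start + 1],
        "from": fields[start + 3],
        "to": fields[start + 5],
        "source": fields[start + 6],
        "destination": fields[start + 7],
        "service": fields[start + 8],
    }
    rest = iter(fields[start + 9:])     # the optional, keyword-driven tail
    for word in rest:
        if word == "nat":
            info["nat"] = next(rest)    # src or dst
        elif word in ("dip-id", "ip", "port", "schedule"):
            info[word] = next(rest)     # keyword followed by its value
        elif word in ("permit", "deny"):
            info["action"] = word
        elif word == "log":
            info["log"] = True
    return info

line1 = ('set policy id 1000 from "Untrust" to "Trust" "Any" "1.1.1.1" "HTTP" '
         'nat dst ip 10.10.10.10 port 8000 permit log')
line4 = 'set policy id 7000 from "Trust" to "Untrust" "servers" "Any" "ANY" deny'
info1 = parse_policy(line1)
info4 = parse_policy(line4)
```

Missing pieces simply never show up as keys, so info4 has no "nat" or "log" entry and callers can use info.get("nat").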


#89580

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2015-04-29 22:30 +0100
Message-ID	<mailman.100.1430343053.3680.python-list@python.org>
In reply to	#89574

On 29/04/2015 22:22, Emile van Sebille wrote:
> On 4/29/2015 1:49 PM, Kashif Rana wrote:
>> pol_elements =
>> re.compile('id\s(?P<p_id>.+?)(?:\sname\s(?P<p_name>.+?))?\sfrom\s(?P<p_from>.+?)\sto\s(?P<p_to>.+?)\s{2}(?P<p_src>[^\s]+?)\s(?P<p_dst>[^\s]+?)\s(?P<p_port>[^\s]+?)(?:(?P<p_nat_status>\snat)\s(?P<p_nat_type>[^\s]+?)(?P<p_nat_ip>\sdip-id\s[^\s]+?)?)?\s(?P<p_action>[^\s]+?)(?:\sschedule\s(?P<p_schedule>[^\s]+?))?(?P<p_log_status>\slog)?$'
>>
>> )
>
>
> ... and that's why we avoid regular expressions... it makes my head hurt
> just looking at that line noise.
>
> Emile
>

Great minds think alike :)

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


#89586

From	Tim Chase <python.list@tim.thechases.com>
Date	2015-04-29 17:38 -0500
Message-ID	<mailman.105.1430349368.3680.python-list@python.org>
In reply to	#89574

On 2015-04-29 14:22, Emile van Sebille wrote:
> On 4/29/2015 1:49 PM, Kashif Rana wrote:
> > pol_elements =
> > re.compile('id\s(?P<p_id>.+?)(?:\sname\s(?P<p_name>.+?))?\sfrom\s(?P<p_from>.+?)\sto\s(?P<p_to>.+?)\s{2}(?P<p_src>[^\s]+?)\s(?P<p_dst>[^\s]+?)\s(?P<p_port>[^\s]+?)(?:(?P<p_nat_status>\snat)\s(?P<p_nat_type>[^\s]+?)(?P<p_nat_ip>\sdip-id\s[^\s]+?)?)?\s(?P<p_action>[^\s]+?)(?:\sschedule\s(?P<p_schedule>[^\s]+?))?(?P<p_log_status>\slog)?$'
> > )
> 
> ... and that's why we avoid regular expressions... it makes my head
> hurt just looking at that line noise.

First, it appears the OP isn't using raw strings, which means those
backslashes are just asking for trouble.
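A small illustration of the raw-string point, using a made-up pattern rather than the OP's. In a normal string literal Python itself interprets some backslash sequences before re ever sees them, so '\b' becomes a backspace character instead of the regex word-boundary. Sequences like \s happen to survive only because Python leaves unknown escapes alone, and newer Python versions warn about that.

```python
import re

text = "nat src dip-id 4"

# Without the r prefix, '\b' is the single backspace character (0x08),
# so this pattern looks for backspace + "dip-id" + backspace.
non_raw = re.compile('\bdip-id\b')

# With the r prefix, the backslashes reach the re module untouched and
# \b means "word boundary" as intended.
raw = re.compile(r'\bdip-id\b')

non_raw_match = non_raw.search(text)   # None: there is no backspace in text
raw_match = raw.search(text)           # matches "dip-id"
```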

That said, it would be a lot better if the OP made use of re.VERBOSE
to put each component on its own line:

 pol_elements = re.compile(r"""
   id
   \s
  (?P<p_id>.+?)
  (?:
    \s
    name
    \s
    (?P<p_name>.+?)
    )?
  \s
  from
  \s
  (?P<p_from>.+?)
  \s
  to
  \s
  (?P<p_to>.+?)
  \s+
  (?P<p_src>[^\s]+?)
  \s
  (?P<p_dst>[^\s]+?)
  \s(?P<p_port>[^\s]+?)
  (?:
    \s
    (?P<p_nat_status>nat)
    \s
    (?P<p_nat_type>\w+)
    (?:
      \s
      (?P<p_nat_src_ip>dip-id\s\d+)
      )?
    (?:
      \s
      ip
      \s
      (?P<p_nat_dst_ip>[\d\.]+)
      \s
      port
      \s
      (?P<dst_nat_port>\d+)
      )?
    )?
  \s
  (?P<p_action>[^\s]+?)
  (?:
    \s
    schedule
    \s
    (?P<p_schedule>[^\s]+?)
    )?
  (?P<p_log_status>\slog)?
  $
   """, re.VERBOSE)

which, with some copious comments in the expression, would make it
almost readable.
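For what it's worth, here is one way a commented re.VERBOSE version might look. This is a sketch and not the OP's exact expression: it assumes fields are separated by one or more whitespace characters, that quoted values contain no embedded double quotes, and that the action is always permit or deny. The group names are mostly made up to mirror the OP's.

```python
import re

POLICY_RE = re.compile(r"""
    id \s+ (?P<p_id>\d+)                      # numeric policy id
    \s+ from \s+ "(?P<p_from>[^"]*)"          # source zone
    \s+ to   \s+ "(?P<p_to>[^"]*)"            # destination zone
    \s+ "(?P<p_src>[^"]*)"                    # source address
    \s+ "(?P<p_dst>[^"]*)"                    # destination address
    \s+ "(?P<p_service>[^"]*)"                # service
    (?:                                       # optional NAT clause
        \s+ nat \s+ (?P<p_nat_type>src|dst)
        (?: \s+ dip-id \s+ (?P<p_dip_id>\d+) )?
        (?: \s+ ip \s+ (?P<p_nat_ip>[\d.]+)
            \s+ port \s+ (?P<p_nat_port>\d+) )?
    )?
    \s+ (?P<p_action>permit|deny)             # action (assumed vocabulary)
    (?: \s+ schedule \s+ "(?P<p_schedule>[^"]*)" )?
    (?: \s+ (?P<p_log>log) )?                 # present only when logging is on
    \s* $
""", re.VERBOSE)

samples = [
    '1- set policy id 1000 from "Untrust" to "Trust" "Any" "1.1.1.1" "HTTP" nat dst ip 10.10.10.10 port 8000 permit log',
    '2- set policy id 5000 from "Trust" to "Untrust" "Any" "microsoft.com" "HTTP" nat src permit schedule "14August2014" log',
    '3- set policy id 7000 from "Trust" to "Untrust" "Users" "Any" "ANY" nat src dip-id 4 permit log',
    '4- set policy id 7000 from "Trust" to "Untrust" "servers" "Any" "ANY" deny',
]

# search() rather than match() so the "1- " prefix and "set policy" are skipped.
parsed = [POLICY_RE.search(s).groupdict() for s in samples]
```

Groups that did not take part in the match come back as None in groupdict(), which is how the missing nat, schedule and log pieces show up.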

Alternatively, switch to an actual parser like pyparsing.

-tkc


#89587

From	sohcahtoa82@gmail.com
Date	2015-04-29 16:29 -0700
Message-ID	<fdc9aa77-9c75-4fff-8722-5c8a8057ca13@googlegroups.com>
In reply to	#89573

On Wednesday, April 29, 2015 at 1:42:18 PM UTC-7, Kashif Rana wrote:
> Hello Experts
> 
> I have the lines below, with some variations.
> 
> 1- set policy id 1000 from "Untrust" to "Trust" "Any" "1.1.1.1" "HTTP" nat dst ip 10.10.10.10 port 8000 permit log
> 
> 2- set policy id 5000 from "Trust" to "Untrust" "Any" "microsoft.com" "HTTP" nat src permit schedule "14August2014" log
> 
> 3- set policy id 7000 from "Trust" to "Untrust" "Users" "Any" "ANY" nat src dip-id 4 permit log
> 
> 4- set policy id 7000 from "Trust" to "Untrust" "servers" "Any" "ANY" deny
> 
> Please help me write a regular expression to extract the information shown in parentheses below, where it exists, from each line. Note that some items, such as nat or log, may or may not be present.
> 
> set policy id (id) from (from) to (to) (source) (destination) (service) nat (src or dst) (dip-id 4) or (ip 10.10.10.10) port (dst-port) (action) schedule (schedule) (log)

If you don't have to worry about spaces in your strings, I'd just use split().  If you DO need to worry about spaces, it'd be trivial to write your own parser that stepped through the string a single character at a time.  The shlex module does this, but might not work for you.  I don't know how it would handle an IP address.
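A quick check of the split() versus shlex question, as a sketch. The second sample line with "Web Servers" is invented here to show a quoted value containing a space. With shlex.split() in its default POSIX mode, tokens are split on whitespace only, so a dotted IP address stays in one piece and the surrounding quotes are removed.

```python
import shlex

line = ('set policy id 1000 from "Untrust" to "Trust" "Any" "1.1.1.1" "HTTP" '
        'nat dst ip 10.10.10.10 port 8000 permit log')

plain_tokens = line.split()        # quotes are kept as part of each token
shlex_tokens = shlex.split(line)   # quotes are stripped, IP address left intact

# Hypothetical line with a space inside a quoted name: str.split() breaks
# the name in two, shlex.split() keeps it together.
spaced = 'set policy id 1 from "Trust" to "Untrust" "Web Servers" "Any" "ANY" deny'
plain_spaced = spaced.split()
shlex_spaced = shlex.split(spaced)
```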


#95497

From	Paul McGuire <ptmcg@austin.rr.com>
Date	2015-08-19 12:53 -0700
Message-ID	<a459bee9-e3ed-4caf-a6dd-67823e818f3d@googlegroups.com>
In reply to	#89573

Here is a first shot at a pyparsing parser for these lines:

from pyparsing import *
SET,POLICY,ID,FROM,TO,NAT,SRC,DST,IP,PORT,SCHEDULE,LOG,PERMIT,ALLOW,DENY = map(CaselessKeyword,
    "SET,POLICY,ID,FROM,TO,NAT,SRC,DST,IP,PORT,SCHEDULE,LOG,PERMIT,ALLOW,DENY".split(','))

integer = Word(nums)
ipAddr = Combine(integer + ('.'+integer)*3)
quotedString.setParseAction(removeQuotes)

logParser = (SET + POLICY + ID + integer("id") + 
             FROM + quotedString("from_") + 
             TO + quotedString("to_") + quotedString("service"))


I run this with:

for line in """
1- set policy id 1000 from "Untrust" to "Trust" "Any" "1.1.1.1" "HTTP" nat dst ip 10.10.10.10 port 8000 permit log 

2- set policy id 5000 from "Trust" to "Untrust" "Any" "microsoft.com" "HTTP" nat src permit schedule "14August2014" log 

3- set policy id 7000 from "Trust" to "Untrust" "Users" "Any" "ANY" nat src dip-id 4 permit log 

4- set policy id 7000 from "Trust" to "Untrust" "servers" "Any" "ANY" deny 

""".splitlines():
    line = line.strip()
    if not line: continue
    # print() as a function so the loop also runs on Python 3
    print((integer + '-' + logParser).parseString(line).dump())
    print()

Getting:

['1', '-', 'SET', 'POLICY', 'ID', '1000', 'FROM', 'Untrust', 'TO', 'Trust', 'Any']
- from_: Untrust
- id: 1000
- service: Any
- to_: Trust

['2', '-', 'SET', 'POLICY', 'ID', '5000', 'FROM', 'Trust', 'TO', 'Untrust', 'Any']
- from_: Trust
- id: 5000
- service: Any
- to_: Untrust

['3', '-', 'SET', 'POLICY', 'ID', '7000', 'FROM', 'Trust', 'TO', 'Untrust', 'Users']
- from_: Trust
- id: 7000
- service: Users
- to_: Untrust

['4', '-', 'SET', 'POLICY', 'ID', '7000', 'FROM', 'Trust', 'TO', 'Untrust', 'servers']
- from_: Trust
- id: 7000
- service: servers
- to_: Untrust


Pyparsing provides the Optional class so that you can include expressions for pieces that might be missing, like "... + Optional(NAT + (SRC | DST)) + ..."
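As a sketch of where that could go, here is one possible extension covering the rest of the line. It is not the parser from the post above: the results names (from_zone, nat_type, dip_id and so on), the DIP_ID keyword and the use of QuotedString instead of quotedString with removeQuotes are all choices made for this illustration, and it assumes the action is always permit or deny.

```python
from pyparsing import CaselessKeyword, Combine, Optional, QuotedString, Word, nums

(SET, POLICY, ID, FROM, TO, NAT, SRC, DST, IP, PORT,
 SCHEDULE, LOG, PERMIT, DENY, DIP_ID) = map(
    CaselessKeyword,
    "set policy id from to nat src dst ip port schedule log permit deny dip-id".split())

integer = Word(nums)
ip_addr = Combine(integer + ("." + integer) * 3)
quoted = QuotedString('"')            # returns the text without the quotes

# Optional NAT clause: "nat src", "nat src dip-id 4", "nat dst ip A.B.C.D port N"
nat_clause = (NAT + (SRC | DST)("nat_type")
              + Optional(DIP_ID + integer("dip_id"))
              + Optional(IP + ip_addr("nat_ip") + PORT + integer("nat_port")))

policy = (SET + POLICY + ID + integer("id")
          + FROM + quoted("from_zone") + TO + quoted("to_zone")
          + quoted("source") + quoted("destination") + quoted("service")
          + Optional(nat_clause)
          + (PERMIT | DENY)("action")
          + Optional(SCHEDULE + quoted("schedule"))
          + Optional(LOG("log")))

line1 = ('set policy id 1000 from "Untrust" to "Trust" "Any" "1.1.1.1" "HTTP" '
         'nat dst ip 10.10.10.10 port 8000 permit log')
line2 = ('set policy id 5000 from "Trust" to "Untrust" "Any" "microsoft.com" "HTTP" '
         'nat src permit schedule "14August2014" log')
line4 = 'set policy id 7000 from "Trust" to "Untrust" "servers" "Any" "ANY" deny'

r1 = policy.parseString(line1)
r2 = policy.parseString(line2)
r4 = policy.parseString(line4)
```

Pieces wrapped in Optional that are absent from a line simply leave no results name behind, so a test like "nat_type" in r4 tells you whether the NAT clause was present.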

-- Paul
