Re: Re for Apache log file format

From Piet van Oostrum <piet@vanoostrum.org>
Newsgroups comp.lang.python
Subject Re: Re for Apache log file format
Date 2013-10-09 13:33 -0400
Message-ID <m2iox6xsbp.fsf@cochabamba.vanoostrum.org> (permalink)
References <mailman.832.1381215979.18130.python-list@python.org>



Sam Giraffe <sam@giraffetech.biz> writes:

> Hi,
>
> I am trying to split up the re pattern for Apache log file format and seem to be having some
> trouble in getting Python to understand multi-line pattern:
>
> #!/usr/bin/python
>
> import re
>
> #this is a single line
> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
>
> #trying to break up the pattern match for easy to read code
> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
>                      r'(?P<ident>\-)\s+'
>                      r'(?P<username>\-)\s+'
>                      r'(?P<TZ>\[(.*?)\])\s+'
>                      r'(?P<url>\"(.*?)\")\s+'
>                      r'(?P<httpcode>\d{3})\s+'
>                      r'(?P<size>\d+)\s+'
>                      r'(?P<referrer>\"\")\s+'
>                      r'(?P<agent>\((.*?)\))')
>
> match = re.search(pattern, string)
>
> if match:
>     print match.group('ip')
> else:
>     print 'not found'
>
> The python interpreter is skipping to the 'match = re.search' and then the 'if' statement right
> after it looks at the <ip>, instead of moving onto <ident> and so on.

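Although you have written the regexp across several lines, Python joins the adjacent string literals into a single string when it compiles the script. The whole re.compile(...) call is therefore one statement, so pdb executes it in a single step and cannot stop at its "parts", which are not really separate parts.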
>
> mybox:~ user$ python -m pdb /Users/user/Documents/Python/apache.py
>> /Users/user/Documents/Python/apache.py(3)<module>()
> -> import re
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(5)<module>()
> -> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(7)<module>()
> -> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(17)<module>()
> -> match = re.search(pattern, string)
> (Pdb)
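
The jump from line 7 to line 17 in the session above is what you would expect if the pieces are one string. A minimal sketch to check this (the names parts and single are only for illustration, they are not in your script):

```python
# Adjacent string literals are concatenated at compile time,
# so the name below is bound to one str object, not to three pieces.
parts = (r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
         r'(?P<ident>\-)\s+'
         r'(?P<username>\-)\s+')

# The same pattern written as one long literal.
single = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+(?P<ident>\-)\s+(?P<username>\-)\s+'

print(type(parts).__name__)   # str
print(parts == single)        # True
```

So re.compile() only ever sees one argument, and there is nothing in between for the debugger to step through.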

Also, as Andreas has noted, the r'(?P<referrer>\"\")\s+' part is wrong: it matches only an empty pair of quotes, while your log line has "-" in that field. It should probably be
r'(?P<referrer>\".*?\")\s+'

And the r'(?P<agent>\((.*?)\))' part will not match either, because the agent field has text outside the (). It should probably also be
r'(?P<agent>\".*?\")' or something like it.
-- 
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]



Thread

Re for Apache log file format Sam Giraffe <sam@giraffetech.biz> - 2013-10-07 23:33 -0700
  Re: Re for Apache log file format Neil Cerutti <neilc@norwich.edu> - 2013-10-08 12:50 +0000
  Re: Re for Apache log file format Denis McMahon <denismfmcmahon@gmail.com> - 2013-10-08 15:48 +0000
    Re: Re for Apache log file format Skip Montanaro <skip@pobox.com> - 2013-10-08 10:59 -0500
  Re: Re for Apache log file format Piet van Oostrum <piet@vanoostrum.org> - 2013-10-09 13:33 -0400
