Re for Apache log file format

Started by	Sam Giraffe <sam@giraffetech.biz>
First post	2013-10-07 23:33 -0700
Last post	2013-10-09 13:33 -0400
Articles	5 — 5 participants

  Re for Apache log file format Sam Giraffe <sam@giraffetech.biz> - 2013-10-07 23:33 -0700
    Re: Re for Apache log file format Neil Cerutti <neilc@norwich.edu> - 2013-10-08 12:50 +0000
    Re: Re for Apache log file format Denis McMahon <denismfmcmahon@gmail.com> - 2013-10-08 15:48 +0000
      Re: Re for Apache log file format Skip Montanaro <skip@pobox.com> - 2013-10-08 10:59 -0500
    Re: Re for Apache log file format Piet van Oostrum <piet@vanoostrum.org> - 2013-10-09 13:33 -0400

#56356 — Re for Apache log file format

From	Sam Giraffe <sam@giraffetech.biz>
Date	2013-10-07 23:33 -0700
Subject	Re for Apache log file format
Message-ID	<mailman.832.1381215979.18130.python-list@python.org>

Hi,

I am trying to split up the re pattern for the Apache log file format and
am having trouble getting Python to understand a multi-line pattern:

#!/usr/bin/python

import re

#this is a single line
string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'

#trying to break up the pattern match for easy to read code
pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
                     r'(?P<ident>\-)\s+'
                     r'(?P<username>\-)\s+'
                     r'(?P<TZ>\[(.*?)\])\s+'
                     r'(?P<url>\"(.*?)\")\s+'
                     r'(?P<httpcode>\d{3})\s+'
                     r'(?P<size>\d+)\s+'
                     r'(?P<referrer>\"\")\s+'
                     r'(?P<agent>\((.*?)\))')

match = re.search(pattern, string)

if match:
    print match.group('ip')
else:
    print 'not found'

The Python interpreter skips to the 'match = re.search' line and then the
'if' statement right after it looks at the <ip> line, instead of moving on
to <ident> and so on.

mybox:~ user$ python -m pdb /Users/user/Documents/Python/apache.py
> /Users/user/Documents/Python/apache.py(3)<module>()
-> import re
(Pdb) n
> /Users/user/Documents/Python/apache.py(5)<module>()
-> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
(Pdb) n
> /Users/user/Documents/Python/apache.py(7)<module>()
-> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
(Pdb) n
> /Users/user/Documents/Python/apache.py(17)<module>()
-> match = re.search(pattern, string)
(Pdb)

Thank you.

#56392

From	Neil Cerutti <neilc@norwich.edu>
Date	2013-10-08 12:50 +0000
Message-ID	<bbidceF44feU1@mid.individual.net>
In reply to	#56356

On 2013-10-08, Sam Giraffe <sam@giraffetech.biz> wrote:
>
> Hi,
>
> I am trying to split up the re pattern for Apache log file format and seem
> to be having some trouble in getting Python to understand multi-line
> pattern:
>
> #!/usr/bin/python
>
> import re
>
> #this is a single line
> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
>
> #trying to break up the pattern match for easy to read code
> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
>                      r'(?P<ident>\-)\s+'
>                      r'(?P<username>\-)\s+'
>                      r'(?P<TZ>\[(.*?)\])\s+'
>                      r'(?P<url>\"(.*?)\")\s+'
>                      r'(?P<httpcode>\d{3})\s+'
>                      r'(?P<size>\d+)\s+'
>                      r'(?P<referrer>\"\")\s+'
>                      r'(?P<agent>\((.*?)\))')

I recommend using the re.VERBOSE flag when explicating an re.
It'll make your life incrementally easier.

pattern = re.compile(
     r"""(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+
         (?P<ident>\-)\s+
         (?P<username>\-)\s+
         (?P<TZ>\[(.*?)\])\s+    # You can even insert comments.
         (?P<url>\"(.*?)\")\s+
         (?P<httpcode>\d{3})\s+
         (?P<size>\d+)\s+
         (?P<referrer>\"\")\s+
         (?P<agent>\((.*?)\))""", re.VERBOSE)
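
One caveat with re.VERBOSE, sketched here for Python 3 (the GET strings
are made up for illustration and are not part of the pattern above):
unescaped whitespace in the pattern is ignored, so a literal space has to
be written as \s, an escaped space, or [ ].

```python
import re

request = 'GET / HTTP/1.0'

# Without VERBOSE the space in the pattern is a literal space.
plain = re.compile(r'GET /')

# With VERBOSE the unescaped space is dropped, so this is really 'GET/'.
verbose_wrong = re.compile(r'GET /', re.VERBOSE)

# A character class keeps the space significant under VERBOSE.
verbose_right = re.compile(r'GET [ ] /', re.VERBOSE)

print(bool(plain.search(request)))          # True
print(bool(verbose_wrong.search(request)))  # False
print(bool(verbose_right.search(request)))  # True
```

A pattern that separates fields with \s+ is unaffected by this. The same
care applies to a literal #, which starts a comment in verbose mode unless
it is escaped or placed in a character class.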

-- 
Neil Cerutti

#56421

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2013-10-08 15:48 +0000
Message-ID	<l319gm$cl6$2@dont-email.me>
In reply to	#56356

On Mon, 07 Oct 2013 23:33:31 -0700, Sam Giraffe wrote:

> I am trying to split up the re pattern for Apache log file format and
> seem to be having some trouble in getting Python to understand
> multi-line pattern:

Aiui apache log format uses space as delimiter, encapsulates strings in 
'"' characters, and uses '-' as an empty field.

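So I think every element should match ("[^"]+"|\S+), with \s+ between
elements. The quoted alternative has to come first, because \S+ would
otherwise stop at the first space inside the quotes; a bare '-' is already
covered by \S+.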
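
A rough Python 3 sketch of this field-by-field idea, using the sample line
from the original post. The bracketed-timestamp alternative is an addition
not mentioned above (the timestamp contains a space), and the quoted and
bracketed alternatives are tried before \S+ so that \S+ does not stop at
the first space inside them.

```python
import re

line = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

# One alternative per kind of field: "quoted", [bracketed], or a bare token.
# A bare '-' (empty field) is picked up by \S+.
field = re.compile(r'"[^"]*"|\[[^\]]*\]|\S+')

fields = field.findall(line)
for number, value in enumerate(fields):
    print(number, value)
```

The fields come back in log order, with quotes and brackets still attached,
so they would need stripping before further use.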

-- 
Denis McMahon, denismfmcmahon@gmail.com

#56425

From	Skip Montanaro <skip@pobox.com>
Date	2013-10-08 10:59 -0500
Message-ID	<mailman.862.1381247993.18130.python-list@python.org>
In reply to	#56421

> Aiui apache log format uses space as delimiter, encapsulates strings in
> '"' characters, and uses '-' as an empty field.

Specifying the field delimiter as a space, you might be able to use
the csv module to read these. I haven't done any Apache log file work
since long before the csv module was available, but it just might
work.
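
A small Python 3 sketch of how that might look with the sample line from
the original post. It has not been checked against real log files: the
timestamp is not quoted, so it arrives as two fields that have to be
rejoined, and the sketch assumes no field contains an embedded quote.

```python
import csv

line = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

# csv.reader accepts any iterable of lines, so a one-element list works.
row = next(csv.reader([line], delimiter=' ', quotechar='"'))

# The [date zone] field contains a space, so rejoin its two halves.
timestamp = row[3] + ' ' + row[4]
ip, request, status, size, referrer, agent = (
    row[0], row[5], row[6], row[7], row[8], row[9])

print(ip, timestamp, request, status, size, referrer, agent)
```

Unlike a regular expression, the csv reader strips the surrounding quotes
from the request, referrer and agent fields.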

Skip

#56506

From	Piet van Oostrum <piet@vanoostrum.org>
Date	2013-10-09 13:33 -0400
Message-ID	<m2iox6xsbp.fsf@cochabamba.vanoostrum.org>
In reply to	#56356

Sam Giraffe <sam@giraffetech.biz> writes:

> Hi,
>
> I am trying to split up the re pattern for Apache log file format and seem to be having some
> trouble in getting Python to understand multi-line pattern:
>
> #!/usr/bin/python
>
> import re
>
> #this is a single line
> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
>
> #trying to break up the pattern match for easy to read code
> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
>                      r'(?P<ident>\-)\s+'
>                      r'(?P<username>\-)\s+'
>                      r'(?P<TZ>\[(.*?)\])\s+'
>                      r'(?P<url>\"(.*?)\")\s+'
>                      r'(?P<httpcode>\d{3})\s+'
>                      r'(?P<size>\d+)\s+'
>                      r'(?P<referrer>\"\")\s+'
>                      r'(?P<agent>\((.*?)\))')
>
> match = re.search(pattern, string)
>
> if match:
>     print match.group('ip')
> else:
>     print 'not found'
>
> The python interpreter is skipping to the 'match = re.search' and then the 'if' statement right
> after it looks at the <ip>, instead of moving onto <ident> and so on.

Although you have written the regexp across several lines, Python joins adjacent string literals into a single string at compile time, so the whole re.compile(...) call is one statement. pdb therefore takes a single step over it and does not go into its "parts", which really are not parts.
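
A quick way to see this, sketched in Python 3 with a shortened two-piece version of the pattern (the names pieces and single are only for illustration):

```python
import re

# Adjacent string literals are joined by the compiler before the code runs.
pieces = (r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
          r'(?P<ident>\-)\s+')
single = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+(?P<ident>\-)\s+'

print(pieces == single)                      # True
print(re.compile(pieces).pattern == single)  # True: re.compile() sees one string
```

To find out which group fails to match, one option is to compile growing prefixes of the pattern and test each one against the log line.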
>
> mybox:~ user$ python -m pdb /Users/user/Documents/Python/apache.py
> > /Users/user/Documents/Python/apache.py(3)<module>()
> -> import re
> (Pdb) n
> > /Users/user/Documents/Python/apache.py(5)<module>()
> -> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
> (Pdb) n
> > /Users/user/Documents/Python/apache.py(7)<module>()
> -> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> (Pdb) n
> > /Users/user/Documents/Python/apache.py(17)<module>()
> -> match = re.search(pattern, string)
> (Pdb)

Also, as Andreas has noted, the r'(?P<referrer>\"\")\s+' part is wrong: it only matches an empty pair of quotes, but the log line has "-" there. It should probably be
r'(?P<referrer>\".*?\")\s+'

And r'(?P<agent>\((.*?)\))' will also not match, as there is text outside the (). It should probably also be
r'(?P<agent>\".*?\")' or something like it.
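
Putting both suggestions together gives something like the following sketch, written for Python 3 (print as a function, and pattern.search instead of re.search). Only the referrer and agent groups are changed; the ident and username groups still match nothing but a literal '-', so lines with a real user name would need a looser pattern there.

```python
import re

line = ('192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" '
        '302 276 "-" "check_http/v1.4.16 (nagios-plugins 1.4.16)"')

pattern = re.compile(
    r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
    r'(?P<ident>-)\s+'
    r'(?P<username>-)\s+'
    r'(?P<TZ>\[.*?\])\s+'
    r'(?P<url>".*?")\s+'
    r'(?P<httpcode>\d{3})\s+'
    r'(?P<size>\d+)\s+'
    r'(?P<referrer>".*?")\s+'   # was \"\", which only matched empty quotes
    r'(?P<agent>".*?")'         # was \((.*?)\), which ignored the quotes
)

match = pattern.search(line)
if match:
    print(match.group('ip'), match.group('referrer'), match.group('agent'))
else:
    print('not found')
```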
-- 
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
