Re: In defence of 80-char lines

From	Neil Cerutti <neilc@norwich.edu>
Newsgroups	comp.lang.python
Subject	Re: In defence of 80-char lines
Date	4 Apr 2013 15:56:56 GMT
Organization	Norwich University
Message-ID	<as5m68Fe3ttU1@mid.individual.net>
References	<515cd919$0$29966$c3e8da3$5496439d@news.astraweb.com> <mailman.96.1365077619.3114.python-list@python.org> <roy-D6F29A.08394604042013@news.panix.com>

On 2013-04-04, Roy Smith <roy@panix.com> wrote:
> re.X is a pretty cool tool for making huge regexes readable.
> But, it turns out that python's auto-continuation and string
> literal concatenation rules are enough to let you get much the
> same effect.  Here's a regex we use to parse haproxy log files.
> This would be utter line noise all run together. This way, it's
> almost readable :-)
>
> pattern = re.compile(r'haproxy\[(?P<pid>\d+)]: '
>                      r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
>                      r'(?P<client_port>\d{1,5}) '
>                      r'\[(?P<accept_date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})] '
>                      r'(?P<frontend_name>\S+) '
>                      r'(?P<backend_name>\S+)/'
>                      r'(?P<server_name>\S+) '
>                      r'(?P<Tq>(-1|\d+))/'
>                      r'(?P<Tw>(-1|\d+))/'
>                      r'(?P<Tc>(-1|\d+))/'
>                      r'(?P<Tr>(-1|\d+))/'
>                      r'(?P<Tt>\+?\d+) '
>                      r'(?P<status_code>\d{3}) '
>                      r'(?P<bytes_read>\d+) '
>                      r'(?P<captured_request_cookie>\S+) '
>                      r'(?P<captured_response_cookie>\S+) '
>                      r'(?P<termination_state>[\w-]{4}) '
>                      r'(?P<actconn>\d+)/'
>                      r'(?P<feconn>\d+)/'
>                      r'(?P<beconn>\d+)/'
>                      r'(?P<srv_conn>\d+)/'
>                      r'(?P<retries>\d+) '
>                      r'(?P<srv_queue>\d+)/'
>                      r'(?P<backend_queue>\d+) '
>                      r'(\{(?P<request_id>.*?)\} )?'
>                      r'(\{(?P<captured_request_headers>.*?)\} )?'
>                      r'(\{(?P<captured_response_headers>.*?)\} )?'
>                      r'"(?P<http_request>.+)"'
>                      )
>
> And, for those of you who go running in the other direction every time 
> regex is suggested as a solution, I challenge you to come up with easier 
> to read (or write) code for parsing a line like this (probably 
> hopelessly mangled by the time you read it):
>
> 2013-04-03T00:00:00+00:00 localhost haproxy[5199]: 10.159.19.244:57291 
> [02/Apr/2013:23:59:59.811] app-nodes next-song-nodes/web8.songza.com 
> 0/0/3/214/219 200 593 sessionid=NWiX5KGOdvg6dSaA 
> sessionid=NWiX5KGOdvg6dSaA ---- 249/249/149/14/0 0/0 
> {4C0ABFA9-515B6DEF-933229} "POST 
> /api/1/station/892337/song/16024201/notify-play HTTP/1.0"

The big win from the above seems to me to be the groupdict
result. The parsing is also very simple, with virtually no
nesting. It's a good application of re.
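
To make the groupdict point concrete, here is a minimal sketch that
uses only the first seven fields of the quoted pattern and a
shortened copy of the sample line. The names line, match and fields
are made up for the illustration, and the accept_date piece is split
across two literals purely to keep the lines short.

```python
import re

# First seven fields of the quoted pattern, built from adjacent
# raw-string literals. The rest of the log line is left unparsed.
pattern = re.compile(r'haproxy\[(?P<pid>\d+)]: '
                     r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
                     r'(?P<client_port>\d{1,5}) '
                     r'\[(?P<accept_date>\d{2}/\w{3}/\d{4}'
                     r'(:\d{2}){3}\.\d{3})] '
                     r'(?P<frontend_name>\S+) '
                     r'(?P<backend_name>\S+)/'
                     r'(?P<server_name>\S+) ')

line = ('2013-04-03T00:00:00+00:00 localhost haproxy[5199]: '
        '10.159.19.244:57291 [02/Apr/2013:23:59:59.811] app-nodes '
        'next-song-nodes/web8.songza.com 0/0/3/214/219 200 593')

match = pattern.search(line)
# groupdict() returns only the named groups, keyed by name, as strings.
fields = match.groupdict() if match else {}
for name in sorted(fields):
    print(name, '=', fields[name])
```

Every value comes back as a str, so any conversion to int or
datetime still has to happen after the match.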

It seems easy enough to do with str methods, but would it be an
improvement?

I ran out of time before the prototype was finished, but here's a
sketch.


import datetime
import pprint

s = ('2013-04-03T00:00:00+00:00 localhost haproxy[5199]: 10.159.19.244:57291'
    ' [02/Apr/2013:23:59:59.811] app-nodes next-song-nodes/web8.songza.com'
    ' 0/0/3/214/219 200 593 sessionid=NWiX5KGOdvg6dSaA'
    ' sessionid=NWiX5KGOdvg6dSaA ---- 249/249/149/14/0 0/0'
    ' {4C0ABFA9-515B6DEF-933229}'
    ' "POST /api/1/station/892337/song/16024201/notify-play HTTP/1.0"')

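def get_haproxy(s):
    # Expects a token like 'haproxy[5199]:' and returns the pid as an int.
    prefix = 'haproxy['
    if s.startswith(prefix):
        try:
            return int(s[len(prefix):s.index(']')])
        except ValueError:  # no closing ']' or a non-numeric pid
            return False
    return False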

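def get_client_info(s):
    # Expects 'ip:port'; the ip is kept as a string, the port becomes an int.
    ip, colon, port = s.partition(':')
    if colon != ':':
        return False
    try:
        return ip, int(port)
    except ValueError:  # non-numeric port
        return False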

def get_accept_date(s):
    try:
        return datetime.datetime.strptime(s, '[%d/%b/%Y:%H:%M:%S.%f]')
    except ValueError:
        return False

def get_backend(s):
    name, slash, server = s.partition('/')
    if slash != '/':
        return False
    else:
        return name, server

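def get_track_info(s):
    # Five '/'-separated integers (Tq/Tw/Tc/Tr/Tt); -1 and '+N' are accepted.
    try:
        timers = [int(t) for t in s.split('/')]
    except ValueError:
        return False
    return timers if len(timers) == 5 else False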

# One (key, matcher) pair per whitespace-separated token, in order.
matchers = [
    (None, None),         # timestamp: skipped
    (None, 'localhost'),  # literal that must match exactly
    ('haproxy', get_haproxy),
    (('client_ip', 'client_port'), get_client_info),
    ('accept_date', get_accept_date),
    ('frontend_name', lambda s: s),
    (('backend_name', 'server_name'), get_backend),
    (('Tq', 'Tw', 'Tc', 'Tr', 'Tt'), get_track_info),
]
result = {}

for i, token in enumerate(s.split()):
    if i >= len(matchers):  # I'm not finished writing matchers yet.
        break
    key, matcher = matchers[i]
    if matcher is None:
        continue  # this field is skipped
    if isinstance(matcher, str):
        value = matcher == token  # a literal matcher must equal the token
    else:
        value = matcher(token)
    if value is False:
        raise ValueError('Parse error {}: {} "{}"'.format(
            key, matcher, token))
    if isinstance(key, tuple):
        result.update(zip(key, value))
    elif key is not None:
        result[key] = value
pprint.pprint(result)

The engine would need to be improved in implementation and made
more flexible once it's working and tested. I think the error
handling is a good feature and the ability to customize parsing
and return custom types is cool.
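
As a rough illustration of those two features, here is a minimal
sketch of the same table-driven idea wrapped in a function. The
names parse_client and parse_tokens are hypothetical, and using the
standard ipaddress module for the client address is my assumption of
what a "custom type" might look like, not part of the prototype.

```python
import ipaddress

def parse_client(token):
    """Split 'ip:port' into (IPv4Address/IPv6Address, int), or False."""
    ip, colon, port = token.partition(':')
    try:
        return ipaddress.ip_address(ip), int(port)
    except ValueError:  # malformed address or non-numeric/missing port
        return False

def parse_tokens(tokens, matchers):
    """Apply one (key, matcher) pair per token and collect the results."""
    result = {}
    for token, (key, matcher) in zip(tokens, matchers):
        if matcher is None:
            continue  # field is skipped
        if isinstance(matcher, str):
            value = matcher == token  # literal must equal the token
        else:
            value = matcher(token)
        if value is False:
            raise ValueError('Parse error {}: {!r}'.format(key, token))
        if isinstance(key, tuple):
            result.update(zip(key, value))
        elif key is not None:
            result[key] = value
    return result

matchers = [
    (None, 'localhost'),
    (('client_ip', 'client_port'), parse_client),
]
parsed = parse_tokens('localhost 10.159.19.244:57291'.split(), matchers)
print(parsed)
```

A token that fails its matcher raises ValueError naming the key and
the offending text, which is the kind of diagnostic a single failed
regex match does not give you.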

-- 
Neil Cerutti
