Re: grimace: a fluent regular expression generator in Python

Started by	"Anders J. Munch" <2013@jmunch.dk>
First post	2013-07-16 13:38 +0200
Last post	2013-07-17 10:22 -0400
Articles	2 — 2 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: grimace: a fluent regular expression generator in Python "Anders J. Munch" <2013@jmunch.dk> - 2013-07-16 13:38 +0200
    Re: grimace: a fluent regular expression generator in Python Roy Smith <roy@panix.com> - 2013-07-17 10:22 -0400

#50743 — Re: grimace: a fluent regular expression generator in Python

From	"Anders J. Munch" <2013@jmunch.dk>
Date	2013-07-16 13:38 +0200
Subject	Re: grimace: a fluent regular expression generator in Python
Message-ID	<mailman.4772.1373978931.3114.python-list@python.org>

Ben Last wrote:
> north_american_number_re = (RE().start
>     .literal('(').followed_by.exactly(3).digits.then.literal(')')
>     .then.one.literal("-").then.exactly(3).digits
>     .then.one.dash.followed_by.exactly(4).digits.then.end
>     .as_string())

Very cool.  It's a bit verbose for my taste, and I'm not sure how well it will 
cope with nested structure.

Here's my take on what readable regexps could look like:

north_american_number_re = RE.compile(r"""
     ^
     "(" digit{3} ")"  # And why shouldn't a regexp
     "-" digit{3}      # include an embedded comment?
     "-" digit{4}
     $
""")

The problem with Perl-style regexp notation isn't so much that it's terse - it's 
that the syntax is irregular (sic) and doesn't follow modern principles for 
lexical structure in computer languages.  You can get a long way just by 
ignoring whitespace, putting literals in quotes and allowing embedded comments.
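A minimal sketch of how such a notation might be translated into an ordinary pattern. The `translate` helper and its `TOKEN` table are hypothetical, written only for illustration; they are not part of the re module or of grimace, and they handle just the constructs used in the example above (quoted literals, the word `digit`, comments and whitespace).

```python
import re

# Hypothetical tokenizer for the quoted-literal notation sketched above.
TOKEN = re.compile(
    r'\s+'                    # whitespace: ignored
    r'|#[^\n]*'               # comment to end of line: ignored
    r'|"(?P<literal>[^"]*)"'  # quoted literal: matched verbatim
    r'|(?P<digit>\bdigit\b)'  # the word "digit": becomes \d
    r'|(?P<other>[^\s"#]+)'   # anything else: passed through unchanged
)


def translate(source):
    """Translate the quoted-literal notation into a standard re pattern."""
    parts = []
    for m in TOKEN.finditer(source):
        if m.group('literal') is not None:
            # Escape the literal so characters like "(" lose their magic.
            parts.append(re.escape(m.group('literal')))
        elif m.group('digit'):
            parts.append(r'\d')
        elif m.group('other'):
            parts.append(m.group('other'))
        # Whitespace and comments match no named group and are dropped.
    return ''.join(parts)


number_re = re.compile(translate('''
    ^
    "(" digit{3} ")"  # area code
    "-" digit{3}
    "-" digit{4}
    $
'''))
```

With this sketch, `number_re.match("(555)-123-4567")` succeeds, while a string without the parentheses does not match.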

Setting the re.VERBOSE flag achieves two out of three, so you can write:

import re

north_american_number_re = re.compile(r"""
     ^
     \( \d{3} \)   # Definite improvement, though I really miss putting
     - \d{3}       # literals in quotes.
     - \d{4}
     $
""", re.VERBOSE)

It's too bad re.VERBOSE isn't the default.
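Short of a different default, verbose mode can also be switched on from inside the pattern with the inline flag `(?x)`, which has to come at the very start. A small sketch; the name `number_re` and the sample number in the note below are made up for illustration.

```python
import re

# The leading (?x) turns on verbose mode for the whole pattern,
# so no flags argument is needed in the call to re.compile.
number_re = re.compile(r"""(?x)
    ^
    \( \d{3} \)   # area code in literal (escaped) parentheses
    - \d{3}
    - \d{4}
    $
""")
```

Here `number_re.match("(555)-123-4567")` succeeds.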

regards, Anders

#50794

From	Roy Smith <roy@panix.com>
Date	2013-07-17 10:22 -0400
Message-ID	<roy-B92B61.10220017072013@70-1-84-166.pools.spcsdns.net>
In reply to	#50743

In article <mailman.4772.1373978931.3114.python-list@python.org>,
 "Anders J. Munch" <2013@jmunch.dk> wrote:

> The problem with Perl-style regexp notation isn't so much that it's terse -
> it's that the syntax is irregular (sic) and doesn't follow modern principles
> for lexical structure in computer languages.

There seem to be three basic ways to denote what's literal and what's 
not.

1) The Python (and C, Java, PHP, Fortran, etc) way, where all text is 
assumed to be evaluated as a language construct, unless explicitly 
quoted to make it a literal.

2) The shell way, where all text is assumed to be literal strings, 
unless explicitly marked with a $ (or other sigil) as a variable.

3) The regex way, where some characters are magic, but only sometimes 
(depending on context), and you just have to know which ones they are, 
and when, and can escape them to make them non-magic if you have to.
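To make the third case concrete, here is a short sketch of a few characters whose meaning in Python's re module depends on where they appear, together with `re.escape` for making a whole string non-magic. The sample strings are arbitrary.

```python
import re

# '-' is literal outside a character class, but denotes a range inside one.
assert re.fullmatch(r'a-c', 'a-c')
assert re.fullmatch(r'[a-c]', 'b')
assert not re.fullmatch(r'[a-c]', '-')

# '^' anchors at the start of a pattern, but negates at the start of a class.
assert re.match(r'^x', 'x')
assert re.fullmatch(r'[^x]', 'y')

# '.' matches any character, except inside a class, where it is literal.
assert re.fullmatch(r'a.c', 'abc')
assert not re.fullmatch(r'a[.]c', 'abc')

# re.escape backslash-escapes the special characters in a string,
# so the result matches that string literally.
literal = re.escape('(555)')
assert re.fullmatch(literal, '(555)')
```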

Where things get really messy is when you try to embed one language into 
another, such as regexes in Python.  Perl (and awk, from which it 
evolved) solves the problem in its own way by making regexes a built-in 
part of the language syntax.  Python goes in the other direction, and 
says regexes are just strings that you pass around.

> You can get a long way just by ignoring whitespace, putting literals 
> in quotes and allowing embedded comments.  Setting the re.VERBOSE 
> flag achieves two out of three [example elided].

Yup.  Here's a more complex example.  We use this to parse haproxy log 
files (probably going to be munged a bit as lines get refolded by news 
software).  That would be insane without verbose mode (some might argue 
it's insane now, but that's another thread).

pattern = re.compile(r'haproxy\[(?P<pid>\d+)]: '
                     r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
                     r'(?P<client_port>\d{1,5}) '
                     r'\[(?P<accept_date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})] '
                     r'(?P<frontend_name>\S+) '
                     r'(?P<backend_name>\S+)/'
                     r'(?P<server_name>\S+) '
                     r'(?P<Tq>(-1|\d+))/'
                     r'(?P<Tw>(-1|\d+))/'
                     r'(?P<Tc>(-1|\d+))/'
                     r'(?P<Tr>(-1|\d+))/'
                     r'(?P<Tt>\+?\d+) '
                     r'(?P<status_code>\d{3}) '
                     r'(?P<bytes_read>\d+) '
                     r'(?P<captured_request_cookie>\S+) '
                     r'(?P<captured_response_cookie>\S+) '
                     r'(?P<termination_state>[\w-]{4}) '
                     r'(?P<actconn>\d+)/'
                     r'(?P<feconn>\d+)/'
                     r'(?P<beconn>\d+)/'
                     r'(?P<srv_conn>\d+)/'
                     r'(?P<retries>\d+) '
                     r'(?P<srv_queue>\d+)/'
                     r'(?P<backend_queue>\d+) '
                     # Comment this out for a stock haproxy (see above)
                     r'(\{(?P<request_id>.*?)\} )?'
                     r'(\{(?P<captured_request_headers>.*?)\} )?'
                     r'(\{(?P<captured_response_headers>.*?)\} )?'
                     r'"(?P<http_request>.+)"'
                     )
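The pattern above is laid out as adjacent string literals, one field per line, rather than compiled with re.VERBOSE. A sketch of how the first three fields might look under either approach; the sample log line is made up for illustration and is not real haproxy output. In verbose mode unescaped whitespace is ignored, so the literal spaces are written as `[ ]`.

```python
import re

# Adjacent string literals, as in the pattern above (first three fields only).
concatenated = re.compile(r'haproxy\[(?P<pid>\d+)]: '
                          r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
                          r'(?P<client_port>\d{1,5}) ')

# The same fields with re.VERBOSE; each literal space becomes [ ].
verbose = re.compile(r"""
    haproxy\[(?P<pid>\d+)]:[ ]                 # process id
    (?P<client_ip>(\d{1,3}\.){3}\d{1,3}):      # client address
    (?P<client_port>\d{1,5})[ ]                # client port
""", re.VERBOSE)

# Hypothetical log line fragment, invented for this example.
line = 'haproxy[1234]: 10.0.0.1:54321 '

fields = concatenated.match(line).groupdict()
assert fields == verbose.match(line).groupdict()
```

Either way, the named groups come back as a dict via `groupdict()`, which is what makes a pattern of this size workable.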
