Groups > comp.lang.python > #50743 > unrolled thread
| Started by | "Anders J. Munch" <2013@jmunch.dk> |
|---|---|
| First post | 2013-07-16 13:38 +0200 |
| Last post | 2013-07-17 10:22 -0400 |
| Articles | 2 — 2 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: grimace: a fluent regular expression generator in Python "Anders J. Munch" <2013@jmunch.dk> - 2013-07-16 13:38 +0200
Re: grimace: a fluent regular expression generator in Python Roy Smith <roy@panix.com> - 2013-07-17 10:22 -0400
| From | "Anders J. Munch" <2013@jmunch.dk> |
|---|---|
| Date | 2013-07-16 13:38 +0200 |
| Subject | Re: grimace: a fluent regular expression generator in Python |
| Message-ID | <mailman.4772.1373978931.3114.python-list@python.org> |
Ben Last wrote:
> north_american_number_re = (RE().start
>
> .literal('(').followed_by.exactly(3).digits.then.literal(')')
> .then.one.literal("-").then.exactly(3).digits
>
> .then.one.dash.followed_by.exactly(4).digits.then.end
> .as_string())
Very cool. It's a bit verbose for my taste, and I'm not sure how well it will
cope with nested structure.
Here's my take on what readable regexps could look like:
north_american_number_re = RE.compile(r"""
    ^
    "(" digit{3} ")"   # And why shouldn't a regexp
    "-" digit{3}       # include an embedded comment?
    "-" digit{4}
    $
""")
The problem with Perl-style regexp notation isn't so much that it's terse - it's
that the syntax is irregular (sic) and doesn't follow modern principles for
lexical structure in computer languages. You can get a long way just by
ignoring whitespace, putting literals in quotes and allowing embedded comments.
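To make the proposal concrete, here is a minimal sketch of how such a notation could be translated into an ordinary pattern for the standard `re` module. Everything in it is an assumption made for illustration: the names `translate`, `_TOKEN` and `_NAMED` are hypothetical, only the `digit` class, `{n}` repeat counts and the `^`/`$` anchors are supported, and it is not how grimace or any existing library works.

```python
import re

# Hypothetical tokenizer for the quoted-literal notation sketched above.
_TOKEN = re.compile(r'''
    \s+                                 # whitespace: ignored
  | \#[^\n]*                            # comment to end of line: ignored
  | "(?P<literal>[^"]*)"                # quoted literal, matched verbatim
  | (?P<name>[A-Za-z_]+)                # named class such as digit
  | (?P<raw>\{\d+(?:,\d*)?\}|[\^$])     # repeat counts and anchors pass through
''', re.VERBOSE)

# Only one named class is defined in this sketch.
_NAMED = {"digit": r"\d"}


def translate(source):
    """Turn the readable notation into a plain regex string."""
    out = []
    pos = 0
    while pos < len(source):
        m = _TOKEN.match(source, pos)
        if m is None:
            raise ValueError("unexpected text at offset %d" % pos)
        if m.group("literal") is not None:
            literal = m.group("literal")
            escaped = re.escape(literal)
            # Group multi-character literals so a following {n} repeats all of it.
            out.append("(?:%s)" % escaped if len(literal) > 1 else escaped)
        elif m.group("name") is not None:
            name = m.group("name")
            if name not in _NAMED:
                raise ValueError("unknown class name: %r" % name)
            out.append(_NAMED[name])
        elif m.group("raw") is not None:
            out.append(m.group("raw"))
        pos = m.end()
    return "".join(out)


SOURCE = '''
    ^
    "(" digit{3} ")"   # area code
    "-" digit{3}       # exchange
    "-" digit{4}       # line number
    $
'''

north_american_number_re = re.compile(translate(SOURCE))
```

Quoted literals go through `re.escape`, so nothing inside the quotes is ever treated as a metacharacter, which is the property the notation is after.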
Setting the re.VERBOSE flag achieves two out of three, so you can write:
north_american_number_re = re.compile(r"""
    ^
    \( \d{3} \)   # Definite improvement, though I really miss putting
    - \d{3}       # literals in quotes.
    - \d{4}
    $
""", re.VERBOSE)
It's too bad re.VERBOSE isn't the default.
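Short of changing the default, verbose mode can also be switched on from inside the pattern with the inline flag `(?x)`, as long as it sits at the very start of the expression. A small sketch, where the names `NANP` and `is_north_american` are made up for the example:

```python
import re

# (?x) at the very start of the pattern enables verbose mode,
# so no flags argument is needed at the call site.
NANP = re.compile(r"""(?x)
    ^
    \( \d{3} \)   # area code in parentheses
    - \d{3}       # exchange
    - \d{4}       # line number
    $
""")


def is_north_american(text):
    """Return True if text looks like (xxx)-xxx-xxxx."""
    return NANP.match(text) is not None
```

This keeps the flag with the pattern text, which helps when the string is stored or passed around separately from the `re.compile` call.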
regards, Anders
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-07-17 10:22 -0400 |
| Message-ID | <roy-B92B61.10220017072013@70-1-84-166.pools.spcsdns.net> |
| In reply to | #50743 |
In article <mailman.4772.1373978931.3114.python-list@python.org>,
"Anders J. Munch" <2013@jmunch.dk> wrote:
> The problem with Perl-style regexp notation isn't so much that it's terse -
> it's
> that the syntax is irregular (sic) and doesn't follow modern principles for
> lexical structure in computer languages.
There seem to be three basic ways to denote what's literal and what's
not.
1) The Python (and C, Java, PHP, Fortran, etc) way, where all text is
assumed to be evaluated as a language construct, unless explicitly
quoted to make it a literal.
2) The shell way, where all text is assumed to be literal strings,
unless explicitly marked with a $ (or other sigil) as a variable.
3) The regex way, where some characters are magic, but only sometimes
(depending on context), and you just have to know which ones they are,
and when, and can escape them to make them non-magic if you have to.
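A few cases in Python's `re` module illustrate point 3: the same character is magic in one position and literal in another. The table below is only an illustration; `CASES` and `run_cases` are names invented for this sketch.

```python
import re

# (pattern, text, should_match, note): same characters, different roles.
CASES = [
    (r"a.c",   "abc", True,  "'.' outside brackets matches any character"),
    (r"a[.]c", "abc", False, "'.' inside brackets is a literal dot"),
    (r"a[.]c", "a.c", True,  "'.' inside brackets is a literal dot"),
    (r"a\.c",  "abc", False, "a backslash makes '.' non-magic"),
    (r"a-c",   "a-c", True,  "'-' outside brackets is literal"),
    (r"[a-c]", "b",   True,  "'-' between two class members is a range"),
    (r"[ac-]", "b",   False, "'-' at the end of a class is literal"),
    (r"[ac-]", "-",   True,  "'-' at the end of a class is literal"),
    (r"^a",    "a",   True,  "'^' at the start of a pattern is an anchor"),
    (r"[^a]",  "a",   False, "'^' first in a class negates the class"),
    (r"[a^]",  "^",   True,  "'^' elsewhere in a class is literal"),
]


def run_cases(cases=CASES):
    """Return one bool per case: did the match result agree with the table?"""
    return [
        (re.fullmatch(pattern, text) is not None) == expected
        for pattern, text, expected, _note in cases
    ]
```

When the text to match comes from somewhere else, `re.escape` sidesteps the question by escaping every character that could be magic.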
Where things get really messy is when you try to embed one language into
another, such as regexes in Python. Perl (and awk, from which it
evolved) solves the problem in its own way by making regexes a built-in
part of the language syntax. Python goes in the other direction, and
says regexes are just strings that you pass around.
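A short sketch of what "just strings" means in practice: a raw string literal only changes how Python reads backslashes, and larger patterns can be assembled with ordinary string operations before `re` ever sees them. The names `OCTET`, `IPV4` and `find_addresses` are invented for the example.

```python
import re

# A raw string is still a plain str; only the backslash handling differs.
assert r"\d" == "\\d"
assert isinstance(r"\d{1,3}", str)

# Build a bigger pattern from a smaller one with normal string methods.
OCTET = r"\d{1,3}"
IPV4 = r"\.".join([OCTET] * 4)   # octets separated by escaped dots


def find_addresses(text):
    """Return every dotted-quad lookalike found in text."""
    return re.findall(IPV4, text)
```

The flip side is that Python cannot check the pattern's syntax until `re.compile` (or a function that compiles implicitly) runs.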
> You can get a long way just by ignoring whitespace, putting literals
> in quotes and allowing embedded comments. Setting the re.VERBOSE
> flag achieves two out of three [example elided].
Yup. Here's a more complex example. We use this to parse haproxy log
files (it's probably going to be munged a bit as lines get refolded by
news software). That would be insane without verbose mode (some might
argue it's insane now, but that's another thread).
import re

pattern = re.compile(
    r'haproxy\[(?P<pid>\d+)]: '
    r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
    r'(?P<client_port>\d{1,5}) '
    r'\[(?P<accept_date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})] '
    r'(?P<frontend_name>\S+) '
    r'(?P<backend_name>\S+)/'
    r'(?P<server_name>\S+) '
    r'(?P<Tq>(-1|\d+))/'
    r'(?P<Tw>(-1|\d+))/'
    r'(?P<Tc>(-1|\d+))/'
    r'(?P<Tr>(-1|\d+))/'
    r'(?P<Tt>\+?\d+) '
    r'(?P<status_code>\d{3}) '
    r'(?P<bytes_read>\d+) '
    r'(?P<captured_request_cookie>\S+) '
    r'(?P<captured_response_cookie>\S+) '
    r'(?P<termination_state>[\w-]{4}) '
    r'(?P<actconn>\d+)/'
    r'(?P<feconn>\d+)/'
    r'(?P<beconn>\d+)/'
    r'(?P<srv_conn>\d+)/'
    r'(?P<retries>\d+) '
    r'(?P<srv_queue>\d+)/'
    r'(?P<backend_queue>\d+) '
    r'(\{(?P<request_id>.*?)\} )?'  # Comment this out for a stock haproxy (see above)
    r'(\{(?P<captured_request_headers>.*?)\} )?'
    r'(\{(?P<captured_response_headers>.*?)\} )?'
    r'"(?P<http_request>.+)"'
)
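The pattern above is assembled from adjacent raw strings, one named group per line. For comparison, here is a rough sketch of how the first few fields might look with `re.VERBOSE` itself. It is a cut-down illustration, not the full production pattern: `HAPROXY_PREFIX` and `SAMPLE` are hypothetical names, and the sample line is made up rather than taken from a real log. Note that verbose mode ignores bare spaces, so each literal space has to be written as `[ ]`.

```python
import re

# Cut-down, hypothetical version of the first four fields, using re.VERBOSE.
HAPROXY_PREFIX = re.compile(r"""
    haproxy\[(?P<pid>\d+)\]:[ ]                  # process id; [ ] is a literal space
    (?P<client_ip>(?:\d{1,3}\.){3}\d{1,3}):      # dotted-quad client address
    (?P<client_port>\d{1,5})[ ]                  # client port
    \[(?P<accept_date>
        \d{2}/\w{3}/\d{4}                        # day/month/year
        (?::\d{2}){3}\.\d{3}                     # :HH:MM:SS.mmm
    )\]
""", re.VERBOSE)

# Made-up sample text in the shape the pattern expects.
SAMPLE = "haproxy[1234]: 10.0.0.1:51234 [17/Jul/2013:10:22:00.123] rest of line"

match = HAPROXY_PREFIX.match(SAMPLE)
fields = match.groupdict() if match else {}
```

Named groups pay off at this point: `groupdict()` returns the fields keyed by name, so the calling code never has to count parentheses.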