Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
Sender: benlast@gmail.com
From: Ben Last <ben@benlast.com>
Date: Wed, 17 Jul 2013 10:33:17 +0800
Subject: Re: grimace: a fluent regular expression generator in Python
To: python-list@python.org
Content-Type: multipart/alternative; boundary=089e013d1dc6bbc8af04e1abeb3d
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.4800.1374059632.3114.python-list@python.org>
Lines: 152
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:50786

--089e013d1dc6bbc8af04e1abeb3d
Content-Type: text/plain; charset=UTF-8

On 16 July 2013 20:48, <python-list-request@python.org> wrote:

> From: "Anders J. Munch" <2013@jmunch.dk>
> Date: Tue, 16 Jul 2013 13:38:35 +0200
> Ben Last wrote:
>
>> north_american_number_re = (RE().start
>>     .literal('(').followed_by.exactly(3).digits.then.literal(')')
>>     .then.one.literal("-").then.exactly(3).digits
>>     .then.one.dash.followed_by.exactly(4).digits.then.end
>>     .as_string())
>>
>
> Very cool.  It's a bit verbose for my taste, and I'm not sure how well it
> will cope with nested structure.
>

I guess verbosity is the aim, in that *explicit is better than implicit* :)
And I suppose that's one of the attributes of fluent systems: they tend
to need more typing.  It's not Perl...



> The problem with Perl-style regexp notation isn't so much that it's terse
> - it's that the syntax is irregular (sic) and doesn't follow modern
> principles for lexical structure in computer languages.  You can get a long
> way just by ignoring whitespace, putting literals in quotes and allowing
> embedded comments.
>

Good points.  I wanted to find a syntax that allows comments as well as
being fluent:

(RE()
 .any_number_of.digits  # Recall that any_number_of includes zero
 .followed_by.an_optional.dot.then.at_least_one.digit  # The dot is specifically optional,
 # but we must have one digit as a minimum
 .as_string())

... and yes, I also specifically wanted to have literals quoted.
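
For comparison, the standard library's re.VERBOSE flag already covers two of
those points for plain RE strings: whitespace is ignored and comments are
allowed. A minimal sketch, assuming the fluent chain above means roughly
\d*\.?\d+ (the exact string grimace emits is an assumption here, and
DECIMAL_RE and is_decimal are illustrative names only):

```python
import re

# Hand-written equivalent of the intent of the fluent chain above.
# Assumption: this mirrors the meaning, not grimace's literal output.
DECIMAL_RE = re.compile(r"""
    \d*     # any number of digits (including zero)
    \.?     # the dot is specifically optional
    \d+     # but we must have one digit as a minimum
""", re.VERBOSE)


def is_decimal(text):
    """Return True if the whole string matches the pattern."""
    return DECIMAL_RE.fullmatch(text) is not None


print(is_decimal("3.14"), is_decimal(".5"), is_decimal("42"), is_decimal("1."))
```

What re.VERBOSE does not give you is quoted literals: a literal dot still has
to be escaped by hand.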

Nested groups work, but I haven't tackled lookahead and backreferences,
essentially because if you're writing an RE that complex, you should
probably be working directly in RE strings.
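
To illustrate the kind of pattern meant here, this is a small sketch of a
backreference and a lookahead written directly as RE strings with the
standard re module. The pattern names and sample strings are made up for
illustration:

```python
import re

# Backreference: \1 must repeat whatever group 1 captured.
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b")

# Lookahead: (?=...) checks what follows without consuming it.
DIGITS_BEFORE_PX = re.compile(r"\d+(?=px)")

print(DOUBLED_WORD.search("this is is a test").group(1))
print(DIGITS_BEFORE_PX.findall("10px 20em 30px"))
```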

Depending on what you mean by "nested", re-use of RE objects is easy
(example from the unit tests):

identifier_start_chars = RE().regex("[a-zA-Z_]")
identifier_chars = RE().regex("[a-zA-Z0-9_]")

self.assertEqual(RE().one_or_more.of(identifier_start_chars)
                     .followed_by.zero_or_more(identifier_chars)
                     .as_string(),
                     r"[a-zA-Z_]+[a-zA-Z0-9_]*")
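
For readers wondering how this sort of chaining and re-use can work at all,
here is a toy sketch. It is not grimace's implementation: MiniRE and its
internals are hypothetical, it supports only a handful of words, and it uses
.of() after both quantifiers to keep the toy uniform.

```python
import re


class MiniRE:
    """Toy fluent regex builder (hypothetical, not grimace's code)."""

    def __init__(self, parts=()):
        self._parts = tuple(parts)
        self._quantifier = ""  # pending quantifier for the next element

    def _emit(self, fragment):
        # Return a new builder rather than mutating this one, so that
        # existing objects can be re-used safely.
        return MiniRE(self._parts + (fragment + self._quantifier,))

    def _quantify(self, quantifier):
        new = MiniRE(self._parts)
        new._quantifier = quantifier
        return new

    @property
    def one_or_more(self):
        return self._quantify("+")

    @property
    def zero_or_more(self):
        return self._quantify("*")

    @property
    def followed_by(self):
        return self  # connective word: reads well, changes nothing

    def regex(self, raw):
        # Assumption: the raw fragment is a single atom, e.g. a char class.
        return self._emit(raw)

    def of(self, other):
        body = other.as_string()
        if len(other._parts) > 1:
            # Group multi-part builders so the quantifier covers all of it.
            body = "(?:" + body + ")"
        return self._emit(body)

    def as_string(self):
        return "".join(self._parts)


identifier_start_chars = MiniRE().regex("[a-zA-Z_]")
identifier_chars = MiniRE().regex("[a-zA-Z0-9_]")

pattern = (MiniRE().one_or_more.of(identifier_start_chars)
                   .followed_by.zero_or_more.of(identifier_chars)
                   .as_string())
print(pattern)
```

Returning a fresh object from every step is the design choice that makes
re-use safe: identifier_start_chars is unchanged after being embedded.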


Thanks for the comments!
ben

--089e013d1dc6bbc8af04e1abeb3d--