From: Ben Last
Date: Wed, 17 Jul 2013 10:33:17 +0800
Subject: Re: grimace: a fluent regular expression generator in Python
To: python-list@python.org
Newsgroups: comp.lang.python

On 16 July 2013 20:48, <python-list-request@python.org> wrote:
> From: "Anders J. Munch" <2013@jmunch.dk>
> Date: Tue, 16 Jul 2013 13:38:35 +0200
>
> Ben Last wrote:
>> north_american_number_re = (RE().start
>>     .literal('(').followed_by.exactly(3).digits.then.literal(')')
>>     .then.one.literal("-").then.exactly(3).digits
>>     .then.one.dash.followed_by.exactly(4).digits.then.end
>>     .as_string())
>
> Very cool. It's a bit verbose for my taste, and I'm not sure how well it will cope with nested structure.

I guess verbosity is the aim, in that *explicit is better than implicit* :) And I suppose that's one of the attributes of a fluent system; they tend to need more typing. It's not Perl...
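
To make the verbosity trade-off concrete, here is a hand-written terse equivalent of the quoted phone-number chain, using only the standard `re` module. It is a sketch of what such a chain plausibly describes, not output captured from grimace; the name `NORTH_AMERICAN_NUMBER` and the sample strings are made up for illustration.

```python
import re

# Hand-written equivalent of the fluent chain quoted above (an assumption,
# not grimace's actual output): start, "(", 3 digits, ")", "-",
# 3 digits, "-", 4 digits, end.
NORTH_AMERICAN_NUMBER = r"^\(\d{3}\)-\d{3}-\d{4}$"

pattern = re.compile(NORTH_AMERICAN_NUMBER)

print(bool(pattern.match("(123)-456-7890")))  # full layout present
print(bool(pattern.match("123-456-7890")))    # parentheses missing
```

The terse form is shorter to type, but each escape and quantifier has to be decoded by the reader, which is the cost the fluent form trades typing for.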

> The problem with Perl-style regexp notation isn't so much that it's terse - it's that the syntax is irregular (sic) and doesn't follow modern principles for lexical structure in computer languages. You can get a long way just by ignoring whitespace, putting literals in quotes and allowing embedded comments.

Good points. I wanted to find a syntax that allows comments as well as being fluent:

(RE()
 .any_number_of.digits  # Recall that any_number_of includes zero
 .followed_by.an_optional.dot.then.at_least_one.digit  # The dot is specifically optional
 # but we must have one digit as a minimum
 .as_string())

... and yes, I also specifically wanted to have literals quoted.
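
For comparison, the standard library's `re.VERBOSE` flag gives raw patterns the two properties discussed above: whitespace is ignored and `#` comments are allowed. Below is a hand-written pattern that says the same thing as the fluent chain. It is an assumed equivalent, not grimace's output, and `DECIMAL` is just an illustrative name.

```python
import re

# Assumed hand-written equivalent of the fluent chain; under re.VERBOSE
# the whitespace is ignored and the "#" comments are stripped.
DECIMAL = re.compile(r"""
    \d*     # any number of digits, including zero
    \.?     # the dot is specifically optional
    \d+     # but we must have one digit as a minimum
""", re.VERBOSE)

for text in ("3.14", ".5", "42", "."):
    print(text, bool(DECIMAL.fullmatch(text)))
```

The literal dot still has to be escaped by hand here, which is the kind of detail that quoting literals is meant to take care of.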

Nested groups work, but I haven't tackled lookahead and backreferences, essentially because if you're writing an RE that complex, you should probably be working directly in RE strings.
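
As an illustration of that last point, here is what lookahead and a backreference look like when written directly as RE strings with the standard `re` module. The two patterns and their names are invented for this example and have nothing to do with grimace itself.

```python
import re

# Lookahead: a word only if it is immediately followed by a colon;
# the colon itself is not part of the match.
key_before_colon = re.compile(r"\w+(?=:)")

# Backreference: the same word twice in a row, e.g. "the the".
doubled_word = re.compile(r"\b(\w+)\s+\1\b")

print(key_before_colon.findall("host: example port: 80"))
print(doubled_word.search("fix the the typo").group(1))
```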

Depending on what you mean by "nested", re-use of RE objects is easy (example from the unit tests):

identifier_start_chars = RE().regex("[a-zA-Z_]")
identifier_chars = RE().regex("[a-zA-Z0-9_]")

self.assertEqual(RE().one_or_more.of(identifier_start_chars)
                 .followed_by.zero_or_more(identifier_chars)
                 .as_string(),
                 r"[a-zA-Z_]+[a-zA-Z0-9_]*")
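
The expected string in that test is an ordinary pattern, so its behaviour can be checked with the standard `re` module. The sketch below rebuilds it by plain string concatenation, which is a stand-in for the RE-object reuse shown above rather than grimace's own mechanism; the variable names mirror the test but hold plain strings here.

```python
import re

# Plain-string stand-ins for the reusable RE objects in the unit test.
identifier_start_chars = "[a-zA-Z_]"
identifier_chars = "[a-zA-Z0-9_]"

# One or more start characters, followed by zero or more identifier characters.
identifier = identifier_start_chars + "+" + identifier_chars + "*"
assert identifier == r"[a-zA-Z_]+[a-zA-Z0-9_]*"

identifier_re = re.compile(identifier)
print(bool(identifier_re.fullmatch("_private1")))  # starts with an underscore
print(bool(identifier_re.fullmatch("1abc")))       # starts with a digit
```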


Thanks for the comments!
ben
