Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!eternal-september.org!feeder.eternal-september.org!mx05.eternal-september.org!.POSTED!not-for-mail
From: =?UTF-8?B?Sm9zaHVhIENyYW5tZXIg8J+Qpw==?= <Pidgeot18@verizon.invalid>
Newsgroups: comp.lang.java.programmer
Subject: Re: regexp(ing) Backus-Naurish expressions ...
Date: Mon, 11 Mar 2013 17:00:04 -0500
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <khlk24$bqu$1@dont-email.me>
References: <khgr2k$u3b$1@speranza.aioe.org> <vo7pj8p9b0rv8pfkp64902inolmml9vm01@4ax.com> <aq4cssFsm5rU1@mid.individual.net> <ke3qj8daj46td8g3lklac14mtr56na4rv4@4ax.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 11 Mar 2013 21:58:28 +0000 (UTC)
Injection-Info: mx05.eternal-september.org; posting-host="95ab4c362dcce54022543f80bbbb46d3"; logging-data="12126"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+mgxfhXhfNa8wyiFrvVbntOt2sEwcdEo0="
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Thunderbird/20.0
In-Reply-To: <ke3qj8daj46td8g3lklac14mtr56na4rv4@4ax.com>
Cancel-Lock: sha1:vsxR+HF8xUZEJE+ozMTEXI5slrU=
Xref: csiph.com comp.lang.java.programmer:22904

On 3/10/2013 5:54 PM, Roedy Green wrote:
> Examples where regexes run out of steam:
> parsing Java, HTML, BAT language ... to do syntax colouring.

Actually, all of those examples fall under the category of lexing, which 
is very easy to do with regular expressions; the Python equivalent of 
flex uses regular expressions internally to do the lexing. Basically, 
what you'd have to do is this:

1. For each kind of token, write the regex that matches it and enclose 
it in a named capturing group.
2. Combine the token regexes into a single regex using disjunctions.
3. Run the large regex on the input string by repeatedly finding 
matches until it runs out of them.
4. For each match, check which named capturing group matched and do the 
action for that part of the input string.
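
A rough sketch of those four steps in Java 7, which supports named 
groups. The token names and patterns (NUMBER, IDENT, OP, SPACE) and the 
class name TinyLexer are made up for illustration and are not any real 
language's lexical grammar. Note that find() silently skips characters 
no token matches, so a real lexer would want an error token as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyLexer {
    // Step 1: one {name, regex} pair per token kind (all hypothetical).
    private static final String[][] TOKENS = {
        {"NUMBER", "[0-9]+"},
        {"IDENT",  "[A-Za-z_][A-Za-z_0-9]*"},
        {"OP",     "[-+*/=()]"},
        {"SPACE",  "\\s+"},
    };

    // Step 2: wrap each regex in a named group and join them with '|'.
    private static final Pattern LEXER = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (String[] t : TOKENS) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?<").append(t[0]).append('>').append(t[1]).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    public static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        // Step 3: keep finding matches until the regex runs out of them.
        while (m.find()) {
            // Step 4: the one non-null named group says which token this is.
            for (String[] t : TOKENS) {
                if (m.group(t[0]) != null) {
                    if (!t[0].equals("SPACE")) {   // drop whitespace tokens
                        out.add(t[0] + ":" + m.group());
                    }
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [IDENT:x1, OP:=, NUMBER:42, OP:+, IDENT:y]
        System.out.println(lex("x1 = 42 + y"));
    }
}
```

For syntax colouring, the "action" in step 4 would be picking a colour 
for the matched range instead of adding a string to a list.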

> screen scraping, where what you want can appear in arbitrary orders, be
> missing, or enclosed in a variety of delimiters.

([()<>,:;@])|"(?:[^\\"]|\\.)*"|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^ \t\r\n()<>,:;@["])+

That is an example of a production regular expression I use specifically 
for tokenizing. Note in particular that I am matching two separate kinds 
of string literals ("foo" and [foo]). The hard part here is that I'm 
dealing with an idiot language that made comment parsing context-free 
rather than regular, but I decided to say "to hell with this" and 
ignore that fact, banking on it being a rare edge case I never have to 
deal with.

Granted, such large regular expressions can become extremely unwieldy 
(said regex is actually composed out of about five lines of code plus 
detailed comments above each part explaining what it does), but it's 
still very simple to do in a regex.
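
To illustrate the "few lines plus comments" approach, here is a sketch 
of how a tokenizer regex like the one above could be assembled in Java. 
This is not my actual code: the constant names and the class name 
AddressTokenizer are invented, and I'm assuming the quoted-string 
alternative is delimited by double quotes, as in "foo". Java also treats 
an unescaped [ inside a character class as the start of a nested class, 
so it is escaped here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressTokenizer {
    // A single delimiter character, returned as its own token.
    static final String SPECIAL = "([()<>,:;@])";

    // A "quoted string": any run of non-backslash, non-quote characters
    // or backslash escapes, between double quotes (assumed delimiters).
    static final String QUOTED = "\"(?:[^\\\\\"]|\\\\.)*\"";

    // A [bracketed literal], using the same escape rule as above.
    static final String BRACKETED = "\\[(?:[^\\\\\\]]|\\\\.)*\\]";

    // A bare word: backslash escapes, or anything that is not
    // whitespace, a delimiter, an opening bracket or a double quote.
    static final String WORD = "(?:\\\\.|[^ \\t\\r\\n()<>,:;@\\[\"])+";

    // The pieces are tried left to right at each position.
    static final Pattern TOKEN =
        Pattern.compile(SPECIAL + "|" + QUOTED + "|" + BRACKETED + "|" + WORD);

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {          // whitespace between tokens is skipped
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints ["John Doe", <, jd, @, [10.0.0.1], >]
        System.out.println(tokenize("\"John Doe\" <jd@[10.0.0.1]>"));
    }
}
```

Keeping each alternative in its own commented constant is what stops 
the combined pattern from turning into line noise.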

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth