Re: regexp(ing) Backus-Naurish expressions ...

From Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
Newsgroups comp.lang.java.programmer
Subject Re: regexp(ing) Backus-Naurish expressions ...
Date 2013-03-11 17:00 -0500
Organization A noiseless patient Spider
Message-ID <khlk24$bqu$1@dont-email.me>
References <khgr2k$u3b$1@speranza.aioe.org> <vo7pj8p9b0rv8pfkp64902inolmml9vm01@4ax.com> <aq4cssFsm5rU1@mid.individual.net> <ke3qj8daj46td8g3lklac14mtr56na4rv4@4ax.com>


On 3/10/2013 5:54 PM, Roedy Green wrote:
> Examples where regexes run out of steam:
> parsing Java, HTML, BAT language ... to do syntax colouring.

Actually, all of those examples fall under the category of lexing, which 
is very easy to do with regular expressions; the Python equivalent of 
flex uses regular expressions internally to do the lexing. Basically, 
what you'd have to do is this:

1. For each token type, write the regex that matches it and enclose 
it in a named capturing group.
2. Combine the token regexes into a single regex using disjunctions.
3. Run the large regex on the input string by repeatedly finding 
matches until it runs out of them.
4. For each match, check which named capturing group matched and run 
the action for that token type on that part of the input string.
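
The four steps above can be sketched in Java roughly as follows. This is 
only a minimal illustration: the token kinds and their patterns (COMMENT, 
STRING, NUMBER, IDENT, SPACE, OTHER) and the class name RegexLexer are 
assumptions made up for the example, not a lexer for any real language.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLexer {
    // Step 1: one regex per token kind (illustrative assumptions only).
    // Insertion order matters: Java tries alternatives left to right.
    private static final Map<String, String> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("COMMENT", "//[^\\n]*");
        TOKENS.put("STRING", "\"(?:[^\\\\\"\\n]|\\\\.)*\"");
        TOKENS.put("NUMBER", "\\d+");
        TOKENS.put("IDENT", "[A-Za-z_]\\w*");
        TOKENS.put("SPACE", "\\s+");
        TOKENS.put("OTHER", "."); // fallback: any single character
    }

    // Step 2: wrap each regex in a named group and join them with '|'.
    private static final Pattern LEXER = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : TOKENS.entrySet()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?<").append(e.getKey()).append('>')
              .append(e.getValue()).append(')');
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // Steps 3 and 4: find matches until none are left; for each match,
    // look up which named group is non-null and act on it. The "action"
    // here is just recording "KIND:text" and dropping whitespace.
    public static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            for (String kind : TOKENS.keySet()) {
                String text = m.group(kind);
                if (text != null) {
                    if (!kind.equals("SPACE")) {
                        out.add(kind + ":" + text);
                    }
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : lex("int x = 42; // answer")) {
            System.out.println(token);
        }
    }
}
```

Note that java.util.regex picks the first alternative that matches, not 
the longest one, so the order of the token kinds is part of the design. 
A syntax colourer would replace the list-building with whatever styling 
action belongs to each kind.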

> screen scraping, where what you want can appear in arbitrary order, be
> missing, or enclosed in a variety of delimiters.

([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^ \t\r\n()<>,:;@["])+

That is an example of a production regular expression I use specifically 
for tokenizing. Note in particular that I am matching two separate kinds 
of string literals ("foo" and [foo]). The hard part here is that I'm 
dealing with an idiot language that made comment-parsing context-free, 
but I decided to say "to hell with this" and ignore that fact, banking 
that it's a rare edge case I never have to deal with.

Granted, such large regular expressions can become extremely unwieldy 
(said regex is actually composed out of about five lines of code plus 
detailed comments above each part explaining what it does), but it's 
still very simple to do in a regex.
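
To illustrate the "composed out of a few commented lines" idea, here is a 
hedged Java sketch that builds a tokenizer of the same shape from named 
pieces. Several things are assumptions rather than the original code: the 
class and constant names are invented, the double quotes around the 
quoted-string alternative are added because the pattern as posted shows 
none and would otherwise match the empty string, the '[' inside the last 
character class is escaped because Java treats an unescaped '[' there as 
a nested class, and the sample input is merely a plausible guess at the 
kind of text being tokenized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderTokenizer {
    // Single-character delimiters, each returned as its own token.
    private static final String SPECIAL = "([()<>,:;@])";

    // "quoted string": non-quote, non-backslash characters or backslash
    // escapes. ASSUMPTION: the enclosing quotes are added here.
    private static final String QUOTED = "\"(?:[^\\\\\"]|\\\\.)*\"";

    // [bracketed literal], again allowing backslash escapes inside.
    private static final String BRACKETED = "\\[(?:[^\\\\\\]]|\\\\.)*\\]";

    // Everything else: a run of escapes, or of characters that are not
    // whitespace, delimiters, or the start of one of the literals above.
    // ASSUMPTION: '[' is escaped inside the class for Java's regex syntax.
    private static final String ATOM =
        "(?:\\\\.|[^ \\t\\r\\n()<>,:;@\\[\"])+";

    // The pieces are joined with '|' in the same order as in the post.
    private static final Pattern TOKEN =
        Pattern.compile(SPECIAL + "|" + QUOTED + "|" + BRACKETED + "|" + ATOM);

    // Repeatedly find the next token; unmatched whitespace is skipped.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical input, chosen only to exercise all four pieces.
        for (String t : tokenize("\"John Q\" <jq@[10.0.0.1]>")) {
            System.out.println(t);
        }
    }
}
```

As in the post, comments are not handled: a parenthesis is simply 
returned as a delimiter token, and any nesting is left to the caller.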

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth
