| From | Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: regexp(ing) Backus-Naurish expressions ... |
| Date | 2013-03-11 17:00 -0500 |
| Organization | A noiseless patient Spider |
| Message-ID | <khlk24$bqu$1@dont-email.me> (permalink) |
| References | <khgr2k$u3b$1@speranza.aioe.org> <vo7pj8p9b0rv8pfkp64902inolmml9vm01@4ax.com> <aq4cssFsm5rU1@mid.individual.net> <ke3qj8daj46td8g3lklac14mtr56na4rv4@4ax.com> |
On 3/10/2013 5:54 PM, Roedy Green wrote:
> Examples where regexes run out of steam:
> parsing Java, HTML, BAT language ... to do syntax colouring.
Actually, all of those examples fall under the category of lexing, which
is very easy to do with regular expressions; the Python equivalent of
flex uses regular expressions internally to do the lexing. Basically,
what you have to do is this:
1. For each token, write the regex that matches the token and enclose
it in a named capturing group.
2. Combine the token regexes into a single regex using disjunctions.
3. Run the large regex on the input string by repeatedly finding
matches until it runs out of them.
4. For each match, use the named capturing group to decide which action
to run for that part of the input string.
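
The four steps above can be sketched in Java (7 or later, which added named capturing groups). This is only an illustration under my own assumptions: the token names and patterns (NUMBER, IDENT, STRING, OP, SPACE) are a made-up toy language, not anything from a real lexer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLexer {
    // Hypothetical token set for a toy language; names and patterns are illustrative only.
    // LinkedHashMap keeps insertion order, which becomes the alternation order.
    private static final Map<String, String> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("NUMBER", "\\d+");
        TOKENS.put("IDENT", "[A-Za-z_]\\w*");
        TOKENS.put("STRING", "\"(?:[^\\\\\"]|\\\\.)*\"");
        TOKENS.put("OP", "[-+*/=()]");
        TOKENS.put("SPACE", "\\s+");
    }

    private static final Pattern LEXER = buildPattern();

    // Steps 1 and 2: wrap each token regex in a named group and join them with '|'.
    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : TOKENS.entrySet()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?<").append(e.getKey()).append('>').append(e.getValue()).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        // Step 3: keep finding matches until none remain.
        // Note that find() silently skips characters no token matches.
        while (m.find()) {
            // Step 4: the one named group that is non-null tells us the token type.
            for (String name : TOKENS.keySet()) {
                if (m.group(name) != null) {
                    if (!name.equals("SPACE")) {
                        out.add(name + ":" + m.group(name));
                    }
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + \"hi\""));
    }
}
```

A real lexer would probably want to report the skipped characters as errors (for example by comparing m.start() with the end of the previous match) rather than ignore them.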
> screen scraping, where what you want can appear in arbitrary order, be
> missing, or enclosed in a variety of delimiters.
([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^ \t\r\n()<>,:;@["])+
That is an example of a production regular expression I use specifically
for tokenizing. Note in particular that I am matching two separate kinds
of string literals ("foo" and [foo]). The hard part here is that I'm
dealing with an idiot language that made comment-parsing context-free,
but I decided to say "to hell with this" and ignore that fact, banking
that it's a rare edge case I never have to deal with.
Granted, such large regular expressions can become extremely unwieldy
(said regex is actually built from about five lines of code plus
detailed comments above each part explaining what it does), but it's
still very simple to do in a regex.
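
To give an idea of what "a few commented lines per part" can look like, here is a sketch in Java that assembles a tokenizer regex of the same shape from named pieces. It is not the actual production code. The constant names are invented, the double quotes around the quoted-literal alternative are an assumption (the regex as shown above has none, possibly lost in transit), and the `[` inside the last character class is escaped because Java would otherwise read it as the start of a nested class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComposedTokenizer {
    // A single delimiter character, captured in group 1.
    private static final String SPECIAL = "([()<>,:;@])";

    // A "foo" literal with backslash escapes.
    // ASSUMPTION: the surrounding double quotes are added for this sketch.
    private static final String QUOTED = "\"(?:[^\\\\\"]|\\\\.)*\"";

    // A [foo] literal with backslash escapes.
    private static final String BRACKETED = "\\[(?:[^\\\\\\]]|\\\\.)*\\]";

    // Anything else: a run of escaped characters, or characters that are
    // neither whitespace nor one of the delimiters above.
    // The '[' is escaped because Java treats a bare '[' in a class as a nested class.
    private static final String WORD = "(?:\\\\.|[^ \\t\\r\\n()<>,:;@\\[\"])+";

    static final Pattern TOKEN =
            Pattern.compile(SPECIAL + "|" + QUOTED + "|" + BRACKETED + "|" + WORD);

    // Collects every match in order; whitespace matches nothing and is skipped by find().
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical input chosen to exercise all four alternatives.
        System.out.println(tokenize("\"John Doe\" <jdoe@[10.0.0.1]>"));
    }
}
```

Keeping each alternative in its own constant means the comment sits next to the piece it describes, and the final Pattern.compile line reads as the list of token kinds.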
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth