From: Joshua Cranmer 🐧
Newsgroups: comp.lang.java.programmer
Subject: Re: regexp(ing) Backus-Naurish expressions ...
Date: Mon, 11 Mar 2013 17:00:04 -0500

On 3/10/2013 5:54 PM, Roedy Green wrote:
> Examples where regexes run out of steam:
> parsing Java, HTML, BAT language ... to do syntax colouring.

Actually, all of those examples fall under the category of lexing, which
is very easy to do with regular expressions; the Python equivalent of
flex uses regular expressions internally to do the lexing. Basically,
what you'd have to do is this:

1. For each token, compute the regex that matches the token and enclose
   it in a named capturing group
2. Combine the token regexes into a single regex using disjunctions
3. Run the large regex on the input string by continually finding
   matches until it runs out of them.
4. For each match, use the named capturing group to do actions for that
   part of the input string.

> screen scraping, where what you want can appear in arbiter orders, be
> missing, or enclosed in a variety of delimiters.
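
For the Java crowd, here is a rough sketch of the four lexing steps listed above. The token kinds (COMMENT, STRING, NUMBER, IDENT, PUNCT), their patterns, and the class and method names are all made up for a toy language; they are assumptions for illustration, not anything from a real lexer. It relies on java.util.regex named groups, so it needs Java 7 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLexer {
    // Step 1: one regex per token kind (hypothetical kinds for a toy language).
    // LinkedHashMap keeps insertion order, and earlier alternatives win,
    // so COMMENT is listed before PUNCT to keep "//" from lexing as two slashes.
    private static final Map<String, String> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("COMMENT", "//[^\\n]*");
        TOKENS.put("STRING", "\"(?:[^\\\\\"]|\\\\.)*\"");
        TOKENS.put("NUMBER", "\\d+");
        TOKENS.put("IDENT", "[A-Za-z_]\\w*");
        TOKENS.put("PUNCT", "[-+*/=(){};]");
    }

    // Step 2: wrap each token regex in a named group and join them with '|'.
    private static final Pattern LEXER = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : TOKENS.entrySet()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?<").append(e.getKey()).append('>')
              .append(e.getValue()).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Steps 3 and 4: keep calling find() until there are no more matches,
    // and use whichever named group is non-null to decide what was matched.
    // Text that matches no token (whitespace here) is silently skipped.
    static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            for (String kind : TOKENS.keySet()) {
                if (m.group(kind) != null) {
                    out.add(kind + ":" + m.group(kind));
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : lex("x = 42; // answer")) {
            System.out.println(token);
        }
    }
}
```

A syntax colourer would replace the out.add(...) line with whatever action it takes per token kind. A stricter lexer would probably also want to report the unmatched text between matches instead of skipping it.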

([()<>,:;@])|"(?:[^\\"]|\\.)*"|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^ \t\r\n()<>,:;@["])+

That is an example of a production regular expression I use specifically
for tokenizing. Note in particular that I am matching two separate kinds
of string literals ("foo" and [foo]). The hard part here is that I'm
dealing with an idiot language that made comment-parsing context-free,
but I decided to say "to hell with this" and ignore that fact, banking
that it's a rare edge case I never have to deal with.

Granted, such large regular expressions can become extremely unwieldy
(said regex is actually composed out of about five lines of code plus
detailed comments above each part explaining what it does), but it's
still very simple to do in a regex.

-- 
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth