Regex tools

Started by	Roedy Green <see_website@mindprod.com.invalid>
First post	2011-03-30 03:27 -0700
Last post	2011-05-27 21:38 +0200
Articles	4 — 4 participants

  Regex tools Roedy Green <see_website@mindprod.com.invalid> - 2011-03-30 03:27 -0700
    Re: Regex tools David Lamb <dalamb@cs.queensu.ca> - 2011-03-30 16:31 -0400
    Re: Regex tools terry0k <terryok00@gmail.com> - 2011-05-12 21:01 +0800
    Re: Regex tools Robert Klemme <shortcutter@googlemail.com> - 2011-05-27 21:38 +0200

#16 — Regex tools

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-03-30 03:27 -0700
Subject	Regex tools
Message-ID	<9516p69657r0vi887067gnu632f0hvmrbv@4ax.com>

Has anyone ever thought of writing a regex debugging frame?

You would give it lists of lines that should match and lists of lines
that should not.

It then shows you the furthest offset in the regex it managed to match
to for each line, and verifies that lines that should match did match
(including each group(i)), and that lines that should not match did not.

Once you have mastered that, consider writing an automatic regex
composer.  Just from the examples it composes a regex.  You then add
more examples to test the regex, or modify it by hand. Perhaps it even
generates more examples to clarify your intentions.  You iterate till
you have a working regex.

The thing that bothers me is I am never sure a regex is fully
debugged.

-- 
Roedy Green Canadian Mind Products
http://mindprod.com
There are only two industries that refer to their customers as "users".
~ Edward Tufte

#17

From	David Lamb <dalamb@cs.queensu.ca>
Date	2011-03-30 16:31 -0400
Message-ID	<GqMkp.1583$g56.573@newsfe04.iad>
In reply to	#16

On 30/03/2011 6:27 AM, Roedy Green wrote:
> has anyone ever thought of writing a regex debugging frame.
> The thing that bothers me is I am never sure a regex is fully
> debugged.

It has been far too long since I've looked at finite automata theory,
but I suspect somebody has already developed an algorithm that generates a
minimal set of inputs that drives a deterministic finite state machine
through some suitable set of paths through its states.  (I'm thinking of
needing only a few test cases for (a)*: zero a's, one, and either 2
or some larger number.)
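
As a rough illustration of that handful of cases, here is a small Java sketch. It assumes java.util.regex and whole-string matching; the class name StarBoundaryCases, the helper fullMatch, and the particular rejected inputs are made up for this example and are not derived from any automaton-based algorithm.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class StarBoundaryCases {

    // True only if the whole input is matched by the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "(a)*";
        // Zero, one, two, and "some larger number" of repetitions.
        String[] accepted = {"", "a", "aa", "aaaaaaa"};
        // Hand-picked inputs that fall outside the language at different positions.
        String[] rejected = {"b", "ab", "ba", "aab"};

        for (String s : accepted) {
            System.out.println("accept \"" + s + "\": " + fullMatch(regex, s));
        }
        for (String s : rejected) {
            System.out.println("reject \"" + s + "\": " + !fullMatch(regex, s));
        }
    }
}
```

Every line printed should end in "true" if the pattern behaves as intended.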

But regexes in programming languages are lots more complex than the ones
usually used in theory (where all you get is primitives like a single
symbol, sequencing like ab, repetition like a*, and alternation like a|b).

#30

From	terry0k <terryok00@gmail.com>
Date	2011-05-12 21:01 +0800
Message-ID	<4dcbda12$0$13394$afc38c87@news.optusnet.com.au>
In reply to	#16

On 03/30/2011 06:27 PM, Roedy Green wrote:
> has anyone ever thought of writing a regex debugging frame.
>
> You would give it lists of lines that should match and lists of lines
> that should not.
>
> It then shows you the furthest offset in the regex it managed to match
> to for each line, and verifies lines that should match matched (and
> the group(i), and that lines that did not match did not.
>
> Once you have mastered that, consider writing a automatic regex
> composer.  Just from the examples it composes a regex.  You then add
> more examples to test the regex, or modify it by hand. Perhaps it even
> generates more examples to clarify your intentions.  You iterate till
> you have a working regex.
>
> The thing that bothers me is I am never sure a regex is fully
> debugged.
>
Roedy/guys,
I am sort of dealing in this area. My project is based on a Swing GUI,
'TelFormFactory', which generates an XML Schema that defines the
allowable content in a data form. It then generates 'TelForms' which
allow input, data-checking and sending of data to 'TelFormHost'.
Currently TelForm clients are Application and Applet (J2SE1.2 for market
breadth) and Midlet CLDC 1.1 and midp1. All 3 take about 10 seconds.

The W3C XML Schema standard allows for a subset of the Perl (and Java) 
RegEx to restrict, or otherwise re-define, the base datatype of the 
TelForm field. TelFormFactory has a facility to input a RegEx and test 
data against this. I have about 4 XML Schema validators and this is 
definitely an area that needs more work. I have spent some time on 
'RegEx Builder' which allows trialling interim expressions. I haven't 
looked at this for a year, and largely switched it off but if people are 
interested I will put it out. All feedback welcome.
See www.terry-comms.com

           Cheers,
                              Terry O'K.

#34

From	Robert Klemme <shortcutter@googlemail.com>
Date	2011-05-27 21:38 +0200
Message-ID	<94acthF4dgU1@mid.individual.net>
In reply to	#16

On 30.03.2011 12:27, Roedy Green wrote:
> has anyone ever thought of writing a regex debugging frame.

There are tools out there which let you debug a regular expression. 
This also greatly helps understanding how matching proceeds.  Turns out 
I recommended Regexp Coach to you already:

http://www.velocityreviews.com/forums/t648698-debugging-regex.html

> You would give it lists of lines that should match and lists of lines
> that should not.
>
> It then shows you the furthest offset in the regex it managed to match
> to for each line, and verifies lines that should match matched (and
> the group(i), and that lines that did not match did not.

Well, basically you can easily write a short program which throws a 
number of texts against a regular expression and spits out matches, 
their positions etc.
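
A minimal sketch of such a program in Java, assuming java.util.regex. The class name MatchDump, the method describeMatches, and the sample pattern and texts are invented for illustration. It reports each match with its start and end offsets and the contents of every capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class MatchDump {

    // One line per match: "[start,end) "text" g1=... g2=..."
    static List<String> describeMatches(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            StringBuilder sb = new StringBuilder();
            sb.append('[').append(m.start()).append(',').append(m.end()).append(") \"")
              .append(m.group()).append('"');
            // Group 0 is the whole match, so capturing groups start at 1.
            for (int i = 1; i <= m.groupCount(); i++) {
                sb.append(" g").append(i).append('=').append(m.group(i));
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+)-(\\d+)");
        String[] texts = {"10-20 and 3-4", "no digits here", "7-"};
        for (String text : texts) {
            List<String> matches = describeMatches(pattern, text);
            System.out.println("\"" + text + "\": " + (matches.isEmpty() ? "no match" : matches));
        }
    }
}
```

Note that this only shows where matches were found in the text. It does not show how far into the regex a failed attempt got, which java.util.regex does not expose directly.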

> Once you have mastered that, consider writing a automatic regex
> composer.  Just from the examples it composes a regex.  You then add
> more examples to test the regex, or modify it by hand. Perhaps it even
> generates more examples to clarify your intentions.  You iterate till
> you have a working regex.

This is basically impossible to do automatically.  The formal reason is
that repetition operators match infinitely many strings, so no finite
set of examples can pin a pattern down.  The more
practical reason is that for a number of inputs there are likely
multiple patterns which match.  How do you want to decide whether "abc" 
and "aab" were intended to be matched by /a+b+c*/ or /a+\w+/ or even 
/\w+/?  The story becomes more complicated when adding groups to the 
mix.  You would at least have to identify corresponding regions in all 
strings which you provide which soon will get messy.  Frankly, I'd 
rather be writing expressions by hand - much quicker.
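
The ambiguity is easy to see in a few lines of Java. This is only a sketch: the class name AmbiguousExamples, the helper matchesAll, and the extra probe strings "a1" and "xyz" are assumptions added for illustration. All three candidate patterns accept both examples, and only further inputs tell them apart.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and probe inputs are illustrative only.
class AmbiguousExamples {

    // True if every input is matched in full by the regex.
    static boolean matchesAll(String regex, String... inputs) {
        Pattern p = Pattern.compile(regex);
        for (String s : inputs) {
            if (!p.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] candidates = {"a+b+c*", "a+\\w+", "\\w+"};
        for (String regex : candidates) {
            System.out.println(regex
                    + "  examples: " + matchesAll(regex, "abc", "aab")
                    + "  a1: " + matchesAll(regex, "a1")
                    + "  xyz: " + matchesAll(regex, "xyz"));
        }
    }
}
```

The two given examples cannot separate the candidates. It takes additional inputs such as "a1" or "xyz", chosen by someone who knows the intent, to do that.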

> The thing that bothers me is I am never sure a regex is fully
> debugged.

Typically you would write proper tests for matches and mismatches which 
especially ensure that "near matches" do not accidentally match.  Other 
than that probably no piece of software is fully debugged - ever.  You 
need to test until you have gained enough confidence.  Well, with 
regular expressions it may actually be possible to formally prove that 
they are correct - if you can equally formally specify the possible 
input.  I wouldn't bother to spend the effort; for me testing produces
enough confidence, because I try to carefully pick test cases.
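
One possible shape for such tests, sketched in Java under the assumption that whole-line matching is what is wanted. The class name RegexCases, the method failures, and the date pattern with its sample lines are hypothetical. The "should not match" list is where the near matches go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical test helper; names and sample data are illustrative only.
class RegexCases {

    // Returns a description of every expectation the pattern violates.
    static List<String> failures(Pattern pattern,
                                 List<String> shouldMatch,
                                 List<String> shouldNotMatch) {
        List<String> problems = new ArrayList<>();
        for (String s : shouldMatch) {
            if (!pattern.matcher(s).matches()) {
                problems.add("should match but did not: \"" + s + "\"");
            }
        }
        for (String s : shouldNotMatch) {
            if (pattern.matcher(s).matches()) {
                problems.add("should not match but did: \"" + s + "\"");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Pattern date = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
        List<String> good = Arrays.asList("2011-03-30", "2011-05-27");
        // Near matches: each differs from a valid line in one small way.
        List<String> near = Arrays.asList("2011-3-30", "2011/03/30", "2011-03-30 ");

        List<String> problems = failures(date, good, near);
        System.out.println(problems.isEmpty() ? "all cases pass" : problems.toString());
    }
}
```

An empty result only says the pattern agrees with the chosen cases. How much confidence that gives depends on how carefully the near matches were picked.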

Kind regards

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
