


Re: simple regex pattern sought

From markspace <-@.>
Newsgroups comp.lang.java.programmer
Subject Re: simple regex pattern sought
Date Sat, 26 May 2012 07:57:07 -0700
Organization A noiseless patient Spider
Lines 76
Message-ID <jpqr04$94d$1@dont-email.me> (permalink)
References <e9vvr7p7l8l5kem31v5a37apdlubrqjq5e@4ax.com> <dc4ca9b0-9aa9-4fe1-bbc9-2d3a28250a9d@googlegroups.com> <a2aeesF2s0U1@mid.individual.net> <6sl1s7dpqhg4l0gfa5duva3j8m9rf9opr5@4ax.com>
Mime-Version 1.0
Content-Type text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding 7bit
Injection-Date Sat, 26 May 2012 14:57:09 +0000 (UTC)
Injection-Info mx04.eternal-september.org; posting-host="2kn9RzOWSe/v/hLnHgGT4Q"; logging-data="9357"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ynOWannzHuO8JJULMk8fr32uyuv+/wc0="
User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20120428 Thunderbird/12.0.1
In-Reply-To <6sl1s7dpqhg4l0gfa5duva3j8m9rf9opr5@4ax.com>
Cancel-Lock sha1:UcYKe2Xv9xgPhGADrftrPGxQSTk=
Xref csiph.com comp.lang.java.programmer:14814



On 5/26/2012 6:19 AM, Roedy Green wrote:

>          exercisePattern( Pattern.compile(
>                  "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) );
>          // works, accepts empty strings
>          // (?: ) is a non-capturing group. This is Robert Klemme's
>          // contribution. I don't understand how it works.


Ah, OK, so here's my contribution to your excellent SSCCE.  First, this 
pattern is basically the same as mine.  It uses alternation (the 
vertical bar |) to pick a string delimited by either a single quote (') 
or a double quote (").

Here's his regex string without the extra escapes for Java:

"(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
^^^^^^^^^^^^^^^^
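
To see where the extra escapes go, one can compile the Java literal and 
print the pattern string the regex engine actually receives.  This is 
only a minimal sketch; the class and method names are made up for 
illustration.

```java
import java.util.regex.Pattern;

final class JavaEscapeDemo {
    // The literal as it appears in the quoted Java source.
    static final String JAVA_LITERAL = "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'";

    // Pattern.pattern() returns the string the regex engine was given,
    // i.e. the literal after the Java compiler has processed its escapes.
    static String regexSeenByEngine() {
        return Pattern.compile(JAVA_LITERAL).pattern();
    }

    public static void main(String[] args) {
        System.out.println(regexSeenByEngine());
    }
}
```

Every \\ in the source becomes one \ for the regex engine, and every \" 
becomes a plain ".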

Let's look at just the first half for a moment, without the (?:\\. part.

        "[^\"]*"
        ^^^^^^^^
        12     3
Breaking down the first part:
   1. "        string starts with double quote
   2. [^\"]*   zero or more characters, none of which is a "
   3. "        ends with double quote

Same for the second half of the string.
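
A minimal sketch of that stripped-down pattern (both halves, no escape 
handling yet), assuming a hypothetical helper findAll that collects 
every match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleQuotedDemo {
    // Regex: "[^"]*"|'[^']*'
    // (the \" in the Java literal is only there for the Java compiler)
    static final Pattern SIMPLE = Pattern.compile("\"[^\"]*\"|'[^']*'");

    // Collects every quoted string, delimiters included.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SIMPLE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Input text is: a="x" b='y z'
        System.out.println(findAll("a=\"x\" b='y z'")); // prints ["x", 'y z']
    }
}
```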

Notice he's using * instead of +, which is why his pattern also matches 
empty (zero-length) strings, i.e. "" and ''.
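
The difference between the two quantifiers can be checked with a small 
sketch like this one (class and method names are hypothetical):

```java
import java.util.regex.Pattern;

final class StarVersusPlusDemo {
    // Regex "[^"]*" : zero or more characters between the quotes.
    static boolean matchesWithStar(String s) {
        return Pattern.matches("\"[^\"]*\"", s);
    }

    // Regex "[^"]+" : at least one character between the quotes.
    static boolean matchesWithPlus(String s) {
        return Pattern.matches("\"[^\"]+\"", s);
    }

    public static void main(String[] args) {
        String empty = "\"\"";     // the two-character text ""
        String nonEmpty = "\"x\""; // the three-character text "x"
        System.out.println(matchesWithStar(empty));    // true
        System.out.println(matchesWithPlus(empty));    // false
        System.out.println(matchesWithStar(nonEmpty)); // true
        System.out.println(matchesWithPlus(nonEmpty)); // true
    }
}
```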

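The other part didn't appear in your problem statement, but many 
quoted-string syntaxes let a backslash escape the next character.  
(HTML/XML attribute values themselves use entities such as &quot; 
instead.)  E.g., 'Bob\'s your uncle.'  So his inclusion is very 
reasonable.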
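
To illustrate why the escape branch matters, here is a rough sketch 
that runs the escaped example through the plain pattern and through the 
single-quote half of Robert's pattern.  The names NAIVE, WITH_ESCAPES 
and firstMatch are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedQuoteDemo {
    // Regex '[^']*' : knows nothing about backslash escapes.
    static final Pattern NAIVE = Pattern.compile("'[^']*'");
    // Regex '(?:\\.|[^\'])*' : the single-quote half of Robert's pattern.
    static final Pattern WITH_ESCAPES = Pattern.compile("'(?:\\\\.|[^\\'])*'");

    // Returns the first match in text, or null if there is none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "'Bob\\'s your uncle.'"; // the text 'Bob\'s your uncle.'
        System.out.println(firstMatch(NAIVE, text));        // 'Bob\'
        System.out.println(firstMatch(WITH_ESCAPES, text)); // 'Bob\'s your uncle.'
    }
}
```

The plain pattern stops at the escaped quote; the version with the 
\\. branch consumes the backslash and the quote together and carries on 
to the real closing quote.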

So Robert adds (\\.|[^\"])* to the first part, which is

        (\\.|[^\"])*
        12 345     6

1. Start a group
2. A backslash.  It needs to be escaped for regex, hence \\
3. . is regex "any character".  2 and 3 together mean "match \ followed 
by any character"
4. OR (alternation again)
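5. character class, negated (the ^), meant to match anything except \ 
or ".  I think this is a mistake:  as written, \" is just an escaped ", 
so the class excludes only the quote.  To exclude the backslash too, the 
\ needs to be escaped:  [^\\"]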
6. zero or more.
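
The concern in step 5 can be made concrete with a sketch like the 
following.  It compares the class as posted, [^\"], with a variant that 
also excludes the backslash, [^\\"], on a string whose only closing 
quote is escaped.  The variant and all names here are assumptions for 
illustration, not part of Robert's post.

```java
import java.util.regex.Pattern;

final class BackslashInClassDemo {
    // Regex "(?:\\.|[^\"])*"  : class excludes only the double quote.
    static final Pattern AS_POSTED =
            Pattern.compile("\"(?:\\\\.|[^\\\"])*\"");
    // Regex "(?:\\.|[^\\"])*" : class excludes backslash and double quote.
    static final Pattern BACKSLASH_EXCLUDED =
            Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"");

    static boolean wholeInputMatches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String unterminated = "\"abc\\\""; // the text "abc\"  (final quote is escaped)
        String wellFormed = "\"a\\\"b\"";  // the text "a\"b"
        System.out.println(wholeInputMatches(AS_POSTED, unterminated));          // true
        System.out.println(wholeInputMatches(BACKSLASH_EXCLUDED, unterminated)); // false
        System.out.println(wholeInputMatches(AS_POSTED, wellFormed));            // true
        System.out.println(wholeInputMatches(BACKSLASH_EXCLUDED, wellFormed));   // true
    }
}
```

With the class as posted, backtracking lets the lone \ be matched as an 
ordinary character, so the escaped quote ends up closing the string.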

Then after that mess, he does the obvious thing and makes the group 
non-capturing with (?: ), so the regex does a little less work.

   "(?:\\.|[^\"])*"

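What "non-capturing" changes can be seen by asking a Matcher how many 
capturing groups the pattern has.  A minimal sketch, with a hypothetical 
helper capturingGroups:

```java
import java.util.regex.Pattern;

final class GroupCountDemo {
    // Matcher.groupCount() reports the number of capturing groups
    // in the pattern (group 0, the whole match, is not counted).
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Regex "(\\.|[^\"])*"   : plain parentheses capture.
        System.out.println(capturingGroups("\"(\\\\.|[^\\\"])*\""));   // 1
        // Regex "(?:\\.|[^\"])*" : (?: ) groups without capturing.
        System.out.println(capturingGroups("\"(?:\\\\.|[^\\\"])*\"")); // 0
    }
}
```

Both forms accept the same strings; the second just doesn't record what 
the group matched.
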
Phew!  Next, he adds one alternation and does the same for a ' delimited 
string.

|'(?:\\.|[^\'])*'

Same thing, just ' instead of ".
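
Putting both halves together, a sketch along these lines pulls every 
quoted string out of a line of text.  The sample input and the names 
are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FullPatternDemo {
    // Regex: "(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
    static final Pattern QUOTED =
            Pattern.compile("\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'");

    // Collects every quoted string, delimiters included.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Input text is: a="x\"y" b='p\'q' c=""
        for (String hit : findAll("a=\"x\\\"y\" b='p\\'q' c=\"\"")) {
            System.out.println(hit);
        }
    }
}
```

On that input it finds three strings: one double-quoted with an escaped 
quote inside, one single-quoted with an escaped quote inside, and one 
empty.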

Finally I think this could be simplified slightly with Lew's 
back-reference idea.

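(['"])(?:\\.|(?!\1)[^\\])*\1

(Untested.)  A back-reference can't be used inside a character class, so 
the negative lookahead (?!\1) does that job, and the trailing \1 matches 
the closing quote.  This allows empty strings between delimiters;  use + 
instead of * for only non-empty strings between the quotes.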
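
A minimal sketch for trying out a back-reference version in Java.  It 
assumes the negative-lookahead form (?!\1)[^\\] followed by a closing 
\1, because java.util.regex does not accept a back-reference inside a 
character class.  Class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class BackReferenceDemo {
    // Regex: (['"])(?:\\.|(?!\1)[^\\])*\1
    //   (['"])       capture the opening quote as group 1
    //   \\.          a backslash followed by any character
    //   (?!\1)[^\\]  any character that is neither that quote nor a backslash
    //   \1           the same quote again, closing the string
    static final Pattern QUOTED =
            Pattern.compile("(['\"])(?:\\\\.|(?!\\1)[^\\\\])*\\1");

    static boolean isQuotedString(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isQuotedString("\"it's\""));     // "it's"      -> true
        System.out.println(isQuotedString("'a\\'b'"));      // 'a\'b'      -> true
        System.out.println(isQuotedString("\"\""));         // ""          -> true
        System.out.println(isQuotedString("\"mismatch'"));  // "mismatch'  -> false
    }
}
```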



My executive summary:

Regex is a great rapid development tool, except when it isn't.  You do 
realize that your problem is simple, and that you could have hand-coded 
a parser for it much more quickly than all these newsgroup posts took?
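
For comparison, a hand-coded scanner might look roughly like the sketch 
below.  It is one possible shape, not anyone's posted code:  the names 
are invented, and it simply skips an opening quote that never gets 
closed.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedStringScanner {
    // Returns every quoted string in text, delimiters included.
    static List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != '"' && c != '\'') {
                i++;
                continue;
            }
            int end = findClosingQuote(text, i);
            if (end < 0) {
                i++; // unterminated: skip this quote and keep scanning
                continue;
            }
            found.add(text.substring(i, end + 1));
            i = end + 1;
        }
        return found;
    }

    // Index of the quote that closes the one at position open, or -1.
    static int findClosingQuote(String text, int open) {
        char quote = text.charAt(open);
        for (int j = open + 1; j < text.length(); j++) {
            char c = text.charAt(j);
            if (c == '\\') {
                j++; // skip the escaped character
            } else if (c == quote) {
                return j;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Input text is: a="x\"y" b='p\'q' c=""
        for (String s : scan("a=\"x\\\"y\" b='p\\'q' c=\"\"")) {
            System.out.println(s);
        }
    }
}
```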


