
simple regex pattern sought

Started by: Roedy Green <see_website@mindprod.com.invalid>
First post: 2012-05-25 14:45 -0700
Last post: 2012-05-26 14:14 -0700
Articles: 20 (6 participants)



Contents

  simple regex pattern sought Roedy Green <see_website@mindprod.com.invalid> - 2012-05-25 14:45 -0700
    Re: simple regex pattern sought markspace <-@.> - 2012-05-25 14:55 -0700
    Re: simple regex pattern sought Lew <lewbloch@gmail.com> - 2012-05-25 14:55 -0700
      Re: simple regex pattern sought markspace <-@.> - 2012-05-25 15:04 -0700
        Re: simple regex pattern sought Lew <noone@lewscanon.com> - 2012-05-26 14:07 -0700
          Re: simple regex pattern sought markspace <-@.> - 2012-05-26 18:34 -0700
            Re: simple regex pattern sought Lew <noone@lewscanon.com> - 2012-05-27 11:39 -0700
      Re: simple regex pattern sought Lew <lewbloch@gmail.com> - 2012-05-25 15:03 -0700
      Re: simple regex pattern sought Robert Klemme <shortcutter@googlemail.com> - 2012-05-26 00:12 +0200
        Re: simple regex pattern sought markspace <-@.> - 2012-05-25 18:43 -0700
          Re: simple regex pattern sought Robert Klemme <shortcutter@googlemail.com> - 2012-05-26 16:37 +0200
            Re: simple regex pattern sought markspace <-@.> - 2012-05-26 08:06 -0700
              Re: simple regex pattern sought Robert Klemme <shortcutter@googlemail.com> - 2012-05-26 17:34 +0200
                Re: simple regex pattern sought Peter Duniho <NpOeStPeAdM@NnOwSlPiAnMk.com> - 2012-05-26 10:07 -0700
        Re: simple regex pattern sought Roedy Green <see_website@mindprod.com.invalid> - 2012-05-26 06:19 -0700
          Re: simple regex pattern sought markspace <-@.> - 2012-05-26 07:19 -0700
          Re: simple regex pattern sought markspace <-@.> - 2012-05-26 07:57 -0700
            Re: simple regex pattern sought Robert Klemme <shortcutter@googlemail.com> - 2012-05-26 17:13 +0200
              Re: simple regex pattern sought markspace <-@.> - 2012-05-26 10:08 -0700
                Re: simple regex pattern sought Roedy Green <see_website@mindprod.com.invalid> - 2012-05-26 14:14 -0700

#14799 — simple regex pattern sought

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-05-25 14:45 -0700
Subject: simple regex pattern sought
Message-ID: <e9vvr7p7l8l5kem31v5a37apdlubrqjq5e@4ax.com>
I often have to search for things of the form 

"xxxxx"
or 
'xxxxx'

where xxx is anything not " or '.  It might be Russian or English or
any other language.

What is the cleanest way to do that?
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
I would be quite surprised if the NSA (National Security Agency)
did not have a computer program to scan bits of shredded
documents and electronically put them back together like a giant
jigsaw puzzle. This suggests you cannot just shred, you must also burn.
.



#14800

From: markspace <-@.>
Date: 2012-05-25 14:55 -0700
Message-ID: <jpov3n$75g$1@dont-email.me>
In reply to: #14799
On 5/25/2012 2:45 PM, Roedy Green wrote:
> I often have to search for things of the form
>
> "xxxxx"
> or
> 'xxxxx'
>
> where xxx is anything not " or '.  It might be Russian or English or
> any other language.
>
> What is the cleanest way to do that?


Would this work?

  '[^']+'|"[^"]+"
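A minimal sketch of how this alternation could be exercised from Java, assuming a hypothetical class name QuoteAlternationDemo and a made-up sample sentence (neither is from the thread). Each branch only excludes its own delimiter, so the other quote type and non-Latin text pass through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample text are illustrative only.
final class QuoteAlternationDemo {
    // The regex '[^']+'|"[^"]+" written as a Java string literal.
    static final Pattern QUOTED = Pattern.compile("'[^']+'|\"[^\"]+\"");

    // Collects every quoted string found in the text, delimiters included.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The escapes in the second quoted string spell a Russian word.
        String text = "say \"John's restaurant\" or '\u043f\u0440\u0438\u0432\u0435\u0442'";
        for (String hit : findAll(text)) {
            System.out.println(hit);
        }
    }
}
```

The apostrophe inside the double-quoted string does not end the match, because the double-quote branch only looks for the next double quote.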



#14801

From: Lew <lewbloch@gmail.com>
Date: 2012-05-25 14:55 -0700
Message-ID: <dc4ca9b0-9aa9-4fe1-bbc9-2d3a28250a9d@googlegroups.com>
In reply to: #14799
Roedy Green wrote:
> I often have to search for things of the form 
> 
> "xxxxx"
> or 
> 'xxxxx'
> 
> where xxx is anything not " or '.  It might be Russian or English or
> any other language.
> 
> What is the cleanest way to do that?

Using a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

-- 
Lew



#14802

From: markspace <-@.>
Date: 2012-05-25 15:04 -0700
Message-ID: <jpovld$9la$1@dont-email.me>
In reply to: #14801
On 5/25/2012 2:55 PM, Lew wrote:

>
> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
>

This would match "John's restaurant" as "John'.

The first quote matches ", John does not contain either ' or " as 
specified, and the last character class matches the '.  Not I think what 
is wanted.
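A small sketch that reproduces this observation, assuming a hypothetical helper class MixedQuoteClassDemo (not from the thread). It runs the character-class pattern against the restaurant example and reports the first match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex and the sample text come from the thread.
final class MixedQuoteClassDemo {
    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Opening and closing delimiters are both "either quote", so they need not agree.
        String regex = "[\"'][^\"']+[\"']";
        System.out.println(firstMatch(regex, "\"John's restaurant\""));
    }
}
```

The match stops at the apostrophe, because the closing character class accepts either quote type.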



#14820

From: Lew <noone@lewscanon.com>
Date: 2012-05-26 14:07 -0700
Message-ID: <jprgls$vnb$1@news.albasani.net>
In reply to: #14802
markspace wrote:
> Lew wrote:
>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
>>
> This would match "John's restaurant" as "John'.
>
> The first quote matches ", John does not contain either ' or " as specified,
> and the last character class matches the '. Not I think what is wanted.

As I corrected in my very next post.

-- 
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg



#14827

From: markspace <-@.>
Date: 2012-05-26 18:34 -0700
Message-ID: <jps0a9$58k$1@dont-email.me>
In reply to: #14820
On 5/26/2012 2:07 PM, Lew wrote:
> markspace wrote:
>> Lew wrote:
>>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
>>> don't know.
>>>
>> This would match "John's restaurant" as "John'.
>>
>> The first quote matches ", John does not contain either ' or " as
>> specified,
>> and the last character class matches the '. Not I think what is wanted.
>
> As I corrected in my very next post.
>


Unfortunately that one doesn't work either.  The central part, [^"'], 
doesn't allow a match of a ' if the starting delimiter was a ", and that 
doesn't match Roedy's spec.  "John's restaurant" wouldn't be matched at 
all, because the matcher couldn't match past the ' to get to the ".

I think the easiest is to write out a grammar for the expression, then 
translate to regex.

QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING

SQUOTED_STRING := ' NON_S_QUOTE + '

DQUOTED_STRING := " NON_D_QUOTE + "

NON_S_QUOTE := [^']

NON_D_QUOTE := [^"]

At this point the grammar is very clear.  (Note I haven't included 
Robert's \x escape sequences.)  I think it's worth learning to use antlr 
rather than regex, which tends to obfuscate more than it helps. 
However, a literal translation into regex isn't hard, and a literal 
translation avoids mis-optimizations.
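One way the literal translation might look in Java, as a sketch only: each grammar rule becomes a string constant and the rules are concatenated in the same order as the grammar. The class name QuotedStringGrammar is an assumption made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class: one constant per grammar rule, combined bottom-up.
final class QuotedStringGrammar {
    static final String NON_S_QUOTE = "[^']";
    static final String NON_D_QUOTE = "[^\"]";

    // SQUOTED_STRING := ' NON_S_QUOTE + '
    static final String SQUOTED_STRING = "'" + NON_S_QUOTE + "+'";
    // DQUOTED_STRING := " NON_D_QUOTE + "
    static final String DQUOTED_STRING = "\"" + NON_D_QUOTE + "+\"";
    // QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING
    static final String QUOTED_STRING = SQUOTED_STRING + "|" + DQUOTED_STRING;

    static final Pattern PATTERN = Pattern.compile(QUOTED_STRING);

    public static void main(String[] args) {
        System.out.println(QUOTED_STRING);
        System.out.println(PATTERN.matcher("\"John's restaurant\"").matches());
    }
}
```

Keeping the rule names as constants means the regex can be read against the grammar line by line.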



#14835

From: Lew <noone@lewscanon.com>
Date: 2012-05-27 11:39 -0700
Message-ID: <jptscb$t82$1@news.albasani.net>
In reply to: #14827
markspace wrote:
> Lew wrote:
>> markspace wrote:
>>> Lew wrote:
>>>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
>>>> don't know.
>>>>
>>> This would match "John's restaurant" as "John'.
>>>
>>> The first quote matches ", John does not contain either ' or " as
>>> specified,
>>> and the last character class matches the '. Not I think what is wanted.
>>
>> As I corrected in my very next post.
>
> Unfortunately that one doesn't work either. The central part, [^"'], doesn't
> allow a match of a ' if the starting delimiter was a ", and that doesn't match
> Roedy's spec. "John's restaurant" wouldn't be matched at all, because the
> matcher couldn't match past the ' to get to the ".
>
> I think the easiest is to write out a grammar for the expression, then
> translate to regex.
>
> QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING
>
> SQUOTED_STRING := ' NON_S_QUOTE + '
>
> DQUOTED_STRING := " NON_D_QUOTE + "
>
> NON_S_QUOTE := [^']
>
> NON_D_QUOTE := [^"]
>
> At this point the grammar is very clear. (Note I haven't included Robert's \x
> escape sequences.) I think it's worth learning to use antlr rather than regex,
> which tends to obfuscate more than it helps. However, a literal translation
> into regex isn't hard, and a literal translation avoids mis-optimizations.

Very illuminating. Thank you.

-- 
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg



#14803

From: Lew <lewbloch@gmail.com>
Date: 2012-05-25 15:03 -0700
Message-ID: <9c1a694e-cfe8-4fed-bdb2-0550c5c2d288@googlegroups.com>
In reply to: #14801
On Friday, May 25, 2012 2:55:07 PM UTC-7, Lew wrote:
> Roedy Green wrote:
> > I often have to search for things of the form 
> > 
> > "xxxxx"
> > or 
> > 'xxxxx'
> > 
> > where xxx is anything not " or '.  It might be Russian or English or
> > any other language.
> > 
> > What is the cleanest way to do that?
> 
> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

"([\"'])[^\"']+\\1"

That way you match the opening quote.

(The extra backslashes are to escape the characters in the string. Regex sees one fewer per each set.)
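A short sketch of both points, assuming a hypothetical class BackReferenceDemo and an invented sample line. Printing the compiled pattern shows what the regex engine receives after Java has processed the string literal, and the sample shows that \1 forces the closing quote to equal the opening one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample text is made up for illustration.
final class BackReferenceDemo {
    // Java literal with doubled backslashes; the engine sees (["'])[^"']+\1
    static final Pattern SAME_QUOTE = Pattern.compile("([\"'])[^\"']+\\1");

    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SAME_QUOTE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(SAME_QUOTE.pattern());
        // The third item opens with a double quote and closes with a single quote.
        for (String hit : findAll("\"abc\" 'def' \"ghi'")) {
            System.out.println(hit);
        }
    }
}
```

The mismatched third item is skipped, since the back-reference only accepts the quote that group 1 captured.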

-- 
Lew



#14804

From: Robert Klemme <shortcutter@googlemail.com>
Date: 2012-05-26 00:12 +0200
Message-ID: <a2aeesF2s0U1@mid.individual.net>
In reply to: #14801
On 25.05.2012 23:55, Lew wrote:
> Roedy Green wrote:
>> I often have to search for things of the form
>>
>> "xxxxx"
>> or
>> 'xxxxx'
>>
>> where xxx is anything not " or '.  It might be Russian or English or
>> any other language.
>>
>> What is the cleanest way to do that?
>
> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

That does not match quoting properly.  Better do something like

"([\"'])[^\"']*\\1"

Still I prefer

"\"[^\"]*\"|'[^']*'"

Because it allows for quotes of the other type inside quotes.

With proper escaping (using \ as escape char, any other works, too) this 
becomes

"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"

Kind regards

	robert


package rx;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Quotes {

   private static final Pattern Q1 = Pattern.compile("([\"'])[^\"']*\\1");
   private static final Pattern Q2 = Pattern.compile("\"[^\"]*\"|'[^']*'");
   private static final Pattern Q3 = 
Pattern.compile("\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'");

   public static void main(String[] args) {
     System.out.println(Q1);
     for (final Matcher m = Q1.matcher("'a' \"b\" 'c'"); m.find();) {
       System.out.println(m.group());
     }

     System.out.println(Q2);
     for (final Matcher m = Q2.matcher("'a' \"b\" 'c'"); m.find();) {
       System.out.println(m.group());
     }

     System.out.println(Q3);
     for (final Matcher m = Q3.matcher("'a' \"\\\"b\" 'c'"); m.find();) {
       System.out.println(m.group());
     }
   }

}


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



#14807

From: markspace <-@.>
Date: 2012-05-25 18:43 -0700
Message-ID: <jppch0$f18$1@dont-email.me>
In reply to: #14804
On 5/25/2012 3:12 PM, Robert Klemme wrote:

> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"


This looks overly baroque to me.  You don't need to backslash-escape single 
quotes ' in a Java string, and I don't think you need to in a regex 
either (although I didn't check that).  I'm also not seeing the need for 
the parentheses around the character classes [] (but again, without 
having tried it, I could be wrong).  And the dot . inside the 
parentheses just looks wrong.

Great post overall though.



#14813

From: Robert Klemme <shortcutter@googlemail.com>
Date: 2012-05-26 16:37 +0200
Message-ID: <a2c84rF3pmU1@mid.individual.net>
In reply to: #14807
On 26.05.2012 03:43, markspace wrote:
> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>
>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>
>
> This looks overly baroque to me. You don't need to escape \ single
> quotes ' in a Java string,

I didn't.

> and I don't think you need to in a regex
> either (although I didn't check that).

There is no regexp escaping of single quotes either.  The only 
regexp escaping you can see is the \\\\, which translates into \\ in the 
string, which is a literal backslash for the regexp engine.

> I'm also not seeing the need for
> the parenthesis around the character classes [] (but again, without
> having tried it, I could be wrong).

It's not parentheses around character classes but around the alternative 
of "match a backslash followed by any char" and "any char which is not 
backslash or the opening quote type of this string variant".

> And the dot . inside the parenthesis just looks wrong.

It isn't - see above.

> Great post overall though.

Thank you!  It does seem to need some time to sink in though... :-)

Kind regards

	robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



#14815

From: markspace <-@.>
Date: 2012-05-26 08:06 -0700
Message-ID: <jpqri7$cl5$1@dont-email.me>
In reply to: #14813
On 5/26/2012 7:37 AM, Robert Klemme wrote:
> On 26.05.2012 03:43, markspace wrote:
>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>>
>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>>
...
>> and I don't think you need to in a regex
>> either (although I didn't check that).
>
> There is also no regexp escaping of single quotes either. The only
> regexp escaping you can see are the \\\\ which translate into \\ in the
> string which is a literal backslash for the regexp engine.


Yes, there is, although I think it's a typo.  Both \\\" and \\' get 
passed to the regex as \" and \', which means just a single character " 
and ' respectively.

You're right about the rest of it though.  With so many \'s floating 
around, I have a hard time reading Java regex!


> It's not parenthesis around character classes but around the alternative
> of "match a backslash followed by any char" and "any char which is not
> backslash or the opening quote type of this string variant".


Yup, I totally missed this too.  Thanks for pointing it out.



#14817

From: Robert Klemme <shortcutter@googlemail.com>
Date: 2012-05-26 17:34 +0200
Message-ID: <a2cbh0Fbj4U1@mid.individual.net>
In reply to: #14815
On 26.05.2012 17:06, markspace wrote:
> On 5/26/2012 7:37 AM, Robert Klemme wrote:
>> On 26.05.2012 03:43, markspace wrote:
>>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>>>
>>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>>>
> ...
>>> and I don't think you need to in a regex
>>> either (although I didn't check that).
>>
>> There is also no regexp escaping of single quotes either. The only
>> regexp escaping you can see are the \\\\ which translate into \\ in the
>> string which is a literal backslash for the regexp engine.
>
>
> Yes, there is, although I think it's a typo. Both \\\" and \\' get
> passed to the regex as \" and \', which means just a single character "
> and ' respectively.

Right you are - both times: there is regexp escaping and it was in fact 
a typo (missing \\)!
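A sketch of what the missing \\ changes in practice, assuming a hypothetical class EscapedQuoteClassDemo and a hand-picked input. With the pattern as posted, the negated class excludes only the quote, so a backslash can also be consumed as an ordinary character once the engine backtracks. With the extra \\ the class excludes the backslash too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the two patterns differ only in the negated classes.
final class EscapedQuoteClassDemo {
    // As posted: the regex classes are [^\"] and [^\'], which exclude only the quote.
    static final Pattern AS_POSTED =
            Pattern.compile("\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'");

    // With the missing \\ added: the classes exclude backslash and quote.
    static final Pattern CORRECTED =
            Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|'(?:\\\\.|[^\\\\'])*'");

    // Returns the first match of the pattern in the text, or null if there is none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Four characters: quote, a, backslash, quote. The string is never terminated,
        // because the final quote is escaped.
        String unterminated = "\"a\\\"";
        System.out.println(firstMatch(AS_POSTED, unterminated));
        System.out.println(firstMatch(CORRECTED, unterminated));
    }
}
```

On well-formed input the two patterns agree. The difference shows up only when the engine has to backtrack over an escape sequence, as in the unterminated sample here.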

> You're right about the rest of it though. With so many \'s floating
> around, I have a hard time reading Java regex!

That's true for other languages as well - the basic reason is that the 
same character is used for

  - escaping in strings
  - escaping in regular expressions
  - escaping in the source text (in this case we could pick another 
character)

>> It's not parenthesis around character classes but around the alternative
>> of "match a backslash followed by any char" and "any char which is not
>> backslash or the opening quote type of this string variant".
>
>
> Yup, I totally missed this too. Thanks for pointing it out.

You're welcome!  Thank you again for finding the missing escape.

Cheers

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



#14818

From: Peter Duniho <NpOeStPeAdM@NnOwSlPiAnMk.com>
Date: 2012-05-26 10:07 -0700
Message-ID: <rempy0kwlswk.15pwi33dd9ku2$.dlg@40tude.net>
In reply to: #14817
On Sat, 26 May 2012 17:34:49 +0200, Robert Klemme wrote:

> [...]
>> You're right about the rest of it though. With so many \'s floating
>> around, I have a hard time reading Java regex!
> 
> That's true for other languages as well

Not C#, which allows string literals to be prefaced with the @ symbol to
disable compiler escaping.

In fact, I'll bet C# wasn't the first language to have such a feature.
Surely there are many other languages that also avoid the issue.



#14810

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-05-26 06:19 -0700
Message-ID: <6sl1s7dpqhg4l0gfa5duva3j8m9rf9opr5@4ax.com>
In reply to: #14804
On Sat, 26 May 2012 00:12:34 +0200, Robert Klemme
<shortcutter@googlemail.com> wrote, quoted or indirectly quoted
someone who said :

>On 25.05.2012 23:55, Lew wrote:
>> Roedy Green wrote:
>>> I often have to search for things of the form
>>>
>>> "xxxxx"
>>> or
>>> 'xxxxx'
>>>
>>> where xxx is anything not " or '.  It might be Russian or English or
>>> any other language.
/*
 * [TestRegexFindQuotedString.java]
 *
 * Summary: Finding a quoted String with a regex.
 *
 * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2012-05-25 initial release
 */
package com.mindprod.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
 * Finding a quoted String with a regex.
 *
 * @author Roedy Green, Canadian Mind Products
 * @version 1.0 2012-05-25 initial release
 * @since 2012-05-25
 */
public class TestRegexFindQuotedString
    {
    // ------------------------------ CONSTANTS ------------------------------

    private static final String lookIn = "George said \"that's the ticket\"." +
                                         " Jeb replied '\"ticket?\" what ticket'." +
                                         " \"How na\u00efve!\"." +
                                         " empty: \"\"" +
                                         " 'unbalanced\"";

    // -------------------------- STATIC METHODS --------------------------

    /**
     * exercise that pattern to see what it can find
     */
    static void exercisePattern( Pattern pattern )
        {
        out.println();
        out.println( "Pattern: " + pattern.toString() );
        // Matchers are used both for matching and finding.
        final Matcher m = pattern.matcher( lookIn );
        while ( m.find() )
            {
            out.println( m.group( 0 ) );
            }
        }

    // --------------------------- main() method ---------------------------

    /**
     * test harness
     *
     * @param args not used
     */
    public static void main( String[] args )
        {
        // We want to find Strings of the form "xx'xx" or 'xx"xx'
        // We want to avoid the following problems:
        // 1. Works even if String contains foreign languages, even Russian or accented letters.
        // 2. If starts with " must end with ", if starts with ' must end with '.
        // 3. ' is ok inside "...", and " is ok inside '...'
        // 4. We don't worry about how to use ' inside '...'.

        // here are some suggested techniques:

        exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) );  // fails 1 2 3

        exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) );  // fails 2 3

        exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.

        exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.

        exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.

        exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
        // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
        }
    }
-- 
Roedy Green Canadian Mind Products
http://mindprod.com



#14812

From: markspace <-@.>
Date: 2012-05-26 07:19 -0700
Message-ID: <jpqop1$s7h$1@dont-email.me>
In reply to: #14810
On 5/26/2012 6:19 AM, Roedy Green wrote:

>          exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); //
> works, rejects empty strings by Mark Space.


If you want it to accept empty strings, replace the +'s with *'s.  You 
didn't specify empty strings in your original problem statement, so I 
decided to disallow them.
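The difference between the two quantifiers can be sketched as follows, assuming a hypothetical class EmptyQuoteDemo and a sample line that contains only empty quoted strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing + (one or more) with * (zero or more).
final class EmptyQuoteDemo {
    static final Pattern NON_EMPTY = Pattern.compile("\"[^\"]+\"|'[^']+'");
    static final Pattern MAY_BE_EMPTY = Pattern.compile("\"[^\"]*\"|'[^']*'");

    static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "empty: \"\" and ''";
        System.out.println(findAll(NON_EMPTY, text));     // nothing between the quotes, so no hits
        System.out.println(findAll(MAY_BE_EMPTY, text));  // both empty strings are found
    }
}
```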

Thanks for posting that SSCCE, btw.  I was too lazy to cook one up.



#14814

From: markspace <-@.>
Date: 2012-05-26 07:57 -0700
Message-ID: <jpqr04$94d$1@dont-email.me>
In reply to: #14810
On 5/26/2012 6:19 AM, Roedy Green wrote:

>          exercisePattern( Pattern.compile(
> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
> empty strings
>          // (?: ) is a non-capturing group. This is Robert Klemme's
> contribution. I don't understand how it works.


Ah, OK, so here's my contribution to your excellent SSCCE.  First, this 
pattern is basically the same as mine.  It uses alternation (the 
vertical bar |) to pick a string delimited by either ' or ".

Here's his regex string without the extra escapes for Java:

"(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
^^^^^^^^^^^^^^^^

Let's look at just the first half for a moment, without the (?:\\. part.

        "[^\"]*"
        ^^^^^^^^
        12     3
Example for the first part:
   1. "        string starts with double quote
   2. [^\"]*   doesn't contain a "
   3. "        ends with double quote

Same for the second half of the string.

Notice he's using * instead of +'s, which is why his matches 0 width 
strings.

The other part didn't appear in your problem statement, but in HTML/XML 
it's allowed to escape characters.  E.g., 'Bob\'s your uncle.'  So his 
inclusion is very reasonable.

So Robert adds (\\.|[^\"])* to the first part, which is
               12 345     6

1. Start a group
2. A backslash.  It needs to be escaped for regex, hence \\.
3. . is regex "any character".  2 and 3 together mean "match \ followed 
by any character"
4. OR (alternation again)
5. character class, negated (the ^), matches anything except \ or ".  I 
think this is a mistake:  the \ needs to be quoted.
6. zero or more.

Then after that mess, he does the obvious thing and adds a non-capturing 
group, to make the regex do a little less work.

   "(?:\\.|[^\"])*"

Phew!  Next, he adds one alternation and does the same for a ' delimited 
string.

|'(?:\\.|[^\'])*'

Same thing, just ' instead of ".
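The walk-through above can also be carried inside the pattern itself. This is a sketch only, assuming a hypothetical class CommentedQuoteRegex: it uses Java's Pattern.COMMENTS flag, which ignores whitespace in the pattern and treats # as the start of a comment. It shows the double-quote half only and uses the character class with the backslash excluded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: the double-quote half of the pattern, annotated piece by piece.
final class CommentedQuoteRegex {
    static final Pattern DOUBLE_QUOTED = Pattern.compile(
            "\"            # opening double quote\n" +
            "(?:           # start of a non-capturing group\n" +
            "  \\\\.       # a backslash followed by any character\n" +
            "  |           # or\n" +
            "  [^\\\\\"]   # one character that is neither backslash nor double quote\n" +
            ")*            # the group repeats zero or more times\n" +
            "\"            # closing double quote\n",
            Pattern.COMMENTS);

    // Returns the first match in the text, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = DOUBLE_QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The sample contains two escaped quotes inside the quoted string.
        System.out.println(firstMatch("say \"a \\\"b\\\" c\" now"));
    }
}
```

The single-quote half would follow the same shape with ' in place of the double quote.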

Finally I think this could be simplified slightly with Lew's 
back-reference idea.

(['"])(?:\\.|[^\1\\])*

(Untested.)  This allows empty strings between delimiters;  instead of a 
* use + for only non-empty strings between the quotes.



My executive summary:

Regex is a great rapid development tool, except when it isn't.  You 
realize your problem is simple, and you could have hand-coded a parser 
to do this much quicker than all these news post exchanges?
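For comparison, a hand-coded scanner might look roughly like this. It is a sketch under stated assumptions, not anyone's posted solution: the class name QuotedStringScanner is hypothetical, a backslash is taken to escape the next character, empty strings are accepted, and an opening quote with no matching close is skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner for "..." and '...' with backslash escapes.
final class QuotedStringScanner {
    // Returns every quoted string in the text, delimiters included.
    static List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != '"' && c != '\'') {
                i++;                      // not an opening quote, keep looking
                continue;
            }
            int end = findClosing(text, i);
            if (end < 0) {
                i++;                      // unterminated: skip this quote character
                continue;
            }
            found.add(text.substring(i, end + 1));
            i = end + 1;                  // resume after the closing quote
        }
        return found;
    }

    // Index of the quote that closes the one at position open, or -1 if none.
    static int findClosing(String text, int open) {
        char quote = text.charAt(open);
        for (int j = open + 1; j < text.length(); j++) {
            char c = text.charAt(j);
            if (c == '\\') {
                j++;                      // skip the escaped character
            } else if (c == quote) {
                return j;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "George said \"that's the ticket\". 'Bob\\'s your uncle.' 'unbalanced\"";
        for (String s : scan(text)) {
            System.out.println(s);
        }
    }
}
```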



#14816

From: Robert Klemme <shortcutter@googlemail.com>
Date: 2012-05-26 17:13 +0200
Message-ID: <a2ca88Ft90U1@mid.individual.net>
In reply to: #14814
On 26.05.2012 16:57, markspace wrote:
> On 5/26/2012 6:19 AM, Roedy Green wrote:
>
>> exercisePattern( Pattern.compile(
>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
>> empty strings
>> // (?: ) is a non-capturing group. This is Robert Klemme's
>> contribution. I don't understand how it works.
>
>
> Ah, OK, so here's my contribution to your excellent SSCCE. First this
> pattern is basically the same as mine. It uses alternation (the vertical
> bar |) to pick a string delimited by either ' or "
>
> Here's his regex string without the extra escapes for Java:
>
> "(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
> ^^^^^^^^^^^^^^^^
>
> Let's look at just the first half for a moment, without the (?:\\. part.
>
> "[^\"]*"
> ^^^^^^^^
> 12 3
> Example for the first part:
> 1. " string starts with double quote
> 2. [^\"]* doesn't contain a "
> 3. " ends with double quote
>
> Same for the second half of the string.
>
> Notice he's using * instead of +'s, which is why his matches 0 width
> strings.
>
> The other part didn't appear in your problem statement, but in HTML/XML
> it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
> inclusion is very reasonable.
>
> So he Robert adds (\\.|[^\"])* to the first part, which is
> 12 345 6
>
> 1. Start a group
> 2. A slash. It needs to be escaped for regex, hence \\.
> 3. . is regex "any character". 2 and 3 together mean "match \ followed
> by any character"
> 4. OR (alternation again)
> 5. character class, negated (the ^), matches anything except \ or ". I
> think this is a mistake: the \ needs to be quoted.

Oh, right, thanks for finding that!

> 6. zero or more.
>
> Then after that mess, he does the obvious thing and adds non-capturing
> group, to make the regex do a little less work.
>
> "(?:\\.|[^\"])*"
>
> Phew! Next, he adds one alternation and does the same for a ' delimited
> string.
>
> |'(?:\\.|[^\'])*'
>
> Same thing, just ' instead of ".
>
> Finally I think this could be simplified slightly with Lew's
> back-reference idea.
>
> (['"])(?:\\.|[^\1\\])*
>
> (Untested.) This allows empty strings between delimiters; instead of a *
> use + for only non-empty strings between the quotes.

Interesting approach - but it doesn't work.  Simple test with 
Pattern.compile("(.)[a\\1]"):

Exception in thread "main" java.util.regex.PatternSyntaxException: 
Illegal/unsupported escape sequence near index 6
(.)[a\1]
       ^
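The same check can be written so that it reports the failure instead of aborting. This is a sketch with a hypothetical class name, BackRefInClassDemo; it assumes the JDK in use rejects a back-reference inside a character class, as the exception above shows for the JDK used in this thread.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class: checks whether a regex compiles at all.
final class BackRefInClassDemo {
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Back-reference inside a character class: expected to be rejected.
        System.out.println(compiles("(.)[a\\1]"));
        // The same back-reference outside the class is ordinary syntax.
        System.out.println(compiles("(.)a\\1"));
    }
}
```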

> My executive summary:
>
> Regex is a great rapid development tool, except when it isn't. You
> realize your problem is simple, and you could have hand-coded a parser
> to do this much quicker than all these news post exchanges?

Maybe, maybe not.

Kind regards

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



#14819

From: markspace <-@.>
Date: 2012-05-26 10:08 -0700
Message-ID: <jpr2nb$pbb$1@dont-email.me>
In reply to: #14816
On 5/26/2012 8:13 AM, Robert Klemme wrote:
> On 26.05.2012 16:57, markspace wrote:
>> Finally I think this could be simplified slightly with Lew's
>> back-reference idea.
>>
>> (['"])(?:\\.|[^\1\\])*
>>
>> (Untested.) This allows empty strings between delimiters; instead of a *
>> use + for only non-empty strings between the quotes.
>
> Interesting approach - but it doesn't work. Simple test with
> Pattern.compile("(.)[a\\1]"):
>
> Exception in thread "main" java.util.regex.PatternSyntaxException:
> Illegal/unsupported escape sequence near index 6
> (.)[a\1]
> ^


Yup, [] is for characters, and \1 could be a string.  Gets rejected.  I 
think you could use "negative lookahead" to say "not this string" when 
parsing.  Gets kinda ugly though.

<http://www.regular-expressions.info/conditional.html>

Java:

   "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"

Regex:

   (['"])(?:\\.|(?!\1|\\).)+\1
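A small sketch of the lookahead version in use, assuming a hypothetical class LookaheadQuoteDemo and an invented sample sentence. The negative lookahead (?!\1|\\) stands in for the character class that could not hold \1: each ordinary character must be neither the opening quote nor a backslash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the Java form given above.
final class LookaheadQuoteDemo {
    static final Pattern QUOTED =
            Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");

    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // One double-quoted string with an apostrophe, one single-quoted string with \'.
        String text = "He said \"John's restaurant\" and 'Bob\\'s your uncle.'";
        for (String hit : findAll(text)) {
            System.out.println(hit);
        }
    }
}
```

Because of the +, this form rejects empty strings; replacing it with * would accept them.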

I re-did Roedy's test program to be a bit more clear about what it was 
looking for, and the results.  This could be even cleaner if it was run 
with a JUnit test harness.

At this point though the regex is basically just a mess.  Download antlr 
and get an XML/HTML grammar from online.



package quicktest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
  *
  * @author Brenden
  */
public class MindProdRegex {

}

/*
  * [TestRegexFindQuotedString.java]
  *
  * Summary: Finding a quoted String with a regex.
  *
  * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
  *
  * Licence: This software may be copied and used freely for any purpose but military.
  *          http://mindprod.com/contact/nonmil.html
  *
  * Requires: JDK 1.7+
  *
  * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
  *
  * Version History:
  *  1.0 2012-05-25 initial release
  */

/**
  * Finding a quoted String with a regex.
  *
  * @author Roedy Green, Canadian Mind Products
  * @version 1.0 2012-05-25 initial release
  * @since 2012-05-25
  */
class TestRegexFindQuotedString
     {
     // ------------------------------ CONSTANTS ------------------------------

     // Pairs of (text to search, expected first match).
     private static final String[] vectors =
           {"Basic: George said \"that's the ticket\".",
                "\"that's the ticket\"",
            "Nested: Jeb replied '\"ticket?\" what ticket'.",
                "'\"ticket?\" what ticket'",
            "Non-ASCII: \"How na\u00efve!\".",
                "\"How na\u00efve!\"",
            " empty: \"\"xx",
               "\"\"",
            " escaped: 'Bob\\'s your uncle.'",
               "'Bob\\'s your uncle.'",
            " 'unbalanced\"",
               "",
           };

     // -------------------------- STATIC METHODS --------------------------

     /**
      * exercise that pattern to see what it can find
      */
     static void exercisePattern( Pattern pattern )
         {
         out.println();
         out.println( "Pattern: " + pattern.toString() );
            for( int i = 0; i < vectors.length; i+=2 ) {
               String test = vectors[i];
               String result = vectors[i+1];
               final Matcher m = pattern.matcher( test );
               boolean found = m.find();
               boolean correct = false;
               String groupString = null;
               if( found ) {
                  correct = m.group(0).equals( result );
                  groupString = m.group();
               }
               System.out.println( test+", found: "+ found +
", correct: "+correct+" ("+groupString+")");
            }
         }

     // --------------------------- main() method ---------------------------

     /**
      * test harness
      *
      * @param args not used
      */
     public static void main( String[] args )
         {
         // We want to find Strings of the form "xx'xx" or 'xx"xx'
         // We want to avoid the following problems:
         // 1. Works even if String contains foreign languages, even Russian or accented letters.
         // 2. If starts with " must end with ", if starts with ' must end with '.
         // 3. ' is ok inside "...", and " is ok inside '...'
         // 4. We don't worry about how to use ' inside '...'.

         // here are some suggested techniques:

         exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) );  // fails 1 2 3

         exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3

         exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.

         exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.
         exercisePattern( Pattern.compile( "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) ); // works, rejects empty strings by Mark Space.

         exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.
         exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
         // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
         }
     }



#14822

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-05-26 14:14 -0700
Message-ID: <6kh2s7hbkam8ojb3mq9onmflikn12ibh85@4ax.com>
In reply to: #14819
On Sat, 26 May 2012 10:08:58 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>I re-did Roedy's test program to be a bit more clear about what it was 
>looking for, and the results.  This could be even cleaner if it was run 
>with a JUnit test harness.

Thanks Brendan.  I have incorporated your suggestions plus a bit more
polishing.

See http://mindprod.com/jgloss/regex.html#FINDQUOTED

for a formatted listing + output.

 The next task, probably procrastinated, is to solve it with a little
finite state automaton that decodes \x as well, and a simpler version
without.  If a newbie is interested in tackling that, they can look at
my Java snippet parser as part of JPrep/JDisplay and strip it down. 
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
