Re: simple regex pattern sought

From markspace <-@.>
Newsgroups comp.lang.java.programmer
Subject Re: simple regex pattern sought
Date 2012-05-26 10:08 -0700
Organization A noiseless patient Spider
Message-ID <jpr2nb$pbb$1@dont-email.me>
References (1 earlier) <dc4ca9b0-9aa9-4fe1-bbc9-2d3a28250a9d@googlegroups.com> <a2aeesF2s0U1@mid.individual.net> <6sl1s7dpqhg4l0gfa5duva3j8m9rf9opr5@4ax.com> <jpqr04$94d$1@dont-email.me> <a2ca88Ft90U1@mid.individual.net>


On 5/26/2012 8:13 AM, Robert Klemme wrote:
> On 26.05.2012 16:57, markspace wrote:
>> Finally I think this could be simplified slightly with Lew's
>> back-reference idea.
>>
>> (['"])(?:\\.|[^\1\\])*
>>
>> (Untested.) This allows empty strings between delimiters; instead of a *
>> use + for only non-empty strings between the quotes.
>
> Interesting approach - but it doesn't work. Simple test with
> Pattern.compile("(.)[a\\1]"):
>
> Exception in thread "main" java.util.regex.PatternSyntaxException:
> Illegal/unsupported escape sequence near index 6
> (.)[a\1]
> ^


Yup, [] is for single characters, and \1 could match a whole string, so 
it gets rejected inside a character class.  I think you could use a 
"negative lookahead" to say "not this string" at each position instead.  
Gets kinda ugly though.

<http://www.regular-expressions.info/conditional.html>

Java:

   "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"

Regex:

   (['"])(?:\\.|(?!\1|\\).)+\1
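For anyone who wants to poke at this in isolation, here is a small sketch. The class and method names are made up for illustration. It checks that Java refuses a back-reference inside [], then uses the lookahead form: group 1 captures the opening quote, and each body character is either a backslash-escaped pair or any character that is neither the opening quote nor a backslash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class LookaheadQuoteDemo {

    // Regex: (['"])(?:\\.|(?!\1|\\).)+\1
    // (?!\1|\\). consumes one character only if it is not the opening
    // quote and not a backslash; \\. consumes an escaped pair such as \'
    static final Pattern QUOTED =
            Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");

    /** Returns the first non-empty quoted string in text, or null if none. */
    static String firstQuoted(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    /** True if compiling a back-reference inside a character class fails. */
    static boolean backrefInClassRejected() {
        try {
            Pattern.compile("(.)[a\\1]");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("back-reference in [] rejected: " + backrefInClassRejected());
        System.out.println(firstQuoted("George said \"that's the ticket\"."));
        System.out.println(firstQuoted(" escaped: 'Bob\\'s your uncle.'"));
        System.out.println(firstQuoted(" 'unbalanced\""));
    }
}
```

Since . does not match line terminators by default, a quoted string that spans lines would need Pattern.DOTALL as well.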

I re-did Roedy's test program to be a bit more clear about what it was 
looking for, and the results.  This could be even cleaner if it were run 
with a JUnit test harness.

At this point though the regex is basically just a mess.  Download ANTLR 
and find an XML/HTML grammar online.



package quicktest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
  *
  * @author Brenden
  */
public class MindProdRegex {

}

/*
  * [TestRegexFindQuotedString.java]
  *
  * Summary: Finding a quoted String with a regex.
  *
  * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
  *
  * Licence: This software may be copied and used freely for any purpose but military.
  *          http://mindprod.com/contact/nonmil.html
  *
  * Requires: JDK 1.7+
  *
  * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
  *
  * Version History:
  *  1.0 2012-05-25 initial release
  */

/**
  * Finding a quoted String with a regex.
  *
  * @author Roedy Green, Canadian Mind Products
  * @version 1.0 2012-05-25 initial release
  * @since 2012-05-25
  */
class TestRegexFindQuotedString
     {
     // ------------------------------ CONSTANTS ------------------------------

     // pairs: test input, then the expected match ("" means no match expected)
     private static final String[] vectors =
           {"Basic: George said \"that's the ticket\".",
                "\"that's the ticket\"",
            "Nested: Jeb replied '\"ticket?\"what ticket'.",
                "'\"ticket?\"what ticket'",
            "Non-ASCII: \"How na\u00efve!\".",
                "\"How na\u00efve!\"",
            " empty: \"\"xx",
                "\"\"",
            " escaped: 'Bob\\'s your uncle.'",
                "'Bob\\'s your uncle.'",
            " 'unbalanced\"",
                "",
           };

     // -------------------------- STATIC METHODS--------------------------

     /**
      * exercise that pattern to see what it can find
      */
     static void exercisePattern( Pattern pattern )
         {
         out.println();
         out.println( "Pattern: " + pattern.toString() );
            for( int i = 0; i < vectors.length; i+=2 ) {
               String test = vectors[i];
               String expected = vectors[i+1];  // "" means no match is expected
               final Matcher m = pattern.matcher( test );
               boolean found = m.find();
               String groupString = found ? m.group() : null;
               // correct if the match equals the expected text, or if nothing
               // was found and nothing was expected
               boolean correct = found ? groupString.equals( expected )
                                       : expected.isEmpty();
               out.println( test + ", found: " + found +
                     ", correct: " + correct + " (" + groupString + ")" );
            }
         }

     // --------------------------- main() method---------------------------

     /**
      * test harness
      *
      * @param args not used
      */
     public static void main( String[] args )
         {
         // We want to find Strings of the form "xx'xx" or 'xx"xx'
         // The pattern should meet the following requirements:
         // 1. Works even if String contains foreign languages, even Russian or accented letters.
         // 2. If starts with " must end with ", if starts with ' must end with '.
         // 3. ' is ok inside "...", and " is ok inside '...'
         // 4. We don't worry about how to use ' inside '...'.

         // here are some suggested techniques:

         // fails 1 2 3
         exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) );

         // fails 2 3
         exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) );

         // fails 3, uses a capturing group.
         exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) );

         // works, rejects empty strings. By Mark Space.
         exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) );

         // works, rejects empty strings. By Mark Space.
         exercisePattern( Pattern.compile( "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) );

         // works, accepts empty strings. By Robert Klemme.
         exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) );

         // works, accepts empty strings
         // (?: ) is a non-capturing group. This is Robert Klemme's contribution.
         // I don't understand how it works.
         exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) );
         }
     }
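
On the JUnit remark: short of pulling in a test framework, the same idea can be sketched as a table of input/expected pairs plus a failure count. This is only a rough illustration and the names (QuotedStringCheck, countFailures) are invented here. It assumes an empty expected value means "no match should be found", and it only makes sense for patterns that never produce an empty match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedStringCheck {

    // Each row: { input, expected first match }; "" means no match expected.
    static final String[][] CASES = {
        { "Basic: George said \"that's the ticket\".", "\"that's the ticket\"" },
        { "Nested: Jeb replied '\"ticket?\"what ticket'.", "'\"ticket?\"what ticket'" },
        { " escaped: 'Bob\\'s your uncle.'", "'Bob\\'s your uncle.'" },
        { " 'unbalanced\"", "" },
    };

    /** Counts the rows whose first match differs from the expected text. */
    static int countFailures(Pattern pattern) {
        int failures = 0;
        for (String[] row : CASES) {
            Matcher m = pattern.matcher(row[0]);
            String actual = m.find() ? m.group() : "";
            if (!actual.equals(row[1])) {
                failures++;
                System.out.println("FAIL: " + row[0] + " -> (" + actual + ")");
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Pattern lookahead = Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");
        System.out.println("failures: " + countFailures(lookahead));
    }
}
```

A JUnit version would turn each row into its own assertion, so one failing case would not hide the others.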
