From: markspace <-@.>
Newsgroups: comp.lang.java.programmer
Subject: Re: simple regex pattern sought
Date: Sat, 26 May 2012 10:08:58 -0700
Organization: A noiseless patient Spider
References: <6sl1s7dpqhg4l0gfa5duva3j8m9rf9opr5@4ax.com>

On 5/26/2012 8:13 AM, Robert Klemme wrote:
> On 26.05.2012 16:57, markspace wrote:
>> Finally I think this could be simplified slightly with Lew's
>> back-reference idea.
>>
>> (['"])(?:\\.|[^\1\\])*
>>
>> (Untested.) This allows empty strings between delimiters; instead of a *
>> use + for only non-empty strings between the quotes.
>
> Interesting approach - but it doesn't work. Simple test with
> Pattern.compile("(.)[a\\1]"):
>
> Exception in thread "main" java.util.regex.PatternSyntaxException:
> Illegal/unsupported escape sequence near index 6
> (.)[a\1]
>       ^

Yup, [] is a character class, so it only takes single characters, and \1
could stand for a whole string. So it gets rejected.

I think you could use "negative lookahead" to say "not this string" when
parsing. Gets kinda ugly though.

Java: "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"
Regex: (['"])(?:\\.|(?!\1|\\).)+\1

Here (?!\1|\\). matches any single character, provided that what starts at
that position is neither the opening quote captured in group 1 nor a
backslash.

I re-did Roedy's test program to be a bit clearer about what it was
looking for, and the results. This could be even cleaner if it were run
with a JUnit test harness.
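As a side note, separate from the full test program, here is a minimal sketch of the two points above: a back-reference inside [] is rejected at compile time, while the negative-lookahead form compiles and matches. The class name LookaheadDemo, its helper methods, and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LookaheadDemo {
    // Regex: (['"])(?:\\.|(?!\1|\\).)+\1
    // Group 1 captures the opening quote; the lookahead stops the body
    // at that same quote character unless it is backslash-escaped.
    static final Pattern QUOTED =
        Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");

    /** Returns the first quoted string found in the input, or null if none. */
    public static String firstQuoted(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    /** Returns true if the regex compiles, false on a syntax error. */
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Back-reference inside a character class: Robert's failing case.
        System.out.println("(.)[a\\1] compiles: " + compiles("(.)[a\\1]"));
        // An escaped quote inside the string does not end the match.
        System.out.println(firstQuoted("say 'Bob\\'s your uncle.' now"));
    }
}
```

The lookahead costs one extra check per character, which is the price of sharing a single alternative between both quote styles.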
At this point, though, the regex is basically just a mess. Download ANTLR
and get an XML/HTML grammar for it online.

package quicktest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import static java.lang.System.out;

/**
 *
 * @author Brenden
 */
public class MindProdRegex {
}

/*
 * [TestRegexFindQuotedString.java]
 *
 * Summary: Finding a quoted String with a regex.
 *
 * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2012-05-25 initial release
 */

/**
 * Finding a quoted String with a regex.
 *
 * @author Roedy Green, Canadian Mind Products
 * @version 1.0 2012-05-25 initial release
 * @since 2012-05-25
 */
class TestRegexFindQuotedString {
    // ------------------------------ CONSTANTS ------------------------------

    // Pairs of: text to search, expected first match.
    private static final String[] vectors = {
        "Basic: George said \"that's the ticket\".", "\"that's the ticket\"",
        "Nested: Jeb replied '\"ticket?\"what ticket'.", "'\"ticket?\"what ticket'",
        "Non-ASCII: \"How na\u00efve!\".", "\"How na\u00efve!\"",
        " empty: \"\"xx", "\"\"",
        " escaped: 'Bob\\'s your uncle.'", "'Bob\\'s your uncle.'",
        " 'unbalanced\"", "",
    };

    // -------------------------- STATIC METHODS --------------------------

    /**
     * exercise that pattern to see what it can find
     */
    static void exercisePattern( Pattern pattern ) {
        out.println();
        out.println( "Pattern: " + pattern.toString() );
        for( int i = 0; i < vectors.length; i+=2 ) {
            String test = vectors[i];
            String result = vectors[i+1];
            final Matcher m = pattern.matcher( test );
            boolean found = m.find();
            boolean correct = false;
            String groupString = null;
            if( found ) {
                correct = m.group(0).equals( result );
                groupString = m.group();
            }
            System.out.println( test+", found: "+ found + ", correct: "+correct+" ("+groupString+")");
        }
    }

    // --------------------------- main() method ---------------------------

    /**
     * test harness
     *
     * @param args not used
     */
    public static void main( String[] args ) {
        // We want to find Strings of the form "xx'xx" or 'xx"xx'
        // We want to avoid the following problems:
        // 1. Works even if String contains foreign languages, even Russian or accented letters.
        // 2. If starts with " must end with ", if starts with ' must end with '.
        // 3. ' is ok inside "...", and " is ok inside '...'
        // 4. We don't worry about how to use ' inside '...'.
        // here are some suggested techniques:
        exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3
        exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3
        exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.
        exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.
        exercisePattern( Pattern.compile( "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) ); // works, rejects empty strings by Mark Space.
        exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.
        exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
        // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
    }
}
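On the JUnit remark earlier: a rough sketch of what a self-checking version might look like, using plain Java instead of JUnit so it runs without extra libraries. The class name QuotedStringCheck, the countFailures helper, and the trimmed-down case table are assumptions for illustration, not part of Roedy's program. Unlike the program above, it treats "no match" as the correct result for the unbalanced case.

```java
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStringCheck {
    // Each row: {input, expected first match}; null means "no match expected".
    static final String[][] CASES = {
        {"George said \"that's the ticket\".", "\"that's the ticket\""},
        {"Jeb replied '\"ticket?\"what ticket'.", "'\"ticket?\"what ticket'"},
        {"'Bob\\'s your uncle.'", "'Bob\\'s your uncle.'"},
        {" 'unbalanced\"", null},
    };

    /** Counts the cases whose first match differs from the expectation. */
    public static int countFailures(Pattern pattern) {
        int failures = 0;
        for (String[] c : CASES) {
            Matcher m = pattern.matcher(c[0]);
            String actual = m.find() ? m.group() : null;
            if (!Objects.equals(actual, c[1])) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // The lookahead pattern from this post.
        Pattern lookahead = Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");
        // The second pattern from the list above, which mixes up the quote styles.
        Pattern naive = Pattern.compile("[\"'][^\"']+[\"']");
        System.out.println("lookahead failures: " + countFailures(lookahead));
        System.out.println("naive failures: " + countFailures(naive));
    }
}
```

With a real JUnit harness each row would become its own test, so a failure would name the offending input instead of just bumping a counter.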