From: Robert Klemme
Newsgroups: comp.lang.java.programmer
Subject: Re: simple regex pattern sought
Date: Sat, 26 May 2012 17:13:05 +0200

On 26.05.2012 16:57, markspace wrote:
> On 5/26/2012 6:19 AM, Roedy Green wrote:
>
>> exercisePattern( Pattern.compile(
>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
>> empty strings
>> // (?: ) is a non-capturing group. This is Robert Klemme's
>> contribution. I don't understand how it works.
>
>
> Ah, OK, so here's my contribution to your excellent SSCCE. First this
> pattern is basically the same as mine. It uses alternation (the vertical
> bar |) to pick a string delimited by either ' or "
>
> Here's his regex string without the extra escapes for Java:
>
> "(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
> ^^^^^^^^^^^^^^^^
>
> Let's look at just the first half for a moment, without the (?:\\. part.
>
> "[^\"]*"
> ^^^^^^^^
> 12     3
> Example for the first part:
> 1. " string starts with double quote
> 2. [^\"]* doesn't contain a "
> 3. " ends with double quote
>
> Same for the second half of the string.
>
> Notice he's using * instead of +'s, which is why his matches 0 width
> strings.
>
> The other part didn't appear in your problem statement, but in HTML/XML
> it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
> inclusion is very reasonable.
>
> So he Robert adds (\\.|[^\"])* to the first part, which is
>                   12 345     6
>
> 1. Start a group
> 2. A slash. It needs to be escaped for regex, hence \\.
> 3. . is regex "any character". 2 and 3 together mean "match \ followed
> by any character"
> 4. OR (alternation again)
> 5. character class, negated (the ^), matches anything except \ or ". I
> think this is a mistake: the \ needs to be quoted.

Oh, right, thanks for finding that!

> 6. zero or more.
>
> Then after that mess, he does the obvious thing and adds non-capturing
> group, to make the regex do a little less work.
>
> "(?:\\.|[^\"])*"
>
> Phew! Next, he adds one alternation and does the same for a ' delimited
> string.
>
> |'(?:\\.|[^\'])*'
>
> Same thing, just ' instead of ".
>
> Finally I think this could be simplified slightly with Lew's
> back-reference idea.
>
> (['"])(?:\\.|[^\1\\])*
>
> (Untested.) This allows empty strings between delimiters; instead of a *
> use + for only non-empty strings between the quotes.

Interesting approach - but it doesn't work. Simple test with
Pattern.compile("(.)[a\\1]"):

Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 6
(.)[a\1]
      ^

> My executive summary:
>
> Regex is a great rapid development tool, except when it isn't. You
> realize your problem is simple, and you could have hand-coded a parser
> to do this much quicker than all these news post exchanges?

Maybe, maybe not.

Kind regards

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
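
For anyone who wants to try the two points above (the backslash missing from the character class, and the back-reference that is rejected inside [...]), here is a minimal sketch. The class name QuotedStrings and the helpers findAll and compiles are made up for illustration. The BACKREF pattern is not from this thread: it is one assumed workaround that swaps the illegal [^\1\\] for a negative lookahead (?!\1) in front of [^\\], and adds the closing \1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; names are illustrative, not from the thread.
class QuotedStrings {

    // Alternation form discussed in the thread, with the backslash now
    // excluded from both character classes (the fix markspace pointed out).
    // Unescaped regex: "(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'
    static final Pattern ALTERNATION = Pattern.compile(
            "\"(?:\\\\.|[^\\\\\"])*\"|'(?:\\\\.|[^\\\\'])*'");

    // Assumed workaround for the back-reference idea: (?!\1) rejects the
    // opening delimiter, since \1 is not accepted inside a character class.
    // Unescaped regex: (['"])(?:\\.|(?!\1)[^\\])*\1
    static final Pattern BACKREF = Pattern.compile(
            "(['\"])(?:\\\\.|(?!\\1)[^\\\\])*\\1");

    // Collects every match of p in input, delimiters included.
    static List<String> findAll(Pattern p, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // Reports whether java.util.regex accepts the given pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Input text: a='Bob\'s your uncle.' b="" c="say \"hi\""
        String input = "a='Bob\\'s your uncle.' b=\"\" c=\"say \\\"hi\\\"\"";
        System.out.println(findAll(ALTERNATION, input));
        System.out.println(findAll(BACKREF, input));
        // The test pattern from this post, a back-reference inside [...]:
        System.out.println(compiles("(.)[a\\1]"));
    }
}
```

Both patterns use * so empty strings between the delimiters are accepted; replacing it with + would require at least one character, as markspace notes.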