From: markspace <-@.>
Newsgroups: comp.lang.java.programmer
Subject: Re: simple regex pattern sought
Date: Sat, 26 May 2012 07:57:07 -0700

On 5/26/2012 6:19 AM, Roedy Green wrote:

> exercisePattern( Pattern.compile(
> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
> // (?: ) is a non-capturing group. This is Robert Klemme's
> contribution. I don't understand how it works.

Ah, OK, so here's my contribution to your excellent SSCCE.

First, this pattern is basically the same as mine. It uses alternation
(the vertical bar |) to pick a string delimited by either ' or "

Here's his regex string without the extra escapes for Java:

"(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
^^^^^^^^^^^^^^^^

Let's look at just the first half for a moment, without the (?:\\. part.

"[^\"]*"
^^^^^^^^
12     3

Example for the first part:

1. " string starts with double quote
2. [^\"]* doesn't contain a "
3. " ends with double quote

Same for the second half of the string. Notice he's using * instead of +,
which is why his matches 0 width strings.
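
Here's a minimal sketch of that simplified first-half/second-half pattern, to show the * versus + difference on an empty string. The class name QuotedStringDemo and the helper findAll are made up for illustration, and I've dropped the redundant backslash inside the character class since [^"] means the same thing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStringDemo {
    // De-escaped regex: "[^"]*"|'[^']*'  (zero or more characters between the quotes)
    static final Pattern STAR = Pattern.compile("\"[^\"]*\"|'[^']*'");

    // Same idea with +: at least one character is required between the quotes
    static final Pattern PLUS = Pattern.compile("\"[^\"]+\"|'[^']+'");

    // Hypothetical helper: collect every match of p in input, in order
    static List<String> findAll(Pattern p, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "a=\"one\" b='two' c=\"\"";
        // The * version also reports the empty string ""
        System.out.println(findAll(STAR, input));
        // The + version skips it
        System.out.println(findAll(PLUS, input));
    }
}
```

With this input the * version finds three strings (the last one empty) and the + version finds only the first two.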

The other part didn't appear in your problem statement, but in HTML/XML
it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
inclusion is very reasonable.

So Robert adds

(\\.|[^\"])*
12 345     6

to the first part, which is

1. Start a group
2. A backslash. It needs to be escaped for regex, hence \\
3. . is regex "any character". 2 and 3 together mean "match \ followed
   by any character"
4. OR (alternation again)
5. character class, negated (the ^), meant to match anything except \ or ".
   I think this is a mistake: as written, \" is just an escaped ", so the
   class excludes only the quote. The \ needs to be quoted itself: [^\\"]
6. zero or more.

Then after that mess, he does the obvious thing and makes the group
non-capturing, to make the regex do a little less work.

"(?:\\.|[^\"])*"

Phew! Next, he adds one alternation and does the same for a ' delimited
string.

|'(?:\\.|[^\'])*'

Same thing, just ' instead of ".

Finally, I think this could be simplified slightly with Lew's
back-reference idea. A back-reference isn't interpreted inside a character
class, so the "not the opening quote" test has to be a negative lookahead,
and the closing delimiter needs a \1 as well:

(['"])(?:\\.|(?!\1)[^\\])*\1

(Untested.) This allows empty strings between delimiters; instead of a *
use + for only non-empty strings between the quotes.

My executive summary: Regex is a great rapid development tool, except
when it isn't. You realize your problem is simple, and you could have
hand-coded a parser to do this much quicker than all these news post
exchanges?
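
For anyone who wants to poke at the character class question from point 5, here's a rough sketch comparing the double-quote half as posted with a version whose class also excludes the backslash, plus a back-reference variant. The class name EscapedQuoteDemo and the helper firstMatch are hypothetical, and writing the back-reference idea with a negative lookahead, (?!\1), is my assumption about how to express it in java.util.regex, not something from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuoteDemo {
    // As posted: "(?:\\.|[^\"])*"   the class excludes only the quote
    static final Pattern AS_POSTED = Pattern.compile("\"(?:\\\\.|[^\\\"])*\"");

    // Class also excludes the backslash: "(?:\\.|[^\\"])*"
    static final Pattern STRICT = Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"");

    // Back-reference variant (assumed form): (['"])(?:\\.|(?!\1)[^\\])*\1
    static final Pattern BACKREF =
            Pattern.compile("(['\"])(?:\\\\.|(?!\\1)[^\\\\])*\\1");

    // Hypothetical helper: first match of p in input, or null if there is none
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Well-formed string with escaped quotes: both versions take all of it
        String good = "\"say \\\"hi\\\"\"";
        System.out.println(firstMatch(AS_POSTED, good));
        System.out.println(firstMatch(STRICT, good));

        // Unterminated string whose last quote is escaped: "abc\"
        // AS_POSTED backtracks, lets [^"] eat the backslash, and still matches.
        // STRICT cannot do that and finds nothing.
        String bad = "\"abc\\\"";
        System.out.println(firstMatch(AS_POSTED, bad));
        System.out.println(firstMatch(STRICT, bad));

        // One pattern for both delimiters
        System.out.println(firstMatch(BACKREF, "'Bob\\'s your uncle.'"));
        System.out.println(firstMatch(BACKREF, "\"it's\""));
    }
}
```

The unterminated input is the case where the two character classes actually behave differently; on well-formed strings they agree.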