| Started by | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| First post | 2012-05-25 14:45 -0700 |
| Last post | 2012-05-26 14:14 -0700 |
| Articles | 20 — 6 participants |
simple regex pattern sought Roedy Green <see_website@mindprod.com.invalid> - 2012-05-25 14:45 -0700
Re: simple regex pattern sought markspace <-@.> - 2012-05-25 14:55 -0700
Re: simple regex pattern sought Lew <lewbloch@gmail.com> - 2012-05-25 14:55 -0700
Re: simple regex pattern sought markspace <-@.> - 2012-05-25 15:04 -0700
Re: simple regex pattern sought Lew <noone@lewscanon.com> - 2012-05-26 14:07 -0700
Re: simple regex pattern sought markspace <-@.> - 2012-05-26 18:34 -0700
Re: simple regex pattern sought Lew <noone@lewscanon.com> - 2012-05-27 11:39 -0700
Re: simple regex pattern sought Lew <lewbloch@gmail.com> - 2012-05-25 15:03 -0700
Re: simple regex pattern sought Robert Klemme <shortcutter@googlemail.com> - 2012-05-26 00:12 +0200
Re: simple regex pattern sought markspace <-@.> - 2012-05-25 18:43 -0700
Re: simple regex pattern sought Robert Klemme <shortcutter@googlemail.com> - 2012-05-26 16:37 +0200
Re: simple regex pattern sought markspace <-@.> - 2012-05-26 08:06 -0700
Re: simple regex pattern sought Robert Klemme <shortcutter@googlemail.com> - 2012-05-26 17:34 +0200
Re: simple regex pattern sought Peter Duniho <NpOeStPeAdM@NnOwSlPiAnMk.com> - 2012-05-26 10:07 -0700
Re: simple regex pattern sought Roedy Green <see_website@mindprod.com.invalid> - 2012-05-26 06:19 -0700
Re: simple regex pattern sought markspace <-@.> - 2012-05-26 07:19 -0700
Re: simple regex pattern sought markspace <-@.> - 2012-05-26 07:57 -0700
Re: simple regex pattern sought Robert Klemme <shortcutter@googlemail.com> - 2012-05-26 17:13 +0200
Re: simple regex pattern sought markspace <-@.> - 2012-05-26 10:08 -0700
Re: simple regex pattern sought Roedy Green <see_website@mindprod.com.invalid> - 2012-05-26 14:14 -0700
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-05-25 14:45 -0700 |
| Subject | simple regex pattern sought |
| Message-ID | <e9vvr7p7l8l5kem31v5a37apdlubrqjq5e@4ax.com> |
I often have to search for things of the form

"xxxxx"
or
'xxxxx'

where xxx is anything not " or '. It might be Russian or English or
any other language.

What is the cleanest way to do that?
--
Roedy Green Canadian Mind Products
http://mindprod.com
I would be quite surprised if the NSA (National Security Agency)
did not have a computer program to scan bits of shredded
documents and electronically put them back together like a giant
jigsaw puzzle. This suggests you cannot just shred, you must also burn.
| From | markspace <-@.> |
|---|---|
| Date | 2012-05-25 14:55 -0700 |
| Message-ID | <jpov3n$75g$1@dont-email.me> |
| In reply to | #14799 |
On 5/25/2012 2:45 PM, Roedy Green wrote:
> I often have to search for things of the form
>
> "xxxxx"
> or
> 'xxxxx'
>
> where xxx is anything not " or '. It might be Russian or English or
> any other language.
>
> What is the cleanest way to do that?

Would this work?

'[^']+'|"[^"]+"
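A minimal sketch of how that alternation could be tried from Java. The class name AlternationDemo and the helper findQuoted are made up for illustration; only Pattern and Matcher come from the JDK. Note that only the double quotes need escaping inside the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Regex as seen by the engine: '[^']+'|"[^"]+"
    static final Pattern QUOTED = Pattern.compile("'[^']+'|\"[^\"]+\"");

    // Hypothetical helper: collect every quoted run found in the text.
    static List<String> findQuoted(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The apostrophe inside the double-quoted run does not end the match.
        for (String hit : findQuoted("say \"John's restaurant\" or 'na\u00efve'")) {
            System.out.println(hit);
        }
    }
}
```

Because each alternative only excludes its own delimiter, the other quote character is free to appear inside the match.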
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-05-25 14:55 -0700 |
| Message-ID | <dc4ca9b0-9aa9-4fe1-bbc9-2d3a28250a9d@googlegroups.com> |
| In reply to | #14799 |
Roedy Green wrote:
> I often have to search for things of the form
>
> "xxxxx"
> or
> 'xxxxx'
>
> where xxx is anything not " or '. It might be Russian or English or
> any other language.
>
> What is the cleanest way to do that?

Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

--
Lew
| From | markspace <-@.> |
|---|---|
| Date | 2012-05-25 15:04 -0700 |
| Message-ID | <jpovld$9la$1@dont-email.me> |
| In reply to | #14801 |
On 5/25/2012 2:55 PM, Lew wrote:
>
> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
>

This would match "John's restaurant" as "John'.

The first quote matches ", John does not contain either ' or " as
specified, and the last character class matches the '. Not I think
what is wanted.
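The objection above can be checked with a few lines of Java. This is only a sketch: the class MismatchedQuoteDemo and the helper firstMatch are hypothetical names, and the pattern is the one quoted from Lew's post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MismatchedQuoteDemo {
    // Regex as seen by the engine: ["'][^"']+["']
    static final Pattern EITHER = Pattern.compile("[\"'][^\"']+[\"']");

    // Hypothetical helper: the first match in the text, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = EITHER.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The opening " is paired with the apostrophe, so the match stops at John'
        System.out.println(firstMatch("\"John's restaurant\""));
    }
}
```

The closing character class accepts either quote character, so nothing ties the closing delimiter to the opening one.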
| From | Lew <noone@lewscanon.com> |
|---|---|
| Date | 2012-05-26 14:07 -0700 |
| Message-ID | <jprgls$vnb$1@news.albasani.net> |
| In reply to | #14802 |
markspace wrote:
> Lew wrote:
>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
>>
> This would match "John's restaurant" as "John'.
>
> The first quote matches ", John does not contain either ' or " as specified,
> and the last character class matches the '. Not I think what is wanted.

As I corrected in my very next post.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
| From | markspace <-@.> |
|---|---|
| Date | 2012-05-26 18:34 -0700 |
| Message-ID | <jps0a9$58k$1@dont-email.me> |
| In reply to | #14820 |
On 5/26/2012 2:07 PM, Lew wrote:
> markspace wrote:
>> Lew wrote:
>>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
>>> don't know.
>>>
>> This would match "John's restaurant" as "John'.
>>
>> The first quote matches ", John does not contain either ' or " as
>> specified,
>> and the last character class matches the '. Not I think what is wanted.
>
> As I corrected in my very next post.
>

Unfortunately that one doesn't work either. The central part, [^"'],
doesn't allow a match of a ' if the starting delimiter was a ", and that
doesn't match Roedy's spec. "John's restaurant" wouldn't be matched at
all, because the matcher couldn't match past the ' to get to the ".

I think the easiest is to write out a grammar for the expression, then
translate to regex.

QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING

SQUOTED_STRING := ' NON_S_QUOTE + '

DQUOTED_STRING := " NON_D_QUOTE + "

NON_S_QUOTE := [^']

NON_D_QUOTE := [^"]

At this point the grammar is very clear. (Note I haven't included
Robert's \x escape sequences.) I think it's worth learning to use antlr
rather than regex, which tends to obfuscate more than it helps. However,
a literal translation into regex isn't hard, and a literal translation
avoids mis-optimizations.
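One way to make the "literal translation" concrete is to build the regex from string constants named after the grammar's productions. This is a sketch rather than anything posted in the thread: the class name GrammarToRegex is an assumption, and the constants simply mirror the grammar above.

```java
import java.util.regex.Pattern;

class GrammarToRegex {
    // Each constant mirrors one production of the grammar.
    static final String NON_S_QUOTE = "[^']";
    static final String NON_D_QUOTE = "[^\"]";
    static final String SQUOTED_STRING = "'" + NON_S_QUOTE + "+'";
    static final String DQUOTED_STRING = "\"" + NON_D_QUOTE + "+\"";
    static final String QUOTED_STRING = SQUOTED_STRING + "|" + DQUOTED_STRING;

    static final Pattern QUOTED = Pattern.compile(QUOTED_STRING);

    public static void main(String[] args) {
        // Shows the assembled regex, then checks the example from the discussion.
        System.out.println(QUOTED_STRING);
        System.out.println(QUOTED.matcher("\"John's restaurant\"").matches());
    }
}
```

Keeping one constant per production makes it easy to see which part of the regex to change if the grammar changes, at the cost of a slightly longer listing.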
| From | Lew <noone@lewscanon.com> |
|---|---|
| Date | 2012-05-27 11:39 -0700 |
| Message-ID | <jptscb$t82$1@news.albasani.net> |
| In reply to | #14827 |
markspace wrote:
> Lew wrote:
>> markspace wrote:
>>> Lew wrote:
>>>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
>>>> don't know.
>>>>
>>> This would match "John's restaurant" as "John'.
>>>
>>> The first quote matches ", John does not contain either ' or " as
>>> specified,
>>> and the last character class matches the '. Not I think what is wanted.
>>
>> As I corrected in my very next post.
>
> Unfortunately that one doesn't work either. The central part, [^"'], doesn't
> allow a match of a ' if the starting delimiter was a ", and that doesn't match
> Roedy's spec. "John's restaurant" wouldn't be matched at all, because the
> matcher couldn't match past the ' to get to the ".
>
> I think the easiest is to write out a grammar for the expression, then
> translate to regex.
>
> QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING
>
> SQUOTED_STRING := ' NON_S_QUOTE + '
>
> DQUOTED_STRING := " NON_D_QUOTE + "
>
> NON_S_QUOTE := [^']
>
> NON_D_QUOTE := [^"]
>
> At this point the grammar is very clear. (Note I haven't included Robert's \x
> escape sequences.) I think it's worth learning to use antlr rather than regex,
> which tends to obfuscate more than it helps. However, a literal translation
> into regex isn't hard, and a literal translation avoids mis-optimizations.

Very illuminating. Thank you.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-05-25 15:03 -0700 |
| Message-ID | <9c1a694e-cfe8-4fed-bdb2-0550c5c2d288@googlegroups.com> |
| In reply to | #14801 |
On Friday, May 25, 2012 2:55:07 PM UTC-7, Lew wrote:
> Roedy Green wrote:
> > I often have to search for things of the form
> >
> > "xxxxx"
> > or
> > 'xxxxx'
> >
> > where xxx is anything not " or '. It might be Russian or English or
> > any other language.
> >
> > What is the cleanest way to do that?
>
> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

"([\"'])[^\"']+\\1"

That way you match the opening quote. (The extra backslashes are to
escape the characters in the string. Regex sees one fewer per each set.)

--
Lew
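A small sketch of the back-reference version in use. BackreferenceDemo and firstMatch are hypothetical names chosen for this example. The second call shows the limitation raised elsewhere in the thread: the middle class still excludes both quote characters, so an apostrophe inside a double-quoted run prevents any match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Regex as seen by the engine: (["'])[^"']+\1
    static final Pattern SAME_QUOTE = Pattern.compile("([\"'])[^\"']+\\1");

    // Hypothetical helper: the first match in the text, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = SAME_QUOTE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \1 must repeat whatever group 1 captured, so 'unbalanced" is skipped.
        System.out.println(firstMatch("'unbalanced\" 'ok'"));
        // No match at all: the apostrophe is excluded by [^"'].
        System.out.println(firstMatch("\"John's restaurant\""));
    }
}
```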
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2012-05-26 00:12 +0200 |
| Message-ID | <a2aeesF2s0U1@mid.individual.net> |
| In reply to | #14801 |
On 25.05.2012 23:55, Lew wrote:
> Roedy Green wrote:
>> I often have to search for things of the form
>>
>> "xxxxx"
>> or
>> 'xxxxx'
>>
>> where xxx is anything not " or '. It might be Russian or English or
>> any other language.
>>
>> What is the cleanest way to do that?
>
> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
That does not match quoting properly. Better do something like
"([\"'])[^\"']*\\1"
Still I prefer
"\"[^\"]*\"|'[^']*'"
Because it allows for quotes of the other type inside quotes.
With proper escaping (using \ as escape char, any other works, too) this
becomes
"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
Kind regards
robert
package rx;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Quotes {
    private static final Pattern Q1 = Pattern.compile("([\"'])[^\"']*\\1");
    private static final Pattern Q2 = Pattern.compile("\"[^\"]*\"|'[^']*'");
    private static final Pattern Q3 =
        Pattern.compile("\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'");

    public static void main(String[] args) {
        System.out.println(Q1);
        for (final Matcher m = Q1.matcher("'a' \"b\" 'c'"); m.find();) {
            System.out.println(m.group());
        }
        System.out.println(Q2);
        for (final Matcher m = Q2.matcher("'a' \"b\" 'c'"); m.find();) {
            System.out.println(m.group());
        }
        System.out.println(Q3);
        for (final Matcher m = Q3.matcher("'a' \"\\\"b\" 'c'"); m.find();) {
            System.out.println(m.group());
        }
    }
}
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
| From | markspace <-@.> |
|---|---|
| Date | 2012-05-25 18:43 -0700 |
| Message-ID | <jppch0$f18$1@dont-email.me> |
| In reply to | #14804 |
On 5/25/2012 3:12 PM, Robert Klemme wrote:

> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"

This looks overly baroque to me. You don't need to escape \ single
quotes ' in a Java string, and I don't think you need to in a regex
either (although I didn't check that). I'm also not seeing the need for
the parenthesis around the character classes [] (but again, without
having tried it, I could be wrong).

And the dot . inside the parenthesis just looks wrong.

Great post overall though.
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2012-05-26 16:37 +0200 |
| Message-ID | <a2c84rF3pmU1@mid.individual.net> |
| In reply to | #14807 |
On 26.05.2012 03:43, markspace wrote:
> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>
>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>
>
> This looks overly baroque to me. You don't need to escape \ single
> quotes ' in a Java string,

I didn't.

> and I don't think you need to in a regex
> either (although I didn't check that).

There is also no regexp escaping of single quotes either. The only
regexp escaping you can see are the \\\\ which translate into \\ in the
string which is a literal backslash for the regexp engine.

> I'm also not seeing the need for
> the parenthesis around the character classes [] (but again, without
> having tried it, I could be wrong).

It's not parenthesis around character classes but around the alternative
of "match a backslash followed by any char" and "any char which is not
backslash or the opening quote type of this string variant".

> And the dot . inside the parenthesis just looks wrong.

It isn't - see above.

> Great post overall though.

Thank you! It does seem to need some time to sink in though... :-)

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
| From | markspace <-@.> |
|---|---|
| Date | 2012-05-26 08:06 -0700 |
| Message-ID | <jpqri7$cl5$1@dont-email.me> |
| In reply to | #14813 |
On 5/26/2012 7:37 AM, Robert Klemme wrote:
> On 26.05.2012 03:43, markspace wrote:
>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>>
>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>>
...
>> and I don't think you need to in a regex
>> either (although I didn't check that).
>
> There is also no regexp escaping of single quotes either. The only
> regexp escaping you can see are the \\\\ which translate into \\ in the
> string which is a literal backslash for the regexp engine.

Yes, there is, although I think it's a typo. Both \\\" and \\' get
passed to the regex as \" and \', which means just a single character "
and ' respectively.

You're right about the rest of it though. With so many \'s floating
around, I have a hard time reading Java regex!

> It's not parenthesis around character classes but around the alternative
> of "match a backslash followed by any char" and "any char which is not
> backslash or the opening quote type of this string variant".

Yup, I totally missed this too. Thanks for pointing it out.
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2012-05-26 17:34 +0200 |
| Message-ID | <a2cbh0Fbj4U1@mid.individual.net> |
| In reply to | #14815 |
On 26.05.2012 17:06, markspace wrote:
> On 5/26/2012 7:37 AM, Robert Klemme wrote:
>> On 26.05.2012 03:43, markspace wrote:
>>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>>>
>>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>>>
> ...
>>> and I don't think you need to in a regex
>>> either (although I didn't check that).
>>
>> There is also no regexp escaping of single quotes either. The only
>> regexp escaping you can see are the \\\\ which translate into \\ in the
>> string which is a literal backslash for the regexp engine.
>
>
> Yes, there is, although I think it's a typo. Both \\\" and \\' get
> passed to the regex as \" and \', which means just a single character "
> and ' respectively.

Right you are - both times: there is regexp escaping and it was in fact
a typo (missing \\)!

> You're right about the rest of it though. With so many \'s floating
> around, I have a hard time reading Java regex!

That's true for other languages as well - the basic reason is that the
same character is used for

- escaping in strings
- escaping in backslashes
- escaping in the source text (in this case we could pick another
  character)

>> It's not parenthesis around character classes but around the alternative
>> of "match a backslash followed by any char" and "any char which is not
>> backslash or the opening quote type of this string variant".
>
>
> Yup, I totally missed this too. Thanks for pointing it out.

You're welcome! Thank you again for finding the missing escape.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
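For reference, here is a sketch of the escape-aware pattern with the missing backslash restored inside both negated character classes, as agreed above. The class name EscapedQuoteDemo and the helper firstMatch are assumptions made for this example; the pattern itself is the one under discussion with \\\\ added before the quote in each class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteDemo {
    // Regex as seen by the engine: "(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'
    // Each \\\\ in the Java literal becomes \\ for the engine, i.e. one literal backslash.
    static final Pattern ESCAPED =
        Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|'(?:\\\\.|[^\\\\'])*'");

    // Hypothetical helper: the first match in the text, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = ESCAPED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The backslash-escaped apostrophe is consumed by \\. and does not end the string.
        System.out.println(firstMatch("say 'Bob\\'s your uncle.' now"));
    }
}
```

With the backslash excluded from the character class, a backslash can only ever be consumed together with the character that follows it, which is what makes the escape handling unambiguous.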
| From | Peter Duniho <NpOeStPeAdM@NnOwSlPiAnMk.com> |
|---|---|
| Date | 2012-05-26 10:07 -0700 |
| Message-ID | <rempy0kwlswk.15pwi33dd9ku2$.dlg@40tude.net> |
| In reply to | #14817 |
On Sat, 26 May 2012 17:34:49 +0200, Robert Klemme wrote:

> [...]
>> You're right about the rest of it though. With so many \'s floating
>> around, I have a hard time reading Java regex!
>
> That's true for other languages as well

Not C#, which allows string literals to be prefaced with the @ symbol to
disable compiler escaping.

In fact, I'll bet C# wasn't the first language to have such a feature.
Surely there are many other languages that also avoid the issue.
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-05-26 06:19 -0700 |
| Message-ID | <6sl1s7dpqhg4l0gfa5duva3j8m9rf9opr5@4ax.com> |
| In reply to | #14804 |
On Sat, 26 May 2012 00:12:34 +0200, Robert Klemme
<shortcutter@googlemail.com> wrote, quoted or indirectly quoted
someone who said :
>On 25.05.2012 23:55, Lew wrote:
>> Roedy Green wrote:
>>> I often have to search for things of the form
>>>
>>> "xxxxx"
>>> or
>>> 'xxxxx'
>>>
>>> where xxx is anything not " or '. It might be Russian or English or
>>> any other language.
/*
 * [TestRegexFindQuotedString.java]
 *
 * Summary: Finding a quoted String with a regex.
 *
 * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2012-05-25 initial release
 */
package com.mindprod.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
 * Finding a quoted String with a regex.
 *
 * @author Roedy Green, Canadian Mind Products
 * @version 1.0 2012-05-25 initial release
 * @since 2012-05-25
 */
public class TestRegexFindQuotedString
{
    // ------------------------------ CONSTANTS ------------------------------

    private static final String lookIn = "George said \"that's the ticket\"." +
                                         " Jeb replied '\"ticket?\" what ticket'." +
                                         " \"How na\u00efve!\"." +
                                         " empty: \"\"" +
                                         " 'unbalanced\"";

    // -------------------------- STATIC METHODS --------------------------

    /**
     * exercise that pattern to see what it can find
     */
    static void exercisePattern( Pattern pattern )
    {
        out.println();
        out.println( "Pattern: " + pattern.toString() );
        // Matchers are used both for matching and finding.
        final Matcher m = pattern.matcher( lookIn );
        while ( m.find() )
        {
            out.println( m.group( 0 ) );
        }
    }

    // --------------------------- main() method ---------------------------

    /**
     * test harness
     *
     * @param args not used
     */
    public static void main( String[] args )
    {
        // We want to find Strings of the form "xx'xx" or 'xx"xx'
        // We want to avoid the following problems:
        // 1. Works even if String contains foreign languages, even Russian or accented letters.
        // 2. If starts with " must end with ", if starts with ' must end with '.
        // 3. ' is ok inside "...", and " is ok inside '...'
        // 4. We don't worry about how to use ' inside '...'.
        // here are some suggested techniques:
        exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3
        exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3
        exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.
        exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.
        exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.
        exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
        // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
    }
}
--
Roedy Green Canadian Mind Products
http://mindprod.com
I would be quite surprised if the NSA (National Security Agency)
did not have a computer program to scan bits of shredded
documents and electronically put them back together like a giant
jigsaw puzzle. This suggests you cannot just shred, you must also burn.
.
| From | markspace <-@.> |
|---|---|
| Date | 2012-05-26 07:19 -0700 |
| Message-ID | <jpqop1$s7h$1@dont-email.me> |
| In reply to | #14810 |
On 5/26/2012 6:19 AM, Roedy Green wrote:

> exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); //
> works, rejects empty strings by Mark Space.

If you want it to accept empty strings, replace the +'s with *'s. You
didn't specify empty strings in your original problem statement, so I
decided to disallow them.

Thanks for posting that SSCCE, btw. I was too lazy to cook one up.
[toc] | [prev] | [next] | [standalone]
| From | markspace <-@.> |
|---|---|
| Date | 2012-05-26 07:57 -0700 |
| Message-ID | <jpqr04$94d$1@dont-email.me> |
| In reply to | #14810 |
On 5/26/2012 6:19 AM, Roedy Green wrote:
> exercisePattern( Pattern.compile(
> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
> empty strings
> // (?: ) is a non-capturing group. This is Robert Klemme's
> contribution. I don't understand how it works.
Ah, OK, so here's my contribution to your excellent SSCCE. First this
pattern is basically the same as mine. It uses alternation (the
vertical bar |) to pick a string delimited by either ' or "
Here's his regex string without the extra escapes for Java:
"(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
^^^^^^^^^^^^^^^^
Let's look at just the first half for a moment, without the (?:\\. part.
"[^\"]*"
^^^^^^^^
12     3
Example for the first part:
1. " string starts with double quote
2. [^\"]* doesn't contain a "
3. " ends with double quote
Same for the second half of the string.
Notice he's using * instead of +'s, which is why his matches 0 width
strings.
The other part didn't appear in your problem statement, but in HTML/XML
it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
inclusion is very reasonable.
So Robert adds (\\.|[^\"])* to the first part, which is
               12 345     6
1. Start a group
2. A backslash. It needs to be escaped for regex, hence \\.
3. . is regex "any character". 2 and 3 together mean "match \ followed
by any character"
4. OR (alternation again)
5. character class, negated (the ^), matches anything except \ or ". I
think this is a mistake: the \ needs to be quoted.
6. zero or more.
Then after that mess, he does the obvious thing and adds non-capturing
group, to make the regex do a little less work.
"(?:\\.|[^\"])*"
Phew! Next, he adds one alternation and does the same for a ' delimited
string.
|'(?:\\.|[^\'])*'
Same thing, just ' instead of ".
Finally I think this could be simplified slightly with Lew's
back-reference idea.
(['"])(?:\\.|[^\1\\])*
(Untested.) This allows empty strings between delimiters; instead of a
* use + for only non-empty strings between the quotes.
My executive summary:
Regex is a great rapid development tool, except when it isn't. You
realize your problem is simple, and you could have hand-coded a parser
to do this much quicker than all these news post exchanges?
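For comparison, a hand-coded scanner along those lines might look like the sketch below. It is not from the thread: the class QuoteScanner and the method findQuoted are hypothetical, and it deliberately ignores backslash escapes, so it is meant to behave like the plain alternation "[^"]*"|'[^']*' rather than the escape-aware pattern.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteScanner {
    // Hypothetical hand-written alternative to the regex; no escape handling.
    static List<String> findQuoted(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '"' || c == '\'') {
                // Look for the next occurrence of the same delimiter.
                int close = text.indexOf(c, i + 1);
                if (close < 0) {
                    i++; // unterminated quote: skip this character and keep scanning
                    continue;
                }
                hits.add(text.substring(i, close + 1));
                i = close + 1;
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String lookIn = "George said \"that's the ticket\". empty: \"\" 'unbalanced\"";
        for (String hit : findQuoted(lookIn)) {
            System.out.println(hit);
        }
    }
}
```

The loop is longer than the regex but has no escaping layers to misread, which is the trade-off being argued about.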
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2012-05-26 17:13 +0200 |
| Message-ID | <a2ca88Ft90U1@mid.individual.net> |
| In reply to | #14814 |
On 26.05.2012 16:57, markspace wrote:
> On 5/26/2012 6:19 AM, Roedy Green wrote:
>
>> exercisePattern( Pattern.compile(
>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
>> empty strings
>> // (?: ) is a non-capturing group. This is Robert Klemme's
>> contribution. I don't understand how it works.
>
>
> Ah, OK, so here's my contribution to your excellent SSCCE. First this
> pattern is basically the same as mine. It uses alternation (the vertical
> bar |) to pick a string delimited by either ' or "
>
> Here's his regex string without the extra escapes for Java:
>
> "(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
> ^^^^^^^^^^^^^^^^
>
> Let's look at just the first half for a moment, without the (?:\\. part.
>
> "[^\"]*"
> ^^^^^^^^
> 12     3
> Example for the first part:
> 1. " string starts with double quote
> 2. [^\"]* doesn't contain a "
> 3. " ends with double quote
>
> Same for the second half of the string.
>
> Notice he's using * instead of +'s, which is why his matches 0 width
> strings.
>
> The other part didn't appear in your problem statement, but in HTML/XML
> it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
> inclusion is very reasonable.
>
> So Robert adds (\\.|[^\"])* to the first part, which is
>                12 345     6
>
> 1. Start a group
> 2. A backslash. It needs to be escaped for regex, hence \\.
> 3. . is regex "any character". 2 and 3 together mean "match \ followed
> by any character"
> 4. OR (alternation again)
> 5. character class, negated (the ^), matches anything except \ or ". I
> think this is a mistake: the \ needs to be quoted.
Oh, right, thanks for finding that!
> 6. zero or more.
>
> Then after that mess, he does the obvious thing and adds non-capturing
> group, to make the regex do a little less work.
>
> "(?:\\.|[^\"])*"
>
> Phew! Next, he adds one alternation and does the same for a ' delimited
> string.
>
> |'(?:\\.|[^\'])*'
>
> Same thing, just ' instead of ".
>
> Finally I think this could be simplified slightly with Lew's
> back-reference idea.
>
> (['"])(?:\\.|[^\1\\])*
>
> (Untested.) This allows empty strings between delimiters; instead of a *
> use + for only non-empty strings between the quotes.
Interesting approach - but it doesn't work. Simple test with
Pattern.compile("(.)[a\\1]"):
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 6
(.)[a\1]
      ^
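The same check can be wrapped in a tiny program. This is a sketch: BackrefInClassDemo and compiles are hypothetical names, and the behaviour shown is what the exception above reports for java.util.regex; other regex engines may treat \1 inside a class differently.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BackrefInClassDemo {
    // Hypothetical helper: true if java.util.regex accepts the expression.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Back-reference inside a character class: rejected at compile time.
        System.out.println(compiles("(.)[a\\1]"));
        // Back-reference inside a negative lookahead: accepted.
        System.out.println(compiles("(.)(?!\\1)."));
    }
}
```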
> My executive summary:
>
> Regex is a great rapid development tool, except when it isn't. You
> realize your problem is simple, and you could have hand-coded a parser
> to do this much quicker than all these news post exchanges?
Maybe, maybe not.
Kind regards
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
| From | markspace <-@.> |
|---|---|
| Date | 2012-05-26 10:08 -0700 |
| Message-ID | <jpr2nb$pbb$1@dont-email.me> |
| In reply to | #14816 |
On 5/26/2012 8:13 AM, Robert Klemme wrote:
> On 26.05.2012 16:57, markspace wrote:
>> Finally I think this could be simplified slightly with Lew's
>> back-reference idea.
>>
>> (['"])(?:\\.|[^\1\\])*
>>
>> (Untested.) This allows empty strings between delimiters; instead of a *
>> use + for only non-empty strings between the quotes.
>
> Interesting approach - but it doesn't work. Simple test with
> Pattern.compile("(.)[a\\1]"):
>
> Exception in thread "main" java.util.regex.PatternSyntaxException:
> Illegal/unsupported escape sequence near index 6
> (.)[a\1]
> ^
Yup, [] is for characters, and \1 could be a string. Gets rejected. I
think you could use "negative lookahead" to say "not this string" when
parsing. Gets kinda ugly though.
<http://www.regular-expressions.info/conditional.html>
Java:
"(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"
Regex:
(['"])(?:\\.|(?!\1|\\).)+\1
I re-did Roedy's test program to be a bit more clear about what it was
looking for, and the results. This could be even cleaner if it was run
with a JUnit test harness.
At this point though the regex is basically just a mess. Download antlr
and get an XML/HTML grammar from online.
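A short, self-contained sketch of the lookahead pattern on its own, separate from the full test program. The class LookaheadQuoteDemo and the helper firstMatch are made-up names; the pattern is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadQuoteDemo {
    // Regex as seen by the engine: (['"])(?:\\.|(?!\1|\\).)+\1
    // (?!\1|\\). means: any character that is neither the opening quote nor a backslash.
    static final Pattern QUOTED =
        Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");

    // Hypothetical helper: the first match in the text, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("escaped: 'Bob\\'s your uncle.'"));
        System.out.println(firstMatch("Basic: George said \"that's the ticket\"."));
    }
}
```

Because of the + quantifier this version rejects empty strings; swapping it for * would accept them.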
package quicktest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
 *
 * @author Brenden
 */
public class MindProdRegex {
}

/*
 * [TestRegexFindQuotedString.java]
 *
 * Summary: Finding a quoted String with a regex.
 *
 * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2012-05-25 initial release
 */

/**
 * Finding a quoted String with a regex.
 *
 * @author Roedy Green, Canadian Mind Products
 * @version 1.0 2012-05-25 initial release
 * @since 2012-05-25
 */
class TestRegexFindQuotedString
{
    // ------------------------------ CONSTANTS ------------------------------

    // Pairs of (text to search, expected first match); "" means no match is expected.
    private static final String[] vectors =
    {   "Basic: George said \"that's the ticket\".",
        "\"that's the ticket\"",
        "Nested: Jeb replied '\"ticket?\" what ticket'.",
        "'\"ticket?\" what ticket'",
        "Non-ASCII: \"How na\u00efve!\".",
        "\"How na\u00efve!\"",
        " empty: \"\"xx",
        "\"\"",
        " escaped: 'Bob\\'s your uncle.'",
        "'Bob\\'s your uncle.'",
        " 'unbalanced\"",
        "",
    };

    // -------------------------- STATIC METHODS --------------------------

    /**
     * exercise that pattern to see what it can find
     */
    static void exercisePattern( Pattern pattern )
    {
        out.println();
        out.println( "Pattern: " + pattern.toString() );
        for( int i = 0; i < vectors.length; i+=2 ) {
            String test = vectors[i];
            String result = vectors[i+1];
            final Matcher m = pattern.matcher( test );
            boolean found = m.find();
            // Finding nothing is the correct outcome when no match is expected.
            boolean correct = !found && result.isEmpty();
            String groupString = null;
            if( found ) {
                groupString = m.group();
                correct = groupString.equals( result );
            }
            out.println( test+", found: "+ found +
                    ", correct: "+correct+" ("+groupString+")");
        }
    }

    // --------------------------- main() method ---------------------------

    /**
     * test harness
     *
     * @param args not used
     */
    public static void main( String[] args )
    {
        // We want to find Strings of the form "xx'xx" or 'xx"xx'
        // We want to avoid the following problems:
        // 1. Works even if String contains foreign languages, even Russian or accented letters.
        // 2. If starts with " must end with ", if starts with ' must end with '.
        // 3. ' is ok inside "...", and " is ok inside '...'
        // 4. We don't worry about how to use ' inside '...'.
        // here are some suggested techniques:
        exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3
        exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3
        exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.
        exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.
        exercisePattern( Pattern.compile( "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) ); // works, rejects empty strings by Mark Space.
        exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.
        exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
        // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
    }
}
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-05-26 14:14 -0700 |
| Message-ID | <6kh2s7hbkam8ojb3mq9onmflikn12ibh85@4ax.com> |
| In reply to | #14819 |
On Sat, 26 May 2012 10:08:58 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>I re-did Roedy's test program to be a bit more clear about what it was
>looking for, and the results. This could be even cleaner if it was run
>with a JUnit test harness.

Thanks Brendan. I have incorporated your suggestions plus a bit more
polishing. See http://mindprod.com/jgloss/regex.html#FINDQUOTED for a
formatted listing + output.

The next task, probably procrastinated, is to solve it with a little
finite state automaton that decodes \x as well, and a simpler version
without. If a newbie is interested in tackling that, they can look at my
Java snippet parser as part of JPrep/JDisplay and strip it down.
--
Roedy Green Canadian Mind Products
http://mindprod.com
I would be quite surprised if the NSA (National Security Agency)
did not have a computer program to scan bits of shredded
documents and electronically put them back together like a giant
jigsaw puzzle. This suggests you cannot just shred, you must also burn.
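A rough sketch of what such an automaton could look like. It is not the JPrep/JDisplay parser mentioned above: the class QuoteAutomaton, its three states, and the decision to return the decoded body without the surrounding quotes are all assumptions made for this example, and \x is decoded simply as x for any character x.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAutomaton {
    private enum State { OUTSIDE, INSIDE, ESCAPE }

    // Returns the decoded body of each quoted string: quotes stripped, \x replaced by x.
    static List<String> findQuoted(String text) {
        List<String> hits = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        State state = State.OUTSIDE;
        char delimiter = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (state) {
                case OUTSIDE:
                    if (c == '"' || c == '\'') {
                        delimiter = c;      // remember which quote opened the string
                        body.setLength(0);
                        state = State.INSIDE;
                    }
                    break;
                case INSIDE:
                    if (c == '\\') {
                        state = State.ESCAPE;
                    } else if (c == delimiter) {
                        hits.add(body.toString());
                        state = State.OUTSIDE;
                    } else {
                        body.append(c);
                    }
                    break;
                case ESCAPE:
                    body.append(c);         // take the escaped character literally
                    state = State.INSIDE;
                    break;
            }
        }
        return hits; // an unterminated quote at end of input is silently dropped
    }

    public static void main(String[] args) {
        for (String hit : findQuoted("'Bob\\'s your uncle.' and \"a \\\"b\\\"\"")) {
            System.out.println(hit);
        }
    }
}
```

Unlike the regex versions, this automaton never backtracks: once a quote is opened, everything up to the matching delimiter (or the end of input) belongs to that string, so an unbalanced opening quote swallows the rest of the text. The simpler version without \x decoding would drop the ESCAPE state.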