| Started by | Sebastian <news@seyweiler.dyndns.org> |
|---|---|
| First post | 2013-01-30 10:34 +0100 |
| Last post | 2013-02-02 00:08 +0100 |
| Articles | 19 — 9 participants |
Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-01-30 10:34 +0100
Re: Regex: Any character in character class Mikhail Vladimirov <vladimirow@mail.ru> - 2013-01-30 02:05 -0800
Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-01-30 22:26 -0500
Re: Regex: Any character in character class Mikhail Vladimirov <vladimirow@mail.ru> - 2013-01-30 02:07 -0800
Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-01-30 22:27 -0500
Re: Regex: Any character in character class Arved Sandstrom <asandstrom2@eastlink.ca> - 2013-02-01 05:35 -0400
Re: Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-02-01 21:14 +0100
Re: Regex: Any character in character class Lew <lewbloch@gmail.com> - 2013-02-01 12:54 -0800
Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-01 16:47 -0500
Re: Regex: Any character in character class markspace <markspace@nospam.nospam> - 2013-02-01 14:06 -0800
Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-01 17:13 -0500
Re: Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-02-02 20:45 +0100
Re: Regex: Any character in character class markspace <markspace@nospam.nospam> - 2013-02-02 12:20 -0800
Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-02 16:03 -0500
Re: Regex: Any character in character class Lew <lewbloch@gmail.com> - 2013-02-02 13:23 -0800
Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-02 20:18 -0500
Re: Regex: Any character in character class Gene Wirchenko <genew@telus.net> - 2013-02-04 14:26 -0800
Re: Regex: Any character in character class Martin Gregorie <martin@address-in-sig.invalid> - 2013-02-05 00:03 +0000
Re: Regex: Any character in character class Robert Klemme <shortcutter@googlemail.com> - 2013-02-02 00:08 +0100
| From | Sebastian <news@seyweiler.dyndns.org> |
|---|---|
| Date | 2013-01-30 10:34 +0100 |
| Subject | Regex: Any character in character class |
| Message-ID | <keapdj$aqh$1@news.albasani.net> |
I want to match any sequence of characters, including line breaks, in a suffix of a multi-line string.
I do not want to use Pattern.DOTALL, because line breaks are not permissible everywhere. I cannot write [.]* because dot loses its special meaning inside a character class.
I have come up with [\S\s]* as meaning any sequence of non-whitespace or whitespace (incl. line-breaks). Is there a better way?
-- Sebastian
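The behaviour described in the question can be checked with a small sketch. The class name `AnyCharDemo` and the helper `fullMatch` are made up for illustration; only `java.util.regex.Pattern` comes from the JDK.

```java
import java.util.regex.Pattern;

class AnyCharDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "first\nsecond";
        // Without DOTALL the dot does not match the line break.
        System.out.println(fullMatch(".*", twoLines));        // false
        // Inside a character class the dot is a literal dot.
        System.out.println(fullMatch("[.]*", twoLines));      // false
        System.out.println(fullMatch("[.]*", "..."));         // true
        // Whitespace or non-whitespace covers every character.
        System.out.println(fullMatch("[\\s\\S]*", twoLines)); // true
    }
}
```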
| From | Mikhail Vladimirov <vladimirow@mail.ru> |
|---|---|
| Date | 2013-01-30 02:05 -0800 |
| Message-ID | <bf443cc2-1b97-4e28-8dce-60690fe1905a@googlegroups.com> |
| In reply to | #21862 |
What about [^]?
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2013-01-30 22:26 -0500 |
| Message-ID | <5109e465$0$295$14726298@news.sunsite.dk> |
| In reply to | #21864 |
On 1/30/2013 5:05 AM, Mikhail Vladimirov wrote:
> What about [^]?
java.util.regex.PatternSyntaxException
Arne
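The exception reported above can be reproduced with a short sketch. The class and method names are hypothetical; the point is only that `Pattern.compile` rejects `[^]`, even though some other regex dialects (JavaScript, for one) accept it as "any character".

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EmptyNegatedClassDemo {
    // Hypothetical helper: reports whether java.util.regex accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("[^]     compiles: " + compiles("[^]"));
        System.out.println("[\\s\\S]  compiles: " + compiles("[\\s\\S]"));
    }
}
```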
| From | Mikhail Vladimirov <vladimirow@mail.ru> |
|---|---|
| Date | 2013-01-30 02:07 -0800 |
| Message-ID | <274bdf05-9273-44f9-aacd-1aee5fdf91dc@googlegroups.com> |
| In reply to | #21862 |
Another option is .|\n
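One caveat with `.|\n`, sketched below under the assumption that the input may contain Windows-style line endings: by default the dot also refuses to match `\r` (and a few other line terminators), so the alternation covers `\n` but not `\r\n`. The class and helper names are made up for illustration.

```java
import java.util.regex.Pattern;

class DotOrNewlineDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unix line break: the alternation is enough.
        System.out.println(fullMatch("(?:.|\\n)*", "a\nb"));   // true
        // Windows line break: neither branch matches the carriage return.
        System.out.println(fullMatch("(?:.|\\n)*", "a\r\nb")); // false
        // The character-class approach accepts both.
        System.out.println(fullMatch("[\\s\\S]*", "a\r\nb"));  // true
    }
}
```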
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2013-01-30 22:27 -0500 |
| Message-ID | <5109e49b$0$295$14726298@news.sunsite.dk> |
| In reply to | #21862 |
On 1/30/2013 4:34 AM, Sebastian wrote:
> I want to match any sequence of characters, including line breaks, in a
> suffix of a multi-line string.
>
> I do not want to use Pattern.DOTALL, because line breaks are not
> permissible everywhere. I cannot write [.]* because dot loses its
> special meaning inside a character class.
>
> I have come up with [\S\s]*
> as meaning any sequence of non-whitespace or whitespace (incl.
> line-breaks). Is there a better way?
Do you always want to accept line breaks or not? If not then when?
Arne
| From | Arved Sandstrom <asandstrom2@eastlink.ca> |
|---|---|
| Date | 2013-02-01 05:35 -0400 |
| Message-ID | <I%LOs.8145$Sq4.8135@newsfe14.iad> |
| In reply to | #21910 |
On 01/30/2013 11:27 PM, Arne Vajhøj wrote:
> On 1/30/2013 4:34 AM, Sebastian wrote:
>> I want to match any sequence of characters, including line breaks, in a
>> suffix of a multi-line string.
>>
>> I do not want to use Pattern.DOTALL, because line breaks are not
>> permissible everywhere. I cannot write [.]* because dot loses its
>> special meaning inside a character class.
>>
>> I have come up with [\S\s]*
>> as meaning any sequence of non-whitespace or whitespace (incl.
>> line-breaks). Is there a better way?
>
> Do you always want to accept line breaks or not? If not then when?
>
> Arne
Good question.
I take it the suffix is a generic last-N characters of the string (Assumption #1). I take it that line breaks are OK in the suffix, not necessarily so in the rest of the string (Assumption #2).
If you don't mind me asking, why don't you just grab the suffix, the last N characters, with substring()? That *is* your match.
AHS
| From | Sebastian <news@seyweiler.dyndns.org> |
|---|---|
| Date | 2013-02-01 21:14 +0100 |
| Message-ID | <keh7mp$pjm$1@news.albasani.net> |
| In reply to | #21910 |
On 31.01.2013 04:27, Arne Vajhøj wrote:
> On 1/30/2013 4:34 AM, Sebastian wrote:
>> I want to match any sequence of characters, including line breaks, in a
>> suffix of a multi-line string.
>>
>> I do not want to use Pattern.DOTALL, because line breaks are not
>> permissible everywhere. I cannot write [.]* because dot loses its
>> special meaning inside a character class.
>>
>> I have come up with [\S\s]*
>> as meaning any sequence of non-whitespace or whitespace (incl.
>> line-breaks). Is there a better way?
>
> Do you always want to accept line breaks or not? If not then when?
>
> Arne
The string I want to match basically has two parts (a "protocol" and a "selection expression"). I want to allow line breaks anywhere in the selection expression, but not in the protocol.
-- S.
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2013-02-01 12:54 -0800 |
| Message-ID | <9463f7ba-413e-4604-bbf0-a5a7f5c9984d@googlegroups.com> |
| In reply to | #21972 |
Sebastian wrote:
> the string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.
How do you tell which part is which?
--
Lew
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2013-02-01 16:47 -0500 |
| Message-ID | <510c380b$0$287$14726298@news.sunsite.dk> |
| In reply to | #21972 |
On 2/1/2013 3:14 PM, Sebastian wrote:
> On 31.01.2013 04:27, Arne Vajhøj wrote:
>> On 1/30/2013 4:34 AM, Sebastian wrote:
>>> I want to match any sequence of characters, including line breaks, in a
>>> suffix of a multi-line string.
>>>
>>> I do not want to use Pattern.DOTALL, because line breaks are not
>>> permissible everywhere. I cannot write [.]* because dot loses its
>>> special meaning inside a character class.
>>>
>>> I have come up with [\S\s]*
>>> as meaning any sequence of non-whitespace or whitespace (incl.
>>> line-breaks). Is there a better way?
>>
>> Do you always want to accept line breaks or not? If not then when?
> the string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.
Do you have a separator between the two parts, like the colon in URLs?
If yes then something like:
[.]+:[.|\n]+
Arne
| From | markspace <markspace@nospam.nospam> |
|---|---|
| Date | 2013-02-01 14:06 -0800 |
| Message-ID | <kehe9j$otg$1@dont-email.me> |
| In reply to | #21974 |
On 2/1/2013 1:47 PM, Arne Vajhøj wrote:
> [.]+:[.|\n]+
Watch out for this. +, being greedy, will match a : in the selection expression (the 2nd part) if : is allowed in the second part.
The reluctant modifier might be a better idea here:
.+?:[.|\n]+
Note that I don't think the initial brackets [] were needed. Also we're yet again starting to see the problem with regex: it always evolves into something that looks like your cat walked across the keyboard.
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2013-02-01 17:13 -0500 |
| Message-ID | <510c3e29$0$287$14726298@news.sunsite.dk> |
| In reply to | #21977 |
On 2/1/2013 5:06 PM, markspace wrote:
> On 2/1/2013 1:47 PM, Arne Vajhøj wrote:
>
>> [.]+:[.|\n]+
>
> Watch out for this. +, being greedy, will match a : in the selection
> expression (the 2nd part) if : is allowed in the second part.
>
> The reluctant modifier might be a better idea here:
>
> .+?:[.|\n]+
>
> Note that I don't think the initial brackets [] were needed. Also we're
> yet again starting to see the problem with regex: it always evolves into
> something that looks like your cat walked across the keyboard.
You are absolutely right.
Non greedy.
No square brackets for first part.
And also round brackets for the last part.
.+?:(.|\n)+
I think I must have set a new world record. 3 bugs in 12 characters.
:-(
Arne
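To make the three fixes concrete, here is a rough sketch comparing the attempts on one input. The capturing group around the first part, the sample string and the names `GreedyVsReluctantDemo` and `firstGroup` are all additions for illustration and are not part of the patterns posted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Hypothetical helper: group 1 of a full match, or null on mismatch.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "proto:sel:a\nb";
        // Original attempt: [.] only matches literal dots, so nothing matches.
        System.out.println(firstGroup("([.]+):[.|\\n]+", input));  // null
        // Greedy: the first part swallows everything up to the last usable colon.
        System.out.println(firstGroup("(.+):(?:.|\\n)+", input));  // proto:sel
        // Reluctant: the first part stops at the first colon.
        System.out.println(firstGroup("(.+?):(?:.|\\n)+", input)); // proto
    }
}
```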
| From | Sebastian <news@seyweiler.dyndns.org> |
|---|---|
| Date | 2013-02-02 20:45 +0100 |
| Message-ID | <kejqcn$lnc$1@news.albasani.net> |
| In reply to | #21978 |
On 01.02.2013 23:13, Arne Vajhøj wrote:
[snip]
> And also round brackets for the last part.
>
> .+?:(.|\n)+
>
> I think I must have set a new world record. 3 bugs in 12 characters.
>
> :-(
>
> Arne
Here's a concrete example:
SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
company:company]
The second part is everything after the first comma. I was using
(.+?),[\s\S]+
Arne's suggestion modified for my needs (comma as separator, and I only want to capture the first part as a group) will work fine as well:
(.+?),(?:.|\n)+
Can't say though that I find anything to prefer the one to the other. Perhaps the second looks even more like the result of a cat walk...
-- Sebastian
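Both patterns can be tried against the example with a sketch like the following. It assumes the line break in the example falls after `bukrs:bukrs,` and is a plain `\n`; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProtocolPrefixDemo {
    // Assumption: the example contains a single \n after "bukrs:bukrs,".
    static final String INPUT =
        "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,\ncompany:company]";

    // Hypothetical helper: the captured protocol part, or null on mismatch.
    static String protocol(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(protocol("(.+?),[\\s\\S]+"));   // SCA:LIST
        System.out.println(protocol("(.+?),(?:.|\\n)+"));  // SCA:LIST
    }
}
```

With `\r\n` line endings only the first pattern would still match, since the dot does not match `\r` by default.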
| From | markspace <markspace@nospam.nospam> |
|---|---|
| Date | 2013-02-02 12:20 -0800 |
| Message-ID | <kejsdv$ut0$1@dont-email.me> |
| In reply to | #22020 |
On 2/2/2013 11:45 AM, Sebastian wrote:
> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
> company:company]
For something this simple you might want to consider just String::split().
// needs: import java.util.Arrays;
String test =
    "SCA:LIST,select[werks_s:default_plant],values[bukrs:bukrs,company:company]";
// limit 2: split only at the first comma
String[] parse = test.split( ",\\s*", 2 );
System.out.println( Arrays.toString( parse ) );
This could be faster since the second half of the regex, (?:.|\n)+,
doesn't have to execute.
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2013-02-02 16:03 -0500 |
| Message-ID | <510d7f3c$0$289$14726298@news.sunsite.dk> |
| In reply to | #22020 |
On 2/2/2013 2:45 PM, Sebastian wrote:
> On 01.02.2013 23:13, Arne Vajhøj wrote:
> [snip]
>> And also round brackets for the last part.
>>
>> .+?:(.|\n)+
>>
>> I think I must have set a new world record. 3 bugs in 12 characters.
>>
>> :-(
>>
> Here's a concrete example:
>
> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
> company:company]
>
>
> The second part is everything after the first comma. I was using
> (.+?),[\s\S]+
>
> Arne's suggestion modified for my needs (comma as separator, and I only
> want to capture the first part as a group) will work fine as well:
> (.+?),(?:.|\n)+
>
> Can't say though that I find anything to prefer the one to the other.
> Perhaps the second looks even more like the result of a cat walk...
It is not unusual that there is more than one regex that does the job.
Arne
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2013-02-02 13:23 -0800 |
| Message-ID | <e7fe2ed3-a818-4d4d-8744-a4ddfcd71909@googlegroups.com> |
| In reply to | #22025 |
Arne Vajhøj wrote:
> Sebastian wrote:
>> Arne Vajhøj wrote:
>> [snip]
>>> And also round brackets for the last part.
>>>
>>> .+?:(.|\n)+
>>>
>>> I think I must have set a new world record. 3 bugs in 12 characters.
>
>>> :-(
>
>> Here's a concrete example:
>>
>> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
>> company:company]
>
>> The second part is everything after the first comma. I was using
You mean 'expression.substring(expression.indexOf(',') + 1)'?
(modulo the usual error checks, of course)
>> (.+?),[\s\S]+
>> Arne's suggestion modified for my needs (comma as separator, and I only
>> want to capture the first part as a group) will work fine as well:
You mean 'expression.substring(0, expression.indexOf(','))'?
>> (.+?),(?:.|\n)+
>
>> Can't say though that I find anything to prefer the one to the other.
>> Perhaps the second looks even more like the result of a cat walk...
If all you need to do is split a string on a comma, why use regexes at all?
> It is not unusual that there is more than one regex that
> does the job.
It is not unusual that there is more than one non-regex that does the job.
--
Lew
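The two `substring` calls, together with one possible reading of "the usual error checks", might look like the sketch below. Which checks are appropriate is an assumption here (a missing comma, an empty part, or a line break in the protocol are rejected); the class and method names are made up.

```java
import java.util.Arrays;

class CommaSplitDemo {
    // Hypothetical helper: returns {protocol, selection} split at the first comma.
    static String[] splitAtFirstComma(String expression) {
        int comma = expression.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("no comma separator found");
        }
        String protocol = expression.substring(0, comma);
        String selection = expression.substring(comma + 1);
        // Assumed checks: both parts non-empty, no line break in the protocol.
        if (protocol.isEmpty() || selection.isEmpty()) {
            throw new IllegalArgumentException("empty protocol or selection");
        }
        if (protocol.indexOf('\n') >= 0 || protocol.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("line break in protocol");
        }
        return new String[] { protocol, selection };
    }

    public static void main(String[] args) {
        String expression =
            "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,\ncompany:company]";
        System.out.println(Arrays.toString(splitAtFirstComma(expression)));
    }
}
```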
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2013-02-02 20:18 -0500 |
| Message-ID | <510dbb00$0$295$14726298@news.sunsite.dk> |
| In reply to | #22026 |
On 2/2/2013 4:23 PM, Lew wrote:
> Arne Vajhøj wrote:
>> Sebastian wrote:
>>> Can't say though that I find anything to prefer the one to the other.
>>> Perhaps the second looks even more like the result of a cat walk...
>
> If all you need to do is split a string on a comma, why use regexes at all?
>
>> It is not unusual that there is more than one regex that
>> does the job.
>
> It is not unusual that there is more than one non-regex that does the job.
True.
But less surprising.
Arne
| From | Gene Wirchenko <genew@telus.net> |
|---|---|
| Date | 2013-02-04 14:26 -0800 |
| Message-ID | <evc0h81jev28t7eiruno7iod56s13ln2vq@4ax.com> |
| In reply to | #21978 |
On Fri, 01 Feb 2013 17:13:54 -0500, Arne Vajhøj <arne@vajhoej.dk>
wrote:
[snip]
>I think I must have set a new world record. 3 bugs in 12 characters.
>
>:-(
I may be able to save your honour. <G>
IBM had bugs in a one-instruction program of two bytes long. The
program was IEFBR14, and you can read about it on Wikipedia. There
was a series of corrections which resulted in a program several times
larger.
Sincerely,
Gene Wirchenko
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2013-02-05 00:03 +0000 |
| Message-ID | <kepi95$g6u$1@localhost.localdomain> |
| In reply to | #22090 |
On Mon, 04 Feb 2013 14:26:50 -0800, Gene Wirchenko wrote:
> On Fri, 01 Feb 2013 17:13:54 -0500, Arne Vajhøj <arne@vajhoej.dk> wrote:
>
> [snip]
>
>>I think I must have set a new world record. 3 bugs in 12 characters.
>>
>>:-(
>
> I may be able to save your honour. <G>
>
> IBM had bugs in a one-instruction program of two bytes long. The
> program was IEFBR14, and you can read about it on Wikipedia. There was
> a series of corrections which resulted in a program several times
> larger.
>
Quite apart from the programmers needing a medal for the number of bugs
they managed to write, there would seem to be at least two extra prizes
to be awarded:
Parkinson Cup: for the greatest expansion of a program without adding
any functionality.
Obscurantist Medal: to the assembler designer for creating the most
obscure assembler syntax I've ever seen.
It easily beats Elliott 503 assembler (which had
no opcode mnemonics) and KDF6 assembler, which
didn't allow variable names.
Thank $DEITY I managed to avoid using an S/360 or its lineal descendants.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2013-02-02 00:08 +0100 |
| Message-ID | <an307dFaqqU1@mid.individual.net> |
| In reply to | #21972 |
On 01.02.2013 21:14, Sebastian wrote:
> On 31.01.2013 04:27, Arne Vajhøj wrote:
>> On 1/30/2013 4:34 AM, Sebastian wrote:
>>> I want to match any sequence of characters, including line breaks, in a
>>> suffix of a multi-line string.
>>>
>>> I do not want to use Pattern.DOTALL, because line breaks are not
>>> permissible everywhere. I cannot write [.]* because dot loses its
>>> special meaning inside a character class.
>>>
>>> I have come up with [\S\s]*
>>> as meaning any sequence of non-whitespace or whitespace (incl.
>>> line-breaks). Is there a better way?
Yes.
>> Do you always want to accept line breaks or not? If not then when?
> the string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.
Of course you can use DOTALL - as an embedded flag:
package rx;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Dotty {
    private static final Pattern PAT =
        Pattern.compile("proto.*(?s:sel.*)");
    public static void main(String[] args) {
        test("protoPselS");
        test("protoPPselS\nS");
        test("protoP\nPselS\nS");
    }
    public static void test(final CharSequence cs) {
        System.out.println("cs=\"" + cs + "\"");
        final Matcher m = PAT.matcher(cs);
        if (m.matches()) {
            System.out.println("Match: \"" + m.group() + "\"");
        } else {
            System.out.println("Mismatch");
        }
        System.out.println();
    }
}
Kind regards
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
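The embedded-flag idea carries over to the comma-separated format from earlier in the thread. The following is only a sketch under that assumption: the pattern `(.+?),(?s:.+)` and the names `EmbeddedFlagDemo` and `protocol` are not from any of the posts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedFlagDemo {
    // DOTALL is switched on only inside (?s:...), i.e. for the selection part.
    static final Pattern PAT = Pattern.compile("(.+?),(?s:.+)");

    // Hypothetical helper: the captured protocol part, or null on mismatch.
    static String protocol(String input) {
        Matcher m = PAT.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Line break in the selection expression: accepted.
        System.out.println(protocol("SCA:LIST, select[a:b],values[c:d,\r\ne:f]")); // SCA:LIST
        // Line break in the protocol: rejected.
        System.out.println(protocol("SCA\n:LIST, select[a:b]"));                   // null
    }
}
```

Because DOTALL makes the dot match every line terminator, this variant also copes with `\r\n`, which the `.|\n` alternation does not.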