Groups > comp.lang.java.programmer > #21862 > unrolled thread

Regex: Any character in character class

Started by	Sebastian <news@seyweiler.dyndns.org>
First post	2013-01-30 10:34 +0100
Last post	2013-02-02 00:08 +0100
Articles	19 — 9 participants


  Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-01-30 10:34 +0100
    Re: Regex: Any character in character class Mikhail Vladimirov <vladimirow@mail.ru> - 2013-01-30 02:05 -0800
      Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-01-30 22:26 -0500
    Re: Regex: Any character in character class Mikhail Vladimirov <vladimirow@mail.ru> - 2013-01-30 02:07 -0800
    Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-01-30 22:27 -0500
      Re: Regex: Any character in character class Arved Sandstrom <asandstrom2@eastlink.ca> - 2013-02-01 05:35 -0400
      Re: Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-02-01 21:14 +0100
        Re: Regex: Any character in character class Lew <lewbloch@gmail.com> - 2013-02-01 12:54 -0800
        Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-01 16:47 -0500
          Re: Regex: Any character in character class markspace <markspace@nospam.nospam> - 2013-02-01 14:06 -0800
            Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-01 17:13 -0500
              Re: Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-02-02 20:45 +0100
                Re: Regex: Any character in character class markspace <markspace@nospam.nospam> - 2013-02-02 12:20 -0800
                Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-02 16:03 -0500
                  Re: Regex: Any character in character class Lew <lewbloch@gmail.com> - 2013-02-02 13:23 -0800
                    Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-02 20:18 -0500
              Re: Regex: Any character in character class Gene Wirchenko <genew@telus.net> - 2013-02-04 14:26 -0800
                Re: Regex: Any character in character class Martin Gregorie <martin@address-in-sig.invalid> - 2013-02-05 00:03 +0000
        Re: Regex: Any character in character class Robert Klemme <shortcutter@googlemail.com> - 2013-02-02 00:08 +0100

#21862 — Regex: Any character in character class

From	Sebastian <news@seyweiler.dyndns.org>
Date	2013-01-30 10:34 +0100
Subject	Regex: Any character in character class
Message-ID	<keapdj$aqh$1@news.albasani.net>

I want to match any sequence of characters, including line breaks, in a 
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not 
permissible everywhere. I cannot write [.]* because dot loses its 
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl. 
line-breaks). Is there a better way?

-- Sebastian
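
The three behaviours described in this post can be checked with a small sketch. The class and method names (AnyCharDemo, fullMatch) are made up for illustration; only java.util.regex.Pattern is real API.

```java
import java.util.regex.Pattern;

public class AnyCharDemo {
    // Hypothetical helper: true if the WHOLE input matches the regex.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String text = "line one\nline two";
        // Without DOTALL, dot does not match the line break.
        System.out.println(fullMatch(".*", text));
        // Whitespace or non-whitespace covers every character.
        System.out.println(fullMatch("[\\s\\S]*", text));
        // Inside a character class, dot is a literal dot.
        System.out.println(fullMatch("[.]*", "..."));
        System.out.println(fullMatch("[.]*", "abc"));
    }
}
```

The main method prints false, true, true, false, in that order.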


#21864

From	Mikhail Vladimirov <vladimirow@mail.ru>
Date	2013-01-30 02:05 -0800
Message-ID	<bf443cc2-1b97-4e28-8dce-60690fe1905a@googlegroups.com>
In reply to	#21862

What about [^]?


#21909

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-01-30 22:26 -0500
Message-ID	<5109e465$0$295$14726298@news.sunsite.dk>
In reply to	#21864

On 1/30/2013 5:05 AM, Mikhail Vladimirov wrote:
> What about [^]?

java.util.regex.PatternSyntaxException

Arne
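
A minimal way to reproduce this, assuming nothing beyond the standard library. The class and method names (EmptyNegatedClass, compiles) are invented for the sketch.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class EmptyNegatedClass {
    // Hypothetical helper: reports whether java.util.regex accepts the pattern.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "[^]" works in some other regex dialects, but Java rejects it.
        System.out.println(compiles("[^]"));
        // The class from the original post compiles fine.
        System.out.println(compiles("[\\s\\S]"));
    }
}
```

This prints false for [^] and true for [\s\S].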


#21865

From	Mikhail Vladimirov <vladimirow@mail.ru>
Date	2013-01-30 02:07 -0800
Message-ID	<274bdf05-9273-44f9-aacd-1aee5fdf91dc@googlegroups.com>
In reply to	#21862

Another option is .|\n
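
One caveat worth checking before relying on this: in Java, dot (without DOTALL or UNIX_LINES) excludes several line terminators, not only \n, so .|\n still rejects a carriage return. A small sketch, with made-up class and method names:

```java
import java.util.regex.Pattern;

public class LineTerminators {
    // Hypothetical helper: true if the whole input matches the regex.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unix line break: matched by the \n alternative.
        System.out.println(fullMatch("(?:.|\\n)*", "a\nb"));
        // Windows line break: neither dot nor \n matches the \r.
        System.out.println(fullMatch("(?:.|\\n)*", "a\r\nb"));
        // [\s\S] has no such gap.
        System.out.println(fullMatch("[\\s\\S]*", "a\r\nb"));
    }
}
```

This prints true, false, true. If the input can contain \r\n line endings, [\s\S] or an embedded (?s:...) group is the safer choice.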


#21910

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-01-30 22:27 -0500
Message-ID	<5109e49b$0$295$14726298@news.sunsite.dk>
In reply to	#21862

On 1/30/2013 4:34 AM, Sebastian wrote:
> I want to match any sequence of characters, including line breaks, in a
> suffix of a multi-line string.
>
> I do not want to use Pattern.DOTALL, because line breaks are not
> permissible everywhere. I cannot write [.]* because dot loses its
> special meaning inside a character class.
>
> I have come up with [\S\s]*
> as meaning any sequence of non-whitespace or whitespace (incl.
> line-breaks). Is there a better way?

Do you always want to accept line breaks or not? If not then when?

Arne


#21951

From	Arved Sandstrom <asandstrom2@eastlink.ca>
Date	2013-02-01 05:35 -0400
Message-ID	<I%LOs.8145$Sq4.8135@newsfe14.iad>
In reply to	#21910

On 01/30/2013 11:27 PM, Arne Vajhøj wrote:
> On 1/30/2013 4:34 AM, Sebastian wrote:
>> I want to match any sequence of characters, including line breaks, in a
>> suffix of a multi-line string.
>>
>> I do not want to use Pattern.DOTALL, because line breaks are not
>> permissible everywhere. I cannot write [.]* because dot loses its
>> special meaning inside a character class.
>>
>> I have come up with [\S\s]*
>> as meaning any sequence of non-whitespace or whitespace (incl.
>> line-breaks). Is there a better way?
>
> Do you always want to accept line breaks or not? If not then when?
>
> Arne
>
>
Good question.

I take it the suffix is a generic last-N characters of the string 
(Assumption #1). I take it that line breaks are OK in the suffix, not 
necessarily so in the rest of the string (Assumption #2).

If you don't mind me asking, why don't you just grab the suffix, the 
last N characters, with substring()? That *is* your match.

AHS


#21972

From	Sebastian <news@seyweiler.dyndns.org>
Date	2013-02-01 21:14 +0100
Message-ID	<keh7mp$pjm$1@news.albasani.net>
In reply to	#21910

On 31.01.2013 04:27, Arne Vajhøj wrote:
> On 1/30/2013 4:34 AM, Sebastian wrote:
>> I want to match any sequence of characters, including line breaks, in a
>> suffix of a multi-line string.
>>
>> I do not want to use Pattern.DOTALL, because line breaks are not
>> permissible everywhere. I cannot write [.]* because dot loses its
>> special meaning inside a character class.
>>
>> I have come up with [\S\s]*
>> as meaning any sequence of non-whitespace or whitespace (incl.
>> line-breaks). Is there a better way?
>
> Do you always want to accept line breaks or not? If not then when?
>
> Arne
>
>
The string I want to match basically has two parts (a "protocol" and a
"selection expression"). I want to allow line breaks anywhere in the
selection expression, but not in the protocol.
-- S.


#21973

From	Lew <lewbloch@gmail.com>
Date	2013-02-01 12:54 -0800
Message-ID	<9463f7ba-413e-4604-bbf0-a5a7f5c9984d@googlegroups.com>
In reply to	#21972

Sebastian wrote:
> The string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

How do you tell which part is which?

-- 
Lew


#21974

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-02-01 16:47 -0500
Message-ID	<510c380b$0$287$14726298@news.sunsite.dk>
In reply to	#21972

On 2/1/2013 3:14 PM, Sebastian wrote:
> On 31.01.2013 04:27, Arne Vajhøj wrote:
>> On 1/30/2013 4:34 AM, Sebastian wrote:
>>> I want to match any sequence of characters, including line breaks, in a
>>> suffix of a multi-line string.
>>>
>>> I do not want to use Pattern.DOTALL, because line breaks are not
>>> permissible everywhere. I cannot write [.]* because dot loses its
>>> special meaning inside a character class.
>>>
>>> I have come up with [\S\s]*
>>> as meaning any sequence of non-whitespace or whitespace (incl.
>>> line-breaks). Is there a better way?
>>
>> Do you always want to accept line breaks or not? If not then when?

> The string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

Do you have a separator between the two parts like colon in URL's?

If yes then something like:

[.]+:[.|\n]+

Arne


#21977

From	markspace <markspace@nospam.nospam>
Date	2013-02-01 14:06 -0800
Message-ID	<kehe9j$otg$1@dont-email.me>
In reply to	#21974

On 2/1/2013 1:47 PM, Arne Vajhøj wrote:

> [.]+:[.|\n]+


Watch out for this.  +, being greedy, will match a : in the selection 
expression (the 2nd part) if : is allowed in the second part.

The reluctant modifier might be a better idea here:

.+?:[.|\n]+

Note that I don't think the initial brackets [] were needed.  Also we're 
yet again starting to see the problem with regex: it always evolves into 
something that looks like your cat walked across the keyboard.
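
The difference between the greedy and the reluctant quantifier shows up as soon as the second part contains a colon. A simplified sketch: the patterns use a plain .+ for the second part, and the class name, method name and sample input are assumptions made for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: returns capture group 1 if the whole input
    // matches, otherwise null.
    public static String firstPart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "SCA:LIST:extra";
        // Greedy: the first part runs up to the LAST colon.
        System.out.println(firstPart("(.+):.+", input));
        // Reluctant: the first part stops at the FIRST colon.
        System.out.println(firstPart("(.+?):.+", input));
    }
}
```

The greedy version prints SCA:LIST, the reluctant one prints SCA.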


#21978

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-02-01 17:13 -0500
Message-ID	<510c3e29$0$287$14726298@news.sunsite.dk>
In reply to	#21977

On 2/1/2013 5:06 PM, markspace wrote:
> On 2/1/2013 1:47 PM, Arne Vajhøj wrote:
>
>> [.]+:[.|\n]+
>
>
> Watch out for this.  +, being greedy, will match a : in the selection
> expression (the 2nd part) if : is allowed in the second part.
>
> The reluctant modifier might be a better idea here:
>
> .+?:[.|\n]+
>
> Note that I don't think the initial brackets [] were needed.  Also we're
> yet again starting to see the problem with regex: it always evolves into
> something that looks like your cat walked across the keyboard.

You are absolutely right.

Non greedy.

No square brackets for first part.

And also round brackets for the last part.

.+?:(.|\n)+

I think I must have set a new world record. 3 bugs in 12 characters.

:-(

Arne
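
For anyone wondering why the brackets matter: inside square brackets, dot and | are literal characters, so [.|\n] only accepts a dot, a vertical bar or a newline. A short sketch to confirm this, with invented class and method names:

```java
import java.util.regex.Pattern;

public class BracketBugs {
    // Hypothetical helper: true if the whole input matches the regex.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The character class matches only '.', '|' and newline ...
        System.out.println(fullMatch("[.|\\n]+", ".|\n"));
        // ... so ordinary text is rejected.
        System.out.println(fullMatch("[.|\\n]+", "abc"));
        // The original pattern therefore fails on a realistic input,
        System.out.println(fullMatch("[.]+:[.|\\n]+", "SCA:LIST\nmore"));
        // while the corrected one accepts it.
        System.out.println(fullMatch(".+?:(.|\\n)+", "SCA:LIST\nmore"));
    }
}
```

This prints true, false, false, true.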


#22020

From	Sebastian <news@seyweiler.dyndns.org>
Date	2013-02-02 20:45 +0100
Message-ID	<kejqcn$lnc$1@news.albasani.net>
In reply to	#21978

On 01.02.2013 23:13, Arne Vajhøj wrote:
[snip]
> And also round brackets for the last part.
>
> .+?:(.|\n)+
>
> I think I must have set a new world record. 3 bugs in 12 characters.
>
> :-(
>
> Arne
>
Here's a concrete example:

SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
company:company]


The second part is everything after the first comma. I was using
(.+?),[\s\S]+

Arne's suggestion modified for my needs (comma as separator, and I only 
want to capture the first part as a group) will work fine as well:
(.+?),(?:.|\n)+

Can't say though that I find anything to prefer the one to the other.
Perhaps the second looks even more like the result of a cat walk...

-- Sebastian
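
Both patterns can be run against the example above to confirm they capture the same first part. In this sketch the class and method names are hypothetical, and the line break after "bukrs:bukrs," is assumed to be a plain \n.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProtocolCapture {
    // Hypothetical helper: returns capture group 1 if the whole input
    // matches, otherwise null.
    public static String protocol(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "SCA:LIST, select[werks_s:default_plant],"
                + "values[bukrs:bukrs,\ncompany:company]";
        // Both variants stop the capture at the first comma.
        System.out.println(protocol("(.+?),[\\s\\S]+", input));
        System.out.println(protocol("(.+?),(?:.|\\n)+", input));
    }
}
```

Both calls print SCA:LIST.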


#22021

From	markspace <markspace@nospam.nospam>
Date	2013-02-02 12:20 -0800
Message-ID	<kejsdv$ut0$1@dont-email.me>
In reply to	#22020

On 2/2/2013 11:45 AM, Sebastian wrote:
> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
> company:company]

For something this simple you might want to consider just String::split().

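       // Needs: import java.util.Arrays;
       String test =
           "SCA:LIST,select[werks_s:default_plant],values[bukrs:bukrs,company:company]";
       // Limit 2: split only at the first comma and keep the rest intact.
       String[] parse = test.split( ",\\s*", 2 );
       System.out.println( Arrays.toString( parse ) );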

This could be faster since the second half of the regex, (?:.|\n)+, 
doesn't have to execute.


#22025

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-02-02 16:03 -0500
Message-ID	<510d7f3c$0$289$14726298@news.sunsite.dk>
In reply to	#22020

On 2/2/2013 2:45 PM, Sebastian wrote:
> On 01.02.2013 23:13, Arne Vajhøj wrote:
> [snip]
>> And also round brackets for the last part.
>>
>> .+?:(.|\n)+
>>
>> I think I must have set a new world record. 3 bugs in 12 characters.
>>
>> :-(
>>
> Here's a concrete example:
>
> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
> company:company]
>
>
> The second part is everything after the first comma. I was using
> (.+?),[\s\S]+
>
> Arne's suggestion modified for my needs (comma as separator, and I only
> want to capture the first part as a group) will work fine as well:
> (.+?),(?:.|\n)+
>
> Can't say though that I find anything to prefer the one to the other.
> Perhaps the second looks even more like the result of a cat walk...

It is not unusual that there is more than one regex that
does the job.

Arne


#22026

From	Lew <lewbloch@gmail.com>
Date	2013-02-02 13:23 -0800
Message-ID	<e7fe2ed3-a818-4d4d-8744-a4ddfcd71909@googlegroups.com>
In reply to	#22025

Arne Vajhøj wrote:
> Sebastian wrote:
>> Arne Vajhøj wrote:
>> [snip]
>>> And also round brackets for the last part.
>>>
>>> .+?:(.|\n)+
>>>
>>> I think I must have set a new world record. 3 bugs in 12 characters.
> 
>>> :-(
> 
>> Here's a concrete example:
>>
>> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
>> company:company]
> 
>> The second part is everything after the first comma. I was using

You mean 'expression.substring(expression.indexOf(',') + 1)'?
(modulo the usual error checks, of course)

> > (.+?),[\s\S]+

>> Arne's suggestion modified for my needs (comma as separator, and I only
>> want to capture the first part as a group) will work fine as well:

You mean 'expression.substring(0, expression.indexOf(','))'?

> > (.+?),(?:.|\n)+
> 
>> Can't say though that I find anything to prefer the one to the other.
>> Perhaps the second looks even more like the result of a cat walk...

If all you need to do is split a string on a comma, why use regexes at all?

> It is not unusual that there is more than one regex that
> does the job.

It is not unusual that there is more than one non-regex that does the job.

-- 
Lew
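
The two substring calls above, with one of "the usual error checks" spelled out, might look like the sketch below. The class and method names are made up, and throwing IllegalArgumentException when no comma is present is just one possible choice.

```java
public class CommaSplit {
    // Hypothetical helper: splits at the first comma without any regex.
    // Returns { text before the comma, text after the comma }.
    public static String[] splitAtFirstComma(String expression) {
        int comma = expression.indexOf(',');
        if (comma < 0) {
            // Assumed error handling: a missing separator is a caller error.
            throw new IllegalArgumentException("no comma in: " + expression);
        }
        return new String[] {
            expression.substring(0, comma),
            expression.substring(comma + 1)
        };
    }

    public static void main(String[] args) {
        String[] parts = splitAtFirstComma(
                "SCA:LIST, select[werks_s:default_plant],"
                + "values[bukrs:bukrs,\ncompany:company]");
        System.out.println("protocol:  " + parts[0]);
        System.out.println("selection: " + parts[1]);
    }
}
```

Line breaks in the second part need no special handling here, because indexOf and substring do not treat them differently from any other character.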


#22033

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-02-02 20:18 -0500
Message-ID	<510dbb00$0$295$14726298@news.sunsite.dk>
In reply to	#22026

On 2/2/2013 4:23 PM, Lew wrote:
> Arne Vajhøj wrote:
>> Sebastian wrote:
>>> Can't say though that I find anything to prefer the one to the other.
>>> Perhaps the second looks even more like the result of a cat walk...
>
> If all you need to do is split a string on a comma, why use regexes at all?
>
>> It is not unusual that there is more than one regex that
>> does the job.
>
> It is not unusual that there is more than one non-regex that does the job.

True.

But less surprising.

Arne


#22090

From	Gene Wirchenko <genew@telus.net>
Date	2013-02-04 14:26 -0800
Message-ID	<evc0h81jev28t7eiruno7iod56s13ln2vq@4ax.com>
In reply to	#21978

On Fri, 01 Feb 2013 17:13:54 -0500, Arne Vajhøj <arne@vajhoej.dk>
wrote:

[snip]

>I think I must have set a new world record. 3 bugs in 12 characters.
>
>:-(

     I may be able to save your honour.  <G>

     IBM had bugs in a one-instruction program of two bytes long.  The
program was IEFBR14, and you can read about it on Wikipedia.  There
was a series of corrections which resulted in a program several times
larger.

Sincerely,

Gene Wirchenko


#22098

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2013-02-05 00:03 +0000
Message-ID	<kepi95$g6u$1@localhost.localdomain>
In reply to	#22090

On Mon, 04 Feb 2013 14:26:50 -0800, Gene Wirchenko wrote:

> On Fri, 01 Feb 2013 17:13:54 -0500, Arne Vajhøj <arne@vajhoej.dk> wrote:
> 
> [snip]
> 
>>I think I must have set a new world record. 3 bugs in 12 characters.
>>
>>:-(
> 
>      I may be able to save your honour.  <G>
> 
>      IBM had bugs in a one-instruction program of two bytes long.  The
> program was IEFBR14, and you can read about it on Wikipedia.  There was
> a series of corrections which resulted in a program several times
> larger.
>
Quite apart from the programmers needing a medal for the number of bugs 
they managed to write, there would seem to be at least two extra prizes 
to be awarded:

Parkinson Cup:      for the greatest expansion of a program without adding
                    any functionality.

Obscurantist Medal: to the assembler designer for creating the most
                    obscure assembler syntax I've ever seen. 

                    It easily beats Elliott 503 assembler (which had
                    no opcode mnemonics) and KDF6 assembler, which
                    didn't allow variable names.
                     
Thank $DEITY I managed to avoid using an S/360 or its lineal descendants.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |


#21984

From	Robert Klemme <shortcutter@googlemail.com>
Date	2013-02-02 00:08 +0100
Message-ID	<an307dFaqqU1@mid.individual.net>
In reply to	#21972

On 01.02.2013 21:14, Sebastian wrote:
> On 31.01.2013 04:27, Arne Vajhøj wrote:
>> On 1/30/2013 4:34 AM, Sebastian wrote:
>>> I want to match any sequence of characters, including line breaks, in a
>>> suffix of a multi-line string.
>>>
>>> I do not want to use Pattern.DOTALL, because line breaks are not
>>> permissible everywhere. I cannot write [.]* because dot loses its
>>> special meaning inside a character class.
>>>
>>> I have come up with [\S\s]*
>>> as meaning any sequence of non-whitespace or whitespace (incl.
>>> line-breaks). Is there a better way?

Yes.

>> Do you always want to accept line breaks or not? If not then when?

> The string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

Of course you can use DOTALL - as an embedded flag:

package rx;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Dotty {

   private static final Pattern PAT =
     Pattern.compile("proto.*(?s:sel.*)");

   public static void main(String[] args) {
     test("protoPselS");
     test("protoPPselS\nS");
     test("protoP\nPselS\nS");
   }

   public static void test(final CharSequence cs) {
     System.out.println("cs=\"" + cs + "\"");
     final Matcher m = PAT.matcher(cs);

     if (m.matches()) {
       System.out.println("Match: \"" + m.group() + "\"");
     } else {
       System.out.println("Mismatch");
     }

     System.out.println();
   }

}

Kind regards

	robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
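
The embedded-flag approach from this post can also be applied to the comma-separated example given elsewhere in the thread. A sketch under that assumption: the pattern (.+?),(?s:.+) does not appear in the thread itself, and the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmbeddedDotall {
    // DOTALL is switched on only inside the (?s:...) group, so line
    // breaks are accepted after the first comma but not before it.
    private static final Pattern PAT = Pattern.compile("(.+?),(?s:.+)");

    // Hypothetical helper: returns the part before the first comma,
    // or null if the input does not match.
    public static String protocol(String input) {
        Matcher m = PAT.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Line break in the selection expression: accepted.
        System.out.println(protocol("SCA:LIST, select[a:b],\nvalues[c:d]"));
        // Line break in the protocol: rejected.
        System.out.println(protocol("SCA\n:LIST, select[a:b]"));
    }
}
```

The first call prints SCA:LIST, the second prints null. Unlike .|\n, the (?s:...) group also accepts \r and the other line terminators that dot normally excludes.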

