

Groups > comp.lang.java.programmer > #21862 > unrolled thread

Regex: Any character in character class

Started by: Sebastian <news@seyweiler.dyndns.org>
First post: 2013-01-30 10:34 +0100
Last post: 2013-02-02 00:08 +0100
Articles: 19, 9 participants



Contents

  Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-01-30 10:34 +0100
    Re: Regex: Any character in character class Mikhail Vladimirov <vladimirow@mail.ru> - 2013-01-30 02:05 -0800
      Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-01-30 22:26 -0500
    Re: Regex: Any character in character class Mikhail Vladimirov <vladimirow@mail.ru> - 2013-01-30 02:07 -0800
    Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-01-30 22:27 -0500
      Re: Regex: Any character in character class Arved Sandstrom <asandstrom2@eastlink.ca> - 2013-02-01 05:35 -0400
      Re: Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-02-01 21:14 +0100
        Re: Regex: Any character in character class Lew <lewbloch@gmail.com> - 2013-02-01 12:54 -0800
        Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-01 16:47 -0500
          Re: Regex: Any character in character class markspace <markspace@nospam.nospam> - 2013-02-01 14:06 -0800
            Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-01 17:13 -0500
              Re: Regex: Any character in character class Sebastian <news@seyweiler.dyndns.org> - 2013-02-02 20:45 +0100
                Re: Regex: Any character in character class markspace <markspace@nospam.nospam> - 2013-02-02 12:20 -0800
                Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-02 16:03 -0500
                  Re: Regex: Any character in character class Lew <lewbloch@gmail.com> - 2013-02-02 13:23 -0800
                    Re: Regex: Any character in character class Arne Vajhøj <arne@vajhoej.dk> - 2013-02-02 20:18 -0500
              Re: Regex: Any character in character class Gene Wirchenko <genew@telus.net> - 2013-02-04 14:26 -0800
                Re: Regex: Any character in character class Martin Gregorie <martin@address-in-sig.invalid> - 2013-02-05 00:03 +0000
        Re: Regex: Any character in character class Robert Klemme <shortcutter@googlemail.com> - 2013-02-02 00:08 +0100

#21862 — Regex: Any character in character class

From: Sebastian <news@seyweiler.dyndns.org>
Date: 2013-01-30 10:34 +0100
Subject: Regex: Any character in character class
Message-ID: <keapdj$aqh$1@news.albasani.net>

I want to match any sequence of characters, including line breaks, in a 
suffix of a multi-line string.

I do not want to use Pattern.DOTALL, because line breaks are not 
permissible everywhere. I cannot write [.]* because dot loses its 
special meaning inside a character class.

I have come up with [\S\s]*
as meaning any sequence of non-whitespace or whitespace (incl. 
line-breaks). Is there a better way?

-- Sebastian
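
A minimal sketch of the two observations in this post, checked with whole-input matching. The class and method names are illustrative and not from the thread.

```java
import java.util.regex.Pattern;

// Illustrative only: names are hypothetical, not from the original post.
final class DotInClassDemo {

    // Inside a character class the dot is a literal '.', not "any character".
    static boolean literalDotsOnly(String s) {
        return Pattern.matches("[.]*", s);
    }

    // \s and \S together cover every character, line terminators included.
    static boolean anything(String s) {
        return Pattern.matches("[\\s\\S]*", s);
    }

    public static void main(String[] args) {
        System.out.println(literalDotsOnly("..."));     // true
        System.out.println(literalDotsOnly("abc"));     // false
        System.out.println(anything("abc\r\ndef\n"));   // true
    }
}
```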



#21864

From: Mikhail Vladimirov <vladimirow@mail.ru>
Date: 2013-01-30 02:05 -0800
Message-ID: <bf443cc2-1b97-4e28-8dce-60690fe1905a@googlegroups.com>
In reply to: #21862

What about [^]?



#21909

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2013-01-30 22:26 -0500
Message-ID: <5109e465$0$295$14726298@news.sunsite.dk>
In reply to: #21864

On 1/30/2013 5:05 AM, Mikhail Vladimirov wrote:
> What about [^]?

java.util.regex.PatternSyntaxException

Arne
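
For anyone who wants to reproduce this: some other regex flavors (JavaScript, for instance) accept [^] as "any character", but java.util.regex rejects it at compile time. A small sketch, with a hypothetical helper name:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Illustrative only: the helper name is hypothetical.
final class EmptyNegatedClassDemo {

    // Returns true if java.util.regex accepts the pattern text.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[^]"));       // false
        System.out.println(compiles("[\\s\\S]"));  // true
    }
}
```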



#21865

From: Mikhail Vladimirov <vladimirow@mail.ru>
Date: 2013-01-30 02:07 -0800
Message-ID: <274bdf05-9273-44f9-aacd-1aee5fdf91dc@googlegroups.com>
In reply to: #21862

Another option is .|\n
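
One caveat worth checking before relying on this: by default the dot in java.util.regex excludes every line terminator (including \r), so .|\n does not cover Windows-style \r\n line breaks, while [\s\S] does. A sketch with hypothetical method names:

```java
// Illustrative only: names are hypothetical.
final class DotOrNewlineDemo {

    // Dot (no line terminators) or a line feed, repeated.
    static boolean dotOrNewline(String s) {
        return s.matches("(?:.|\\n)*");
    }

    // Whitespace or non-whitespace, i.e. any character, repeated.
    static boolean spaceOrNonSpace(String s) {
        return s.matches("[\\s\\S]*");
    }

    public static void main(String[] args) {
        System.out.println(dotOrNewline("a\nb"));       // true
        System.out.println(dotOrNewline("a\r\nb"));     // false: '\r' matches neither branch
        System.out.println(spaceOrNonSpace("a\r\nb"));  // true
    }
}
```

If carriage returns can occur in the input, (?:.|\r|\n) or [\s\S] would be the safer choice.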



#21910

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2013-01-30 22:27 -0500
Message-ID: <5109e49b$0$295$14726298@news.sunsite.dk>
In reply to: #21862

On 1/30/2013 4:34 AM, Sebastian wrote:
> I want to match any sequence of characters, including line breaks, in a
> suffix of a multi-line string.
>
> I do not want to use Pattern.DOTALL, because line breaks are not
> permissible everywhere. I cannot write [.]* because dot loses its
> special meaning inside a character class.
>
> I have come up with [\S\s]*
> as meaning any sequence of non-whitespace or whitespace (incl.
> line-breaks). Is there a better way?

Do you always want to accept line breaks or not? If not then when?

Arne



#21951

From: Arved Sandstrom <asandstrom2@eastlink.ca>
Date: 2013-02-01 05:35 -0400
Message-ID: <I%LOs.8145$Sq4.8135@newsfe14.iad>
In reply to: #21910

On 01/30/2013 11:27 PM, Arne Vajhøj wrote:
> On 1/30/2013 4:34 AM, Sebastian wrote:
>> I want to match any sequence of characters, including line breaks, in a
>> suffix of a multi-line string.
>>
>> I do not want to use Pattern.DOTALL, because line breaks are not
>> permissible everywhere. I cannot write [.]* because dot loses its
>> special meaning inside a character class.
>>
>> I have come up with [\S\s]*
>> as meaning any sequence of non-whitespace or whitespace (incl.
>> line-breaks). Is there a better way?
>
> Do you always want to accept line breaks or not? If not then when?
>
> Arne
>
>
Good question.

I take it the suffix is a generic last-N characters of the string 
(Assumption #1). I take it that line breaks are OK in the suffix, not 
necessarily so in the rest of the string (Assumption #2).

If you don't mind me asking, why don't you just grab the suffix, the 
last N characters, with substring()? That *is* your match.

AHS



#21972

From: Sebastian <news@seyweiler.dyndns.org>
Date: 2013-02-01 21:14 +0100
Message-ID: <keh7mp$pjm$1@news.albasani.net>
In reply to: #21910

On 31.01.2013 04:27, Arne Vajhøj wrote:
> On 1/30/2013 4:34 AM, Sebastian wrote:
>> I want to match any sequence of characters, including line breaks, in a
>> suffix of a multi-line string.
>>
>> I do not want to use Pattern.DOTALL, because line breaks are not
>> permissible everywhere. I cannot write [.]* because dot loses its
>> special meaning inside a character class.
>>
>> I have come up with [\S\s]*
>> as meaning any sequence of non-whitespace or whitespace (incl.
>> line-breaks). Is there a better way?
>
> Do you always want to accept line breaks or not? If not then when?
>
> Arne
>
>
The string I want to match basically has two parts (a "protocol" and a
"selection expression"). I want to allow line breaks anywhere in the
selection expression, but not in the protocol.
-- S.



#21973

From: Lew <lewbloch@gmail.com>
Date: 2013-02-01 12:54 -0800
Message-ID: <9463f7ba-413e-4604-bbf0-a5a7f5c9984d@googlegroups.com>
In reply to: #21972

Sebastian wrote:
> the string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

How do you tell which part is which?

-- 
Lew



#21974

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2013-02-01 16:47 -0500
Message-ID: <510c380b$0$287$14726298@news.sunsite.dk>
In reply to: #21972

On 2/1/2013 3:14 PM, Sebastian wrote:
> On 31.01.2013 04:27, Arne Vajhøj wrote:
>> On 1/30/2013 4:34 AM, Sebastian wrote:
>>> I want to match any sequence of characters, including line breaks, in a
>>> suffix of a multi-line string.
>>>
>>> I do not want to use Pattern.DOTALL, because line breaks are not
>>> permissible everywhere. I cannot write [.]* because dot loses its
>>> special meaning inside a character class.
>>>
>>> I have come up with [\S\s]*
>>> as meaning any sequence of non-whitespace or whitespace (incl.
>>> line-breaks). Is there a better way?
>>
>> Do you always want to accept line breaks or not? If not then when?

> the string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

Do you have a separator between the two parts like colon in URL's?

If yes then something like:

[.]+:[.|\n]+

Arne



#21977

From: markspace <markspace@nospam.nospam>
Date: 2013-02-01 14:06 -0800
Message-ID: <kehe9j$otg$1@dont-email.me>
In reply to: #21974

On 2/1/2013 1:47 PM, Arne Vajhøj wrote:

> [.]+:[.|\n]+


Watch out for this.  +, being greedy, will match a : in the selection 
expression (the 2nd part) if : is allowed in the second part.

The reluctant modifier might be a better idea here:

.+?:[.|\n]+

Note that I don't think the initial brackets [] were needed.  Also we're 
yet again starting to see the problem with regex: it always evolves into 
something that looks like your cat walked across the keyboard.
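
The greedy versus reluctant difference can be seen with a capturing group around the first part. This is only a sketch: the input string and helper name are made up, and the second half is written as the non-capturing group (?:.|\n)+ so that the pattern matches ordinary text at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: helper name and sample input are hypothetical.
final class GreedyVsReluctantDemo {

    // Returns group 1 when the whole input matches, or null otherwise.
    static String firstPart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "proto:sel:more";
        // Greedy: the first part runs up to the LAST colon.
        System.out.println(firstPart("(.+):(?:.|\\n)+", input));   // proto:sel
        // Reluctant: the first part stops at the FIRST colon.
        System.out.println(firstPart("(.+?):(?:.|\\n)+", input));  // proto
    }
}
```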



#21978

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2013-02-01 17:13 -0500
Message-ID: <510c3e29$0$287$14726298@news.sunsite.dk>
In reply to: #21977

On 2/1/2013 5:06 PM, markspace wrote:
> On 2/1/2013 1:47 PM, Arne Vajhøj wrote:
>
>> [.]+:[.|\n]+
>
>
> Watch out for this.  +, being greedy, will match a : in the selection
> expression (the 2nd part) if : is allowed in the second part.
>
> The reluctant modifier might be a better idea here:
>
> .+?:[.|\n]+
>
> Note that I don't think the initial brackets [] were needed.  Also we're
> yet again starting to see the problem with regex: it always evolves into
> something that looks like your cat walked across the keyboard.

You are absolutely right.

Non greedy.

No square brackets for first part.

And also round brackets for the last part.

.+?:(.|\n)+

I think I must have set a new world record. 3 bugs in 12 characters.

:-(

Arne
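
To make the bracket bug concrete: inside square brackets, '.' and '|' are ordinary characters, so [.|\n] is the set {dot, pipe, line feed}, whereas (.|\n) is a real alternation. A short sketch with hypothetical method names:

```java
// Illustrative only: names are hypothetical.
final class ClassVersusGroupDemo {

    // Character class: only '.', '|' and line feed are members.
    static boolean asClass(String s) {
        return s.matches("[.|\\n]+");
    }

    // Group with alternation: any non-terminator character, or a line feed.
    static boolean asGroup(String s) {
        return s.matches("(.|\\n)+");
    }

    public static void main(String[] args) {
        System.out.println(asClass(".|\n"));       // true
        System.out.println(asClass("abc"));        // false
        System.out.println(asGroup("abc\ndef"));   // true
    }
}
```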



#22020

From: Sebastian <news@seyweiler.dyndns.org>
Date: 2013-02-02 20:45 +0100
Message-ID: <kejqcn$lnc$1@news.albasani.net>
In reply to: #21978

On 01.02.2013 23:13, Arne Vajhøj wrote:
[snip]
> And also round brackets for the last part.
>
> .+?:(.|\n)+
>
> I think I must have set a new world record. 3 bugs in 12 characters.
>
> :-(
>
> Arne
>
Here's a concrete example:

SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
company:company]


The second part is everything after the first comma. I was using
(.+?),[\s\S]+

Arne's suggestion modified for my needs (comma as separator, and I only 
want to capture the first part as a group) will work fine as well:
(.+?),(?:.|\n)+

Can't say though that I find anything to prefer the one to the other.
Perhaps the second looks even more like the result of a cat walk...

-- Sebastian
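
Both patterns can be tried against the concrete example as follows. This is a sketch only: the class and helper names are made up, and the line break in the input is assumed to be a plain \n.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names are hypothetical; the input is the example above,
// with the line break assumed to be a single '\n'.
final class ConcreteExampleDemo {

    static final String INPUT =
        "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,\n"
        + "company:company]";

    // Returns the captured first part, or null if the input does not match.
    static String protocol(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(protocol("(.+?),[\\s\\S]+"));    // SCA:LIST
        System.out.println(protocol("(.+?),(?:.|\\n)+"));   // SCA:LIST
    }
}
```

One practical difference that is commonly reported: a repeated alternation such as (?:.|\n)+ makes java.util.regex recurse per character and can end in a StackOverflowError on very long inputs, which the character class form avoids.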



#22021

From: markspace <markspace@nospam.nospam>
Date: 2013-02-02 12:20 -0800
Message-ID: <kejsdv$ut0$1@dont-email.me>
In reply to: #22020

On 2/2/2013 11:45 AM, Sebastian wrote:
> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
> company:company]

For something this simple you might want to consider just String::split().

       // Needs: import java.util.Arrays;
       String test =
           "SCA:LIST,select[werks_s:default_plant],values[bukrs:bukrs,company:company]";
       String[] parse = test.split( ",\\s*", 2 );   // limit 2: split at the first comma only
       System.out.println( Arrays.toString( parse ) );

This could be faster since the second half of the regex, (?:.|\n)+, 
doesn't have to execute.



#22025

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2013-02-02 16:03 -0500
Message-ID: <510d7f3c$0$289$14726298@news.sunsite.dk>
In reply to: #22020

On 2/2/2013 2:45 PM, Sebastian wrote:
> On 01.02.2013 23:13, Arne Vajhøj wrote:
> [snip]
>> And also round brackets for the last part.
>>
>> .+?:(.|\n)+
>>
>> I think I must have set a new world record. 3 bugs in 12 characters.
>>
>> :-(
>>
> Here's a concrete example:
>
> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
> company:company]
>
>
> The second part is everything after the first comma. I was using
> (.+?),[\s\S]+
>
> Arne's suggestion modified for my needs (comma as separator, and I only
> want to capture the first part as a group) will work fine as well:
> (.+?),(?:.|\n)+
>
> Can't say though that I find anything to prefer the one to the other.
> Perhaps the second looks even more like the result of a cat walk...

It is not unusual that there is more than one regex that
does the job.

Arne



#22026

From: Lew <lewbloch@gmail.com>
Date: 2013-02-02 13:23 -0800
Message-ID: <e7fe2ed3-a818-4d4d-8744-a4ddfcd71909@googlegroups.com>
In reply to: #22025

Arne Vajhøj wrote:
> Sebastian wrote:
>> Arne Vajhøj wrote:
>> [snip]
>>> And also round brackets for the last part.
>>>
>>> .+?:(.|\n)+
>>>
>>> I think I must have set a new world record. 3 bugs in 12 characters.
> 
>>> :-(
> 
>> Here's a concrete example:
>>
>> SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,
>> company:company]
> 
>> The second part is everything after the first comma. I was using

You mean 'expression.substring(expression.indexOf(',') + 1)'?
(modulo the usual error checks, of course)

> > (.+?),[\s\S]+

>> Arne's suggestion modified for my needs (comma as separator, and I only
>> want to capture the first part as a group) will work fine as well:

You mean 'expression.substring(0, expression.indexOf(','))'?

> > (.+?),(?:.|\n)+
> 
>> Can't say though that I find anything to prefer the one to the other.
>> Perhaps the second looks even more like the result of a cat walk...

If all you need to do is split a string on a comma, why use regexes at all?

> It is not unusual that there is more than one regex that
> does the job.

It is not unusual that there is more than one non-regex that does the job.

-- 
Lew
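
A sketch of the non-regex route with "the usual error checks" spelled out. The method name and the specific checks are assumptions; the one-line protocol check mirrors the requirement stated earlier in the thread that line breaks are not allowed in the protocol.

```java
// Illustrative only: the method name and the chosen checks are assumptions.
final class CommaSplitDemo {

    // Returns {protocol, selection}, split at the first comma.
    static String[] splitAtFirstComma(String expression) {
        int comma = expression.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("no comma found");
        }
        String protocol = expression.substring(0, comma);
        String selection = expression.substring(comma + 1);
        // Assumed rule: the protocol is non-empty and contains no line break.
        if (protocol.isEmpty()
                || protocol.indexOf('\n') >= 0
                || protocol.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("protocol must be non-empty and on one line");
        }
        return new String[] { protocol, selection };
    }

    public static void main(String[] args) {
        String[] parts = splitAtFirstComma("SCA:LIST, select[x],values[a,\nb]");
        System.out.println(parts[0]);  // SCA:LIST
        System.out.println(parts[1]);  // the rest, leading space and line break kept
    }
}
```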



#22033

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2013-02-02 20:18 -0500
Message-ID: <510dbb00$0$295$14726298@news.sunsite.dk>
In reply to: #22026

On 2/2/2013 4:23 PM, Lew wrote:
> Arne Vajhøj wrote:
>> Sebastian wrote:
>>> Can't say though that I find anything to prefer the one to the other.
>>> Perhaps the second looks even more like the result of a cat walk...
>
> If all you need to do is split a string on a comma, why use regexes at all?
>
>> It is not unusual that there is more than one regex that
>> does the job.
>
> It is not unusual that there is more than one non-regex that does the job.

True.

But less surprising.

Arne



#22090

From: Gene Wirchenko <genew@telus.net>
Date: 2013-02-04 14:26 -0800
Message-ID: <evc0h81jev28t7eiruno7iod56s13ln2vq@4ax.com>
In reply to: #21978

On Fri, 01 Feb 2013 17:13:54 -0500, Arne Vajhøj <arne@vajhoej.dk>
wrote:

[snip]

>I think I must have set a new world record. 3 bugs in 12 characters.
>
>:-(

     I may be able to save your honour.  <G>

     IBM had bugs in a one-instruction program two bytes long.  The
program was IEFBR14, and you can read about it on Wikipedia.  There
was a series of corrections which resulted in a program several times
larger.

Sincerely,

Gene Wirchenko



#22098

From: Martin Gregorie <martin@address-in-sig.invalid>
Date: 2013-02-05 00:03 +0000
Message-ID: <kepi95$g6u$1@localhost.localdomain>
In reply to: #22090

On Mon, 04 Feb 2013 14:26:50 -0800, Gene Wirchenko wrote:

> On Fri, 01 Feb 2013 17:13:54 -0500, Arne Vajhøj <arne@vajhoej.dk> wrote:
> 
> [snip]
> 
>>I think I must have set a new world record. 3 bugs in 12 characters.
>>
>>:-(
> 
>      I may be able to save your honour.  <G>
> 
>      IBM had bugs in a one-instruction program two bytes long.  The
> program was IEFBR14, and you can read about it on Wikipedia.  There was
> a series of corrections which resulted in a program several times
> larger.
>
Quite apart from the programmers needing a medal for the number of bugs 
they managed to write, there would seem to be at least two extra prizes 
to be awarded:

Parkinson Cup:      for the greatest expansion of a program without adding
                    any functionality.

Obscurantist Medal: to the assembler designer for creating the most
                    obscure assembler syntax I've ever seen. 

                    It easily beats Elliott 503 assembler (which had
                    no opcode mnemonics) and KDF6 assembler, which
                    didn't allow variable names.
                     
Thank $DEITY I managed to avoid using an S/360 or its lineal descendants.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |



#21984

From: Robert Klemme <shortcutter@googlemail.com>
Date: 2013-02-02 00:08 +0100
Message-ID: <an307dFaqqU1@mid.individual.net>
In reply to: #21972

On 01.02.2013 21:14, Sebastian wrote:
> On 31.01.2013 04:27, Arne Vajhøj wrote:
>> On 1/30/2013 4:34 AM, Sebastian wrote:
>>> I want to match any sequence of characters, including line breaks, in a
>>> suffix of a multi-line string.
>>>
>>> I do not want to use Pattern.DOTALL, because line breaks are not
>>> permissible everywhere. I cannot write [.]* because dot loses its
>>> special meaning inside a character class.
>>>
>>> I have come up with [\S\s]*
>>> as meaning any sequence of non-whitespace or whitespace (incl.
>>> line-breaks). Is there a better way?

Yes.

>> Do you always want to accept line breaks or not? If not then when?

> the string I want to match basically has two parts (a "protocol" and a
> "selection expression"). I want to allow line breaks anywhere in the
> selection expression, but not in the protocol.

Of course you can use DOTALL - as an embedded flag:

package rx;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Dotty {

   private static final Pattern PAT =
     Pattern.compile("proto.*(?s:sel.*)");

   public static void main(String[] args) {
     test("protoPselS");
     test("protoPPselS\nS");
     test("protoP\nPselS\nS");
   }

   public static void test(final CharSequence cs) {
     System.out.println("cs=\"" + cs + "\"");
     final Matcher m = PAT.matcher(cs);

     if (m.matches()) {
       System.out.println("Match: \"" + m.group() + "\"");
     } else {
       System.out.println("Mismatch");
     }

     System.out.println();
   }

}

Kind regards

	robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
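
The embedded-flag idea can also be applied to the comma-separated example given elsewhere in the thread. This is a sketch under assumptions: the class name, helper name and the second test input are made up, and the pattern (.+?),(?s:.+) is one possible adaptation rather than anything posted by a participant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names and the pattern adaptation are assumptions.
final class EmbeddedDotallDemo {

    // DOTALL is switched on only inside (?s: ... ), i.e. after the first comma.
    private static final Pattern PAT = Pattern.compile("(.+?),(?s:.+)");

    // Returns the protocol part, or null if the input does not match.
    static String protocol(String input) {
        Matcher m = PAT.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Line break in the selection expression: accepted.
        System.out.println(protocol(
            "SCA:LIST, select[werks_s:default_plant],values[bukrs:bukrs,\ncompany:company]"));
        // prints SCA:LIST

        // Line break in the protocol: rejected.
        System.out.println(protocol("SCA:\nLIST, select[x]"));
        // prints null
    }
}
```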




