Groups > comp.lang.python > #17993 > unrolled thread

Regular expressions

Started by	"mauriceling@acm.org" <mauriceling@gmail.com>
First post	2011-12-26 15:45 -0800
Last post	2011-12-27 01:26 +0000
Articles	9 — 6 participants

  Regular expressions "mauriceling@acm.org" <mauriceling@gmail.com> - 2011-12-26 15:45 -0800
    Re: Regular expressions Chris Angelico <rosuav@gmail.com> - 2011-12-27 11:00 +1100
      Re: Regular expressions "mauriceling@acm.org" <mauriceling@gmail.com> - 2011-12-26 16:15 -0800
        Re: Regular expressions Fredrik Tolf <fredrik@dolda2000.com> - 2011-12-27 06:01 +0100
          Re: Regular expressions rusi <rustompmody@gmail.com> - 2011-12-27 23:05 -0800
    Re: Regular expressions Roy Smith <roy@panix.com> - 2011-12-26 19:07 -0500
    Re: Regular expressions Jason Friedman <jason@powerpull.net> - 2011-12-27 00:16 +0000
      Re: Regular expressions "mauriceling@acm.org" <mauriceling@gmail.com> - 2011-12-26 16:24 -0800
        Re: Regular expressions Jason Friedman <jason@powerpull.net> - 2011-12-27 01:26 +0000

#17993 — Regular expressions

From	"mauriceling@acm.org" <mauriceling@gmail.com>
Date	2011-12-26 15:45 -0800
Subject	Regular expressions
Message-ID	<495b6fe6-704a-42fc-b10b-484218ad8409@b20g2000pro.googlegroups.com>

Hi

I am trying to change "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"
to "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1".

Can anyone help me with the regular expressions needed?

Thanks in advance.

Maurice

#17994

From	Chris Angelico <rosuav@gmail.com>
Date	2011-12-27 11:00 +1100
Message-ID	<mailman.4117.1324944010.27778.python-list@python.org>
In reply to	#17993

On Tue, Dec 27, 2011 at 10:45 AM, mauriceling@acm.org
<mauriceling@gmail.com> wrote:
> Hi
>
> I am trying to change <one string> to <another string>.
>
> Can anyone help me with the regular expressions needed?

A regular expression defines a set of strings based on rules. Without seeing a
lot more strings, we can't know what possibilities there are for each
part of the string. You probably know your data better than we ever
will, even eyeballing the entire set of strings; just write down, in
order, what the pieces ought to be - for instance, the first token
might be a literal @ sign, followed by three upper-case letters, then
a hyphen, then any number of alphanumerics followed by a colon, etc.
Once you have that, it's fairly straightforward to translate that into
regex syntax.
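
As a rough illustration of that translation step, the sketch below writes down the pieces of the one sample header from the original post as a verbose regex. The field layout (a three-letter prefix, six further colon-separated alphanumeric fields, then digit, letter, digit) is only an assumption read off that single sample, and the names HEADER and sample are made up for the example.

```python
import re

# Assumed layout, read off the single sample in the original post.
HEADER = re.compile(r"""
    ^@                       # a literal at-sign
    [A-Z]{3}-[A-Za-z0-9]+    # three upper-case letters, a hyphen, alphanumerics
    (?::[A-Za-z0-9]+){6}     # six more fields, each introduced by a colon
    \s+                      # the whitespace that should become a slash
    (?P<read>\d+)            # the number to keep
    :[A-Z]:\d+:$             # the trailing fields to drop
""", re.VERBOSE)

sample = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"
match = HEADER.match(sample)
print(match.group("read"))
```

With re.VERBOSE each rule sits on its own commented line, so the pattern can be checked piece by piece against the real data and loosened wherever it turns out to be too strict.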

ChrisA

#17996

From	"mauriceling@acm.org" <mauriceling@gmail.com>
Date	2011-12-26 16:15 -0800
Message-ID	<4be34afe-4291-414b-9212-498074400e39@v24g2000prn.googlegroups.com>
In reply to	#17994

On Dec 27, 8:00 am, Chris Angelico <ros...@gmail.com> wrote:
> On Tue, Dec 27, 2011 at 10:45 AM, mauricel...@acm.org
>
> <mauricel...@gmail.com> wrote:
> > Hi
>
> > I am trying to change <one string> to <another string>.
>
> > Can anyone help me with the regular expressions needed?
>
> A regular expression defines a string based on rules. Without seeing a
> lot more strings, we can't know what possibilities there are for each
> part of the string. You probably know your data better than we ever
> will, even eyeballing the entire set of strings; just write down, in
> order, what the pieces ought to be - for instance, the first token
> might be a literal @ sign, followed by three upper-case letters, then
> a hyphen, then any number of alphanumerics followed by a colon, etc.
> Once you have that, it's fairly straightforward to translate that into
> regex syntax.
>
> ChrisA

I've tried

re.sub('@\S\s[1-9]:[A-N]:[0-9]', '@\S\s',
       '@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')

but it does not seem to work.

#18006

From	Fredrik Tolf <fredrik@dolda2000.com>
Date	2011-12-27 06:01 +0100
Message-ID	<mailman.4123.1324962099.27778.python-list@python.org>
In reply to	#17996

On Mon, 26 Dec 2011, mauriceling@acm.org wrote:
> I've tried
>
> re.sub('@\S\s[1-9]:[A-N]:[0-9]', '@\S\s', '@HWI-ST115:568:B08LLABXX:
> 1:1105:6465:151103 1:N:0:')
>
> but it does not seems to work.

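Indeed, for several reasons. First of all, backslash sequences in an ordinary
string literal are subject to Python's own string escapes. "\S" and "\s" happen
to pass through unchanged, but sequences such as "\1" and "\b" do not, so you
should write either "\\S" or r"\S" (the r, for raw, turns off backslash escapes).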
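
A small sketch of what the r prefix changes, using the backreference-style sequences "\1" and "\2", where the effect is easy to see. The two-word pattern and the "left right" input are made up for the illustration.

```python
import re

plain = "@\1/\2"    # Python converts \1 and \2 to control characters first
raw = r"@\1/\2"     # the backslashes reach the re module untouched

print(len(plain), len(raw))

# Only the raw version still contains references to groups 1 and 2.
good_result = re.sub(r"(\w+) (\w+)", raw, "left right")
bad_result = re.sub(r"(\w+) (\w+)", plain, "left right")
print(repr(good_result))
print(repr(bad_result))
```

Using raw strings for both the pattern and the replacement avoids having to remember which sequences Python would otherwise rewrite.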

Second, when you use only "\S", that matches a single non-space character, 
not several; you'll need to quantify them. "\S*" will match zero or more, 
"\S+" will match one or more, "\S?" will match zero or one, and there are 
a couple of other possibilities as well (see the manual for details). In 
this case, you probably want to use "+" for most of those.
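
The quantifiers can be compared side by side on a fragment of the sample string. This is only a sketch and the variable names are arbitrary.

```python
import re

text = "151103 1:N:0:"

one = re.match(r"\S", text).group()                   # exactly one non-space character
one_or_more = re.match(r"\S+", text).group()          # at least one, as many as possible
zero_or_more = re.match(r"\S*", " " + text).group()   # may match nothing at all
zero_or_one = re.match(r"\S?", text).group()          # at most one

print(repr(one), repr(one_or_more), repr(zero_or_more), repr(zero_or_one))
```

Note that "\S*" still succeeds on input that starts with a space; it simply matches the empty string, which is why "+" is usually the safer choice when a field must not be empty.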

Third, you're not marking the groups that you want to use in the 
replacement. Since you want to retain the entire string before the space, 
and the numeric element, you'll want to enclose them in parentheses to 
mark them as groups.

Fourth, your replacement string is entirely wacky. You don't use sequences 
such as "\S" and "\s" to refer back to groups in the original text, but 
numbered references, which refer back to parenthesized groups in the order 
they appear in the regex. In accordance with what you seem to want, you 
should probably use "@\1/\2" in your case ("\1" refers back to the first 
parenthesized group, which would be the first "\S+" part, and "\2" to the 
second group, which should be the "[1-9]+" part; the at-mark and slash 
are inserted as they are into the result string).
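
To make the numbering concrete, the sketch below captures the two groups from the sample string and expands the "@\1/\2" template by hand. Match.expand is used here only to show what re.sub does with the replacement for each match.

```python
import re

line = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"
m = re.search(r"@(\S+)\s([1-9]+)", line)

print(m.group(1))            # everything between the at-sign and the space
print(m.group(2))            # the digits right after the space
print(m.expand(r"@\1/\2"))   # the template with both groups filled in
```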

Fifth, you'll probably want to match the last colon as well, so that it is 
not retained in the result string.
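
A quick way to see why the last colon matters: re.sub only replaces the text the pattern actually matched, so anything after the match is copied to the result unchanged. The sketch below deliberately stops the pattern one character short.

```python
import re

line = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"

# The pattern ends at the final digit, so the trailing colon is not consumed.
without_colon = re.sub(r"@(\S+)\s([1-9]+):[A-N]+:[0-9]+", r"@\1/\2", line)
print(without_colon)
```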

All in all, you will probably want to use something like this to correct 
that regex:

re.sub(r'@(\S+)\s([1-9]+):[A-N]+:[0-9]+:', r'@\1/\2',
        '@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')

Also, you may be interested to know that you can use "\d" instead of 
"[0-9]".

--

Fredrik Tolf

#18105

From	rusi <rustompmody@gmail.com>
Date	2011-12-27 23:05 -0800
Message-ID	<675670e4-a954-46f3-a90d-cd4c5934ffec@h7g2000prn.googlegroups.com>
In reply to	#18006

On Dec 27, 10:01 am, Fredrik Tolf <fred...@dolda2000.com> wrote:
> On Mon, 26 Dec 2011, mauricel...@acm.org wrote:
> > I've tried
>
> > re.sub('@\S\s[1-9]:[A-N]:[0-9]', '@\S\s', '@HWI-ST115:568:B08LLABXX:
> > 1:1105:6465:151103 1:N:0:')
>
> > but it does not seems to work.
>
> Indeed, for several reasons. First of all, your backslash sequences are
> interpreted by Python as string escapes. You'll need to write either "\\S"
> or r"\S" (the r, for raw, turns off backslash escapes).
>
> Second, when you use only "\S", that matches a single non-space character,
> not several; you'll need to quantify them. "\S*" will match zero or more,
> "\S+" will match one or more, "\S?" will match zero or one, and there are
> a couple of other possibilities as well (see the manual for details). In
> this case, you probably want to use "+" for most of those.
>
> Third, you're not marking the groups that you want to use in the
> replacement. Since you want to retain the entire string before the space,
> and the numeric element, you'll want to enclose them in parentheses to
> mark them as groups.
>
> Fourth, your replacement string is entirely wacky. You don't use sequences
> such as "\S" and "\s" to refer back to groups in the original text, but
> numbered references, to refer back to parenthesized groups in the order
> they appear in the regex. In accordance what you seemed to want, you
> should probably use "@\1/\2" in your case ("\1" refers back to the first
> parentesized group, which you be the first "\S+" part, and "\2" to the
> second group, which should be the "[1-9]+" part; the at-mark and slash
> are inserted as they are into the result string).
>
> Fifth, you'll probably want to match the last colon as well, in order not
> to retain it into the result string.
>
> All in all, you will probably want to use something like this to correct
> that regex:
>
> re.sub(r'@(\S+)\s([1-9]+):[A-N]+:[0-9]+:', r'@\1/\2',
>         '@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')
>
> Also, you may be interested to know that you can use "\d" instead of
> "[0-9]".
>
> --
>
> Fredrik Tolf

For practical 'get-the-hands-dirty' experience look at

python-specific:  http://kodos.sourceforge.net/
Online: http://gskinner.com/RegExr/
emacs-specific: re-builder and regex-tool http://bc.tech.coop/blog/071103.html

#17995

From	Roy Smith <roy@panix.com>
Date	2011-12-26 19:07 -0500
Message-ID	<roy-1ED2CF.19074926122011@news.panix.com>
In reply to	#17993

In article 
<495b6fe6-704a-42fc-b10b-484218ad8409@b20g2000pro.googlegroups.com>,
 "mauriceling@acm.org" <mauriceling@gmail.com> wrote:

> Hi
> 
> I am trying to change "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:
> 0:" to "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1".
> 
> Can anyone help me with the regular expressions needed?

Easy-peasy:

import re
input = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N: 0:"
output = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1"
pattern = re.compile(
    r'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N: 0:')
out = pattern.sub(
    r'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1',
    input)
assert out == output

To be honest, I wouldn't do this with a regex.  I'm not quite sure what 
you're trying to do, but I'm guessing it's something like "Get 
everything after the first space in the string; keep just the integer 
that's before the first ':' in that and turn the space into a slash".  
In that case, I'd do something like:

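# Split once on the first space, then once on the first colon.
head, tail = input.split(' ', 1)
number, _ = tail.split(':', 1)  # maxsplit=1, otherwise the unpacking fails
print("%s/%s" % (head, number))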
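
The same idea can be wrapped in a small function. This is a sketch under the assumption that lines without the expected space or colon should be passed through untouched; the name to_slash_form is made up for the example, and str.partition is swapped in for split because it never raises on a missing separator.

```python
def to_slash_form(header):
    """Return 'head/number' for 'head number:rest', else header unchanged."""
    head, space, tail = header.partition(" ")
    number, colon, _rest = tail.partition(":")
    if not space or not colon:
        return header  # nothing that looks like the expected layout
    return "%s/%s" % (head, number)

print(to_slash_form("@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"))
```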

#17997

From	Jason Friedman <jason@powerpull.net>
Date	2011-12-27 00:16 +0000
Message-ID	<mailman.4119.1324945014.27778.python-list@python.org>
In reply to	#17993

> On Tue, Dec 27, 2011 at 10:45 AM, mauriceling@acm.org
> <mauriceling@gmail.com> wrote:
>> Hi
>>
>> I am trying to change <one string> to <another string>.
>>
>> Can anyone help me with the regular expressions needed?
>
> A regular expression defines a string based on rules. Without seeing a
> lot more strings, we can't know what possibilities there are for each
> part of the string. You probably know your data better than we ever
> will, even eyeballing the entire set of strings; just write down, in
> order, what the pieces ought to be - for instance, the first token
> might be a literal @ sign, followed by three upper-case letters, then
> a hyphen, then any number of alphanumerics followed by a colon, etc.
> Once you have that, it's fairly straightforward to translate that into
> regex syntax.
>
> ChrisA

The OP told me, off list, that my guess was true:

> Can we say that your string:
> 1) Contains 7 colon-delimited fields, followed by
> 2) whitespace, followed by
> 3) 3 colon-delimited fields (A, B, C), followed by
> 4) a colon?
> The transformation needed is that the whitespace is replaced by a
> slash, the "A" characters are taken as is, and the colons and fields
> following the "A" characters are eliminated?

Doubtful that my guess was 100% accurate, but nevertheless:

>>> import re
>>> string1 = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"
>>> re.sub(r"(\S+)\s+(\S+?):.+", r"\g<1>/\g<2>", string1)
'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1'
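
If the guessed structure needs to be enforced rather than just assumed, the field counts can be spelled out in the pattern. The sketch below is one possible stricter variant, not the pattern used above; FIELD, STRICT and the short test line are hypothetical names and data introduced for the illustration.

```python
import re

FIELD = r"[^:\s]+"   # one field: anything except colons and whitespace
STRICT = re.compile(
    r"^((?:%s:){6}%s)"    # seven colon-delimited fields
    r"\s+"                # whitespace, to be replaced by a slash
    r"(%s):%s:%s:$"       # fields A, B and C, then the final colon
    % (FIELD, FIELD, FIELD, FIELD, FIELD)
)

string1 = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"
converted = STRICT.sub(r"\1/\2", string1)
print(converted)

# A line with the wrong number of fields does not match and is left alone.
untouched = STRICT.sub(r"\1/\2", "@A:B 1:N:0:")
print(untouched)
```

The looser pattern accepts almost any line containing whitespace and a colon, while this one leaves unexpected lines unchanged, which makes malformed input easier to spot afterwards.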

#17998

From	"mauriceling@acm.org" <mauriceling@gmail.com>
Date	2011-12-26 16:24 -0800
Message-ID	<34666b77-9e7a-4093-b7c4-74adeceb6555@l18g2000pro.googlegroups.com>
In reply to	#17997

On Dec 27, 8:16 am, Jason Friedman <ja...@powerpull.net> wrote:
> > On Tue, Dec 27, 2011 at 10:45 AM, mauricel...@acm.org
> > <mauricel...@gmail.com> wrote:
> >> Hi
>
> >> I am trying to change <one string> to <another string>.
>
> >> Can anyone help me with the regular expressions needed?
>
> > A regular expression defines a string based on rules. Without seeing a
> > lot more strings, we can't know what possibilities there are for each
> > part of the string. You probably know your data better than we ever
> > will, even eyeballing the entire set of strings; just write down, in
> > order, what the pieces ought to be - for instance, the first token
> > might be a literal @ sign, followed by three upper-case letters, then
> > a hyphen, then any number of alphanumerics followed by a colon, etc.
> > Once you have that, it's fairly straightforward to translate that into
> > regex syntax.
>
> > ChrisA
>
> The OP told me, off list, that my guess was true:
>
> > Can we say that your string:
> > 1) Contains 7 colon-delimited fields, followed by
> > 2) whitespace, followed by
> > 3) 3 colon-delimited fields (A, B, C), followed by
> > 4) a colon?
> > The transformation needed is that the whitespace is replaced by a
> > slash, the "A" characters are taken as is, and the colons and fields
> > following the "A" characters are eliminated?
>
> Doubtful that my guess was 100% accurate, but nevertheless:
>
> >>> import re
> >>> string1 = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"
> >>> re.sub(r"(\S+)\s+(\S+?):.+", "\g<1>/\g<2>", string1)
>
> '@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1'

Thanks a lot everyone.

Can anyone suggest a good place to learn REs?

ML

#18002

From	Jason Friedman <jason@powerpull.net>
Date	2011-12-27 01:26 +0000
Message-ID	<mailman.4121.1324949168.27778.python-list@python.org>
In reply to	#17998

> Thanks a lot everyone.
>
> Can anyone suggest a good place to learn REs?

Start with the manual:
http://docs.python.org/py3k/library/re.html#module-re
