Re: Regular expressions

Date	2011-12-27 06:01 +0100
From	Fredrik Tolf <fredrik@dolda2000.com>
Subject	Re: Regular expressions
References	<495b6fe6-704a-42fc-b10b-484218ad8409@b20g2000pro.googlegroups.com> <mailman.4117.1324944010.27778.python-list@python.org> <4be34afe-4291-414b-9212-498074400e39@v24g2000prn.googlegroups.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.4123.1324962099.27778.python-list@python.org>

On Mon, 26 Dec 2011, mauriceling@acm.org wrote:
> I've tried
>
> re.sub('@\S\s[1-9]:[A-N]:[0-9]', '@\S\s', '@HWI-ST115:568:B08LLABXX:
> 1:1105:6465:151103 1:N:0:')
>
> but it does not seems to work.

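Indeed, for several reasons. First of all, backslash sequences in ordinary
string literals go through Python's string-escape processing. "\S" and "\s"
happen to survive because Python does not recognize them as escapes, but a
group reference such as "\1" would be turned into a control character.
You'll need to write either "\\S" or r"\S" (the r, for raw, turns off
backslash escapes).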
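
A quick way to see the difference, as a small sketch (the variable names
are only for illustration):

```python
# In an ordinary literal, "\1" is an octal escape: one character, chr(1).
plain = "\1"
# In a raw literal the backslash is kept, giving two characters.
raw = r"\1"

print(len(plain), len(raw))

# Doubling the backslash and using a raw literal produce the same string.
same = "\\S" == r"\S"
print(same)
```

Raw literals are usually the easier habit for regexes, since the pattern
then reads the same in the source as it does to the re module.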

Second, when you use only "\S", that matches a single non-space character, 
not several; you'll need to quantify them. "\S*" will match zero or more, 
"\S+" will match one or more, "\S?" will match zero or one, and there are 
a couple of other possibilities as well (see the manual for details). In 
this case, you probably want to use "+" for most of those.
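
As a rough illustration of those quantifiers (the sample text is made up
for this sketch):

```python
import re

text = "abc def"

# No quantifier: exactly one non-space character.
one = re.match(r"\S", text).group()
# "+": one or more, as many as possible.
one_or_more = re.match(r"\S+", text).group()
# "*": zero or more, so it can succeed while matching nothing.
zero_or_more = re.match(r"\S*", " abc").group()
# "?": zero or one.
zero_or_one = re.match(r"\S?", text).group()

print(repr(one), repr(one_or_more), repr(zero_or_more), repr(zero_or_one))
```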

Third, you're not marking the groups that you want to use in the 
replacement. Since you want to retain the entire string before the space, 
and the numeric element, you'll want to enclose them in parentheses to 
mark them as groups.
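
To check what the parentheses capture, something along these lines should
work (a sketch using your sample string; the variable names are mine):

```python
import re

header = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"

# Two parenthesized groups: everything up to the space, then the number.
m = re.search(r"@(\S+)\s([1-9]+)", header)

before_space = m.group(1)
number = m.group(2)

print(before_space)
print(number)
```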

Fourth, your replacement string is entirely wacky. You don't use sequences
such as "\S" and "\s" to refer back to groups in the original text, but
numbered references, which refer back to the parenthesized groups in the
order they appear in the regex. In accordance with what you seemed to want,
you should probably use r"@\1/\2" in your case ("\1" refers back to the
first parenthesized group, which would be the first "\S+" part, and "\2" to
the second group, which should be the "[1-9]+" part; the at-mark and slash
are inserted as they are into the result string).
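
A smaller example of numbered references, unrelated to your data and only
meant as a sketch:

```python
import re

# "\1" and "\2" in the replacement stand for the first and second group.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# "\g<N>" is an equivalent spelling, handy when a digit follows the reference.
swapped_g = re.sub(r"(\w+) (\w+)", r"\g<2> \g<1>", "hello world")

print(swapped)
print(swapped_g)
```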

Fifth, you'll probably want to match the last colon as well, so that it is
not retained in the result string.
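
The effect of the final colon can be seen by comparing two patterns that
differ only in that character (a sketch on your sample string):

```python
import re

header = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"

# Without the final colon in the pattern, it is left over after the match.
without_colon = re.sub(r"@(\S+)\s([1-9]+):[A-N]+:[0-9]+", r"@\1/\2", header)

# With it, the colon is part of the match and gets replaced along with the rest.
with_colon = re.sub(r"@(\S+)\s([1-9]+):[A-N]+:[0-9]+:", r"@\1/\2", header)

print(without_colon)
print(with_colon)
```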

All in all, you will probably want to use something like this to correct 
that regex:

re.sub(r'@(\S+)\s([1-9]+):[A-N]+:[0-9]+:', r'@\1/\2',
        '@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')
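
If you need to apply this to many lines, one possible arrangement is to
compile the pattern once and wrap the substitution in a function. The names
HEADER_RE and convert_header are hypothetical, chosen only for this sketch;
the pattern and replacement are the ones given above.

```python
import re

# Hypothetical names; the regex itself is unchanged from the call above.
HEADER_RE = re.compile(r'@(\S+)\s([1-9]+):[A-N]+:[0-9]+:')

def convert_header(line):
    """Rewrite '@<id> <n>:<letters>:<digits>:' as '@<id>/<n>'.

    Lines that do not match the pattern are returned unchanged.
    """
    return HEADER_RE.sub(r'@\1/\2', line)

converted = convert_header('@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')
untouched = convert_header('ACGTACGT')

print(converted)
print(untouched)
```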

Also, you may be interested to know that you can use "\d" instead of 
"[0-9]".

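A short check of that equivalence, with one caveat that applies if you are
on Python 3: for str patterns, "\d" there also matches decimal digits from
other scripts unless the re.ASCII flag is given. This is only a sketch, and
it assumes Python 3.

```python
import re

sample = "1:N:0:"

# On plain ASCII input the two spellings find the same digits.
with_class = re.findall(r"[0-9]", sample)
with_d = re.findall(r"\d", sample)
print(with_class, with_d)

# Python 3 str patterns: "\d" also accepts e.g. ARABIC-INDIC DIGIT THREE.
arabic_three = "\u0663"
d_matches = re.fullmatch(r"\d", arabic_three) is not None
class_matches = re.fullmatch(r"[0-9]", arabic_three) is not None
d_ascii_matches = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
print(d_matches, class_matches, d_ascii_matches)
```
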
--

Fredrik Tolf
