Groups > comp.lang.python > #36153 > unrolled thread

Need a specific sort of string modification. Can someone help?

Started by: Sia <hossein.asgharian@gmail.com>
First post: 2013-01-05 00:35 -0800
Last post: 2013-01-06 23:07 -0800
Articles: 5 on this page of 25, 10 participants



Contents

  Need a specific sort of string modification. Can someone help? Sia <hossein.asgharian@gmail.com> - 2013-01-05 00:35 -0800
    Re: Need a specific sort of string modification. Can someone help? Frank Millman <frank@chagford.com> - 2013-01-05 11:15 +0200
    Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-05 20:27 +1100
      Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 09:30 -0500
        Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-06 01:47 +1100
          Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 10:03 -0500
            Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-06 02:09 +1100
              Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 10:38 -0500
                Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-06 02:57 +1100
                Re: Need a specific sort of string modification. Can someone help? Ian Kelly <ian.g.kelly@gmail.com> - 2013-01-05 13:04 -0700
                Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-06 07:32 +1100
                  Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 15:47 -0500
                    Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-06 12:28 -0500
                      Re: Need a specific sort of string modification. Can someone help? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-06 23:19 +0000
    Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 09:12 -0500
    Re: Need a specific sort of string modification. Can someone help? Tim Chase <python.list@tim.thechases.com> - 2013-01-05 11:24 -0600
    Re: Need a specific sort of string modification. Can someone help? Tim Chase <python.list@tim.thechases.com> - 2013-01-05 12:49 -0600
    Re: Need a specific sort of string modification. Can someone help? Mitya Sirenef <msirenef@lightbird.net> - 2013-01-06 01:32 -0500
    Re: Need a specific sort of string modification. Can someone help? Mitya Sirenef <msirenef@lightbird.net> - 2013-01-06 14:53 -0500
    Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 18:48 -0800
    Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 19:40 -0800
      Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 21:28 -0800
    Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 21:30 -0800
    Re: Need a specific sort of string modification. Can someone help? John Ladasky <john_ladasky@sbcglobal.net> - 2013-01-06 21:39 -0800
    Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 23:07 -0800



#36320

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2013-01-06 19:40 -0800
Message-ID: <18ae74e7-780d-48d6-9575-b5f85541edfe@googlegroups.com>
In reply to: #36153

Hi Sia,

Find a multi-digit method in this version:

from string import maketrans
from itertools import takewhile

def is_digit(s): return s.isdigit()

class redux:

    def __init__(self):
        intab = '+-'
        outtab = '  '
        self.trantab = maketrans(intab, outtab)


    def reduce_plusminus(self, s):
        list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
                    for r
                    in s.translate(self.trantab).split()]
        return ''.join(list_form)

    def reduce_plusminus_multi_digit(self, s):
        spl = s.translate(self.trantab).split()
        digits = [list(takewhile(is_digit, r))
                   for r
                   in spl]
        numbers = [int(''.join(r)) if r else 0
                   for r
                    in digits]
        skips = [len(dig) + num for dig, num in zip(digits, numbers)]
        return ''.join([s[r:] for r, s in zip(skips, spl)])

if __name__ == "__main__":
    p = redux()
    print p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")
    print p.reduce_plusminus("tA.-2AG.-2AG,-2ag")
    print 'multi-digit...'
    print p.reduce_plusminus_multi_digit(".+3ACG.+5CAACG.+3ACG.+3ACG")
    print p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG")


HTH,

Nick

On Saturday, 5 January 2013 19:35:26 UTC+11, Sia  wrote:
> I have strings such as:
> 
> 
> 
> tA.-2AG.-2AG,-2ag
> 
> or
> 
> .+3ACG.+5CAACG.+3ACG.+3ACG
> 
> 
> 
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
> 
> 
> 
> tA..,
> 
> and
> 
> ...
> 
> 
> 
> How can I do that?
> 
> Thanks.



#36322

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2013-01-06 21:28 -0800
Message-ID: <101cfe8d-a1e8-4491-8b15-42c4927ccebb@googlegroups.com>
In reply to: #36320

Note that the multi-digit version above tolerates missing digits: if the number is missing after the '+/-', it doesn't skip any letters.

Brief explanation of the multi-digit version:

+/- are converted to spaces and used to split the string into sections. The split process effectively swallows the +/- characters.
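
To make the splitting step concrete, here is a small sketch of it on its own. It is written for Python 3, which is an assumption on my part: the posted code is Python 2, and str.maketrans stands in for the string.maketrans import. The helper name split_on_signs is only for illustration.

```python
# Python 3 sketch: str.maketrans replaces the Python 2 string.maketrans import.
SIGNS_TO_SPACES = str.maketrans('+-', '  ')

def split_on_signs(s):
    """Turn every '+' or '-' into a space, then split on whitespace."""
    return s.translate(SIGNS_TO_SPACES).split()

print(split_on_signs(".+3ACG.+5CAACG.+3ACG.+3ACG"))
# ['.', '3ACG.', '5CAACG.', '3ACG.', '3ACG']
print(split_on_signs("tA.-2AG.-2AG,-2ag"))
# ['tA.', '2AG.', '2AG,', '2ag']
```

Every chunk after a sign now starts with its count (if there is one), which is what the later stages rely on.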

The complication of multi-digits is that you need to skip the (possibly multiple) digits, which adds another stage to the calculation. In:

+3ACG. -> .

you skip 1 + 3 characters, 1 for the digit, 3 for the following letters as specified by the digit 3. In:

-11ACGACGACGACG. -> G.

You skip 2 + 11 characters, 2 digits in "12" and 11 letters following. And incidentally in:

+ACG. -> ACG.

there's no digit, so you skip 0 digits + 0 letters.
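
The skip arithmetic in the three cases above can be checked with a few lines. This is a sketch for Python 3, not the code from the post; skip_length is a made-up helper name and it assumes plain ASCII digits.

```python
def skip_length(chunk):
    """Return how many leading characters to drop from one chunk.

    That is the number of leading digits plus the value they spell out,
    or 0 when the chunk has no leading digits at all.
    """
    n_digits = 0
    while n_digits < len(chunk) and chunk[n_digits].isdigit():
        n_digits += 1
    count = int(chunk[:n_digits]) if n_digits else 0
    return n_digits + count

for chunk in ("3ACG.", "11ACGACGACGACG.", "ACG."):
    skip = skip_length(chunk)
    print(chunk, "skip", skip, "->", chunk[skip:])
# 3ACG. skip 4 -> .
# 11ACGACGACGACG. skip 13 -> G.
# ACG. skip 0 -> ACG.
```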

Having split on +/- using .translate() and .split(), I use takewhile to separate the zero or more digits from the following letters. If takewhile doesn't find any digits at the start of the sequence, list(takewhile(...)) is the empty list []. The 'if r else 0' test maps that empty list to a number of 0, so takewhile and that test cover the no-digit case between them. If a lack of digits is a data error then it would be easy to test for: just look for an empty list in 'digits'.
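
For anyone who hasn't met takewhile, a quick sketch of what it does with a chunk (Python 3 here, which is my assumption; leading_digits is an illustrative name, and str.isdigit plays the role of the is_digit helper in the post):

```python
from itertools import takewhile

def leading_digits(chunk):
    # takewhile yields items lazily and stops at the first non-digit;
    # list() materialises whatever it produced.
    return list(takewhile(str.isdigit, chunk))

print(leading_digits("12ACG."))  # ['1', '2']
print(leading_digits("ACG."))    # []

# An empty list joins to an empty string, and int('') raises ValueError,
# which is why the no-digit case needs its own branch.
print(repr(''.join(leading_digits("ACG."))))  # ''
```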

I was pleasantly surprised to find that using list comprehensions, zip, join (all highly optimised in Python) and several intermediate lists still works at a fairly decent speed, despite using more stages to handle multi-digits. But it is about 4x slower than the less flexible 1-digit version on my hardware (about 25,000 per second.)

Nick



#36323

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2013-01-06 21:30 -0800
Message-ID: <fe23a0f7-1625-4215-9ee5-955c53532abc@googlegroups.com>
In reply to: #36153

Oops!

"You skip 2 + 11 characters, 2 digits in "12" and 11 letters following. And incidentally in: "

should read:

"You skip 2 + 11 characters, 2 digits in "11" and 11 letters following. And incidentally in: "

N




#36324

From: John Ladasky <john_ladasky@sbcglobal.net>
Date: 2013-01-06 21:39 -0800
Message-ID: <0781b010-1640-43dd-ab67-0a1055db61bc@googlegroups.com>
In reply to: #36153

On Saturday, January 5, 2013 12:35:26 AM UTC-8, Sia wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> 
> .+3ACG.+5CAACG.+3ACG.+3ACG

Just curious, do these strings represent DNA sequences?



#36327

From: Nick Mellor <thebalancepro@gmail.com>
Date: 2013-01-06 23:07 -0800
Message-ID: <8a524329-3c3f-4575-8296-16f9567e1202@googlegroups.com>
In reply to: #36153

Hi Sia, 

Thanks for the problem! I hope you find these examples understandable.

Below, find an inflexible but fairly fast single-digit method and a slower (but still respectable) multi-digit method. The multi-digit method copes with entirely absent digits after +/- and with multi-digit skips such as 12, 37 or 186 following letters, rather than just 3, 5 or 9. Not knowing your problem domain, I'm not sure whether these larger skips are likely or even possible, nor whether a missing digit is likely.

Neither method flags the edge case where there are not enough letters to skip before the next +/-. They just swallow that chunk of the string entirely without error. Is that a problem?
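
If that edge case does matter, one possible check looks like the sketch below. It is Python 3 and not part of the methods in this post; find_short_chunks is a hypothetical helper, and it assumes "not enough letters" simply means the count is larger than what is left in the chunk after the digits.

```python
def find_short_chunks(s):
    """Return (chunk, count, available) for chunks too short for their count."""
    chunks = s.replace('+', ' ').replace('-', ' ').split()
    short = []
    for chunk in chunks:
        # Leading digits are whatever lstrip() would remove from the front.
        digits = chunk[:len(chunk) - len(chunk.lstrip('0123456789'))]
        count = int(digits) if digits else 0
        available = len(chunk) - len(digits)
        if count > available:
            short.append((chunk, count, available))
    return short

print(find_short_chunks("tA.-5AG.-2AG,-2ag"))  # [('5AG.', 5, 3)]
print(find_short_chunks("tA.-2AG.-2AG,-2ag"))  # []
```

Running a check like this before reducing would let you raise an error or log the offending chunk instead of silently losing it.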

Python 2.7 (string.maketrans is not available in current Python 3, where str.maketrans takes its place).

from string import maketrans
from itertools import takewhile

# function used by takewhile to detect digits
def is_digit(s): return s.isdigit()

class redux:

    def __init__(self):
        intab = '+-'
        outtab = '  '
        self.trantab = maketrans(intab, outtab)


    # simple-minded, fast, 1-digit algorithm
    def reduce_plusminus(self, s):
        list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
                    for r
                    in s.translate(self.trantab).split()]
        return ''.join(list_form)

    # multi-digit algorithm
    def reduce_plusminus_multi_digit(self, s):
        chunks = s.translate(self.trantab).split()
        digits = [list(takewhile(is_digit, r))
                   for r
                   in chunks]
        # zero case ('else 0') is for missing digit(s)
        numbers = [int(''.join(r)) if r else 0
                   for r
                    in digits]
        # how far to skip (number and skipped letters)
        skips = [len(dig) + num
                 for dig, num
                 in zip(digits, numbers)]
        return ''.join([sequence[skipover:]
                        for skipover, sequence
                        in zip(skips, chunks)])

if __name__ == "__main__":
    p = redux()
    print (p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG"))
    print (p.reduce_plusminus("tA.-5AG.-2AG,-2ag"))
    print ('multi-digit...')
    print (p.reduce_plusminus_multi_digit(".+3ACG.+5CAACG.+3ACG.+3ACG"))
    print (p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG"))
    print (p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+ACG.+3ACG"))

    for n in range(100000): p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG")
    
    
Note that the multi-digit version above tolerates missing digits: if the number is missing after the '+/-', it doesn't skip any letters.

Explanation:

String slicing is good in Python, but list comprehensions are often the most powerful and efficient way to transform data. If you've come from another language or you're a newb to programming, it's well worth getting to grips with list comprehensions, e.g. [f(x) for x in lst].
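
A tiny illustration of the list comprehension pattern, using chunks like the ones this problem produces. The variable names are only for illustration, and it runs on Python 3 as written:

```python
chunks = ['.', '3ACG.', '5CAACG.', '3ACG.', '3ACG']

# Plain form: one output item per input item.
lengths = [len(c) for c in chunks]
print(lengths)  # [1, 5, 7, 5, 4]

# With a conditional expression: drop the digit and that many letters
# when the chunk starts with a digit, otherwise keep the chunk as it is.
trimmed = [c[int(c[0]) + 1:] if c[0].isdigit() else c for c in chunks]
print(trimmed)           # ['.', '.', '.', '.', '']
print(''.join(trimmed))  # ....
```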

First, +/- are converted to spaces. Splitting the string into a list, breaking at spaces, effectively swallows the +/- characters and leaves us with chunks of string from which to drop the specified number of letters. Having dropped the letters, we convert the transformed list back into a string.

The single-digit version just assumes that the number is always present and always single-digit. If you allow multi-digits, be ready to skip those multiple digits as well as the letters, which adds another stage to the calculation. E.g. in:

+3ACG. -> . 

you skip 1 + 3 characters, 1 for the digit, 3 for the following letters as specified by the digit 3. In: 

-11ACGACGACGACG. -> G. 

You skip 2 + 11 characters, 2 characters in "11" and 11 characters following. And incidentally in: 

+ACG. -> ACG. 

there's no digit, so you skip 0 digits + 0 letters. If a lack of digits is a data error then it would be easy to test for-- just look for an empty list in the 'digits' list.
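
One wrinkle if you do test for it: the text before the first +/- is a chunk too, and it normally has no digits, so it would be flagged by mistake. A possible sketch of the check in Python 3 (chunks_missing_digits is a hypothetical name, not something from the code above):

```python
from itertools import takewhile

def chunks_missing_digits(s):
    """Return the chunks that follow a sign but do not start with a digit."""
    chunks = s.replace('+', ' ').replace('-', ' ').split()
    # Text before the first sign is not a count chunk, so skip it
    # unless the string itself starts with a sign.
    if s and s[0] not in '+-':
        chunks = chunks[1:]
    return [c for c in chunks if not list(takewhile(str.isdigit, c))]

print(chunks_missing_digits(".+12ACGACGACGACG.+5CAACG.+ACG.+3ACG"))  # ['ACG.']
print(chunks_missing_digits(".+3ACG.+5CAACG"))                       # []
```

Two signs in a row would slip past this check, because split() drops the empty chunk between them.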

For each chunk (stuff between +/-) I use takewhile to separate the zero or more initial digits from the following letters. itertools.takewhile is a nice tool for matching stuff until the first time a condition ceases to hold. In this case I'm matching digits up until the first non-digit.

If takewhile doesn't find any digits at the start of the sequence, list(takewhile(...)) is the empty list []. The 'if r else 0' test turns that empty list into a number of zero, so takewhile and that test do the behind-the-scenes leg-work for me in detecting the case of absent digit(s).

For my money speed isn't bad for these two methods (again, I don't know your domain.) On my (ancient) hardware the 1-digit version does over 100,000/s for your longer string, the multi-digit about 35,000/s.
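
If you want to measure this on your own machine, something like the following sketch would do. It is a compact Python 3 rendering of the multi-digit idea rather than the class above, the name reduce_multi_digit is made up, and the rate it prints will depend entirely on your hardware and Python version.

```python
import timeit
from itertools import takewhile

TABLE = str.maketrans('+-', '  ')

def reduce_multi_digit(s):
    chunks = s.translate(TABLE).split()
    # Leading digits of each chunk, as a string ('' when there are none).
    digits = [''.join(takewhile(str.isdigit, c)) for c in chunks]
    # Skip the digits themselves plus the number of letters they specify.
    skips = [len(d) + (int(d) if d else 0) for d in digits]
    return ''.join(c[k:] for k, c in zip(skips, chunks))

sample = ".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG"
calls = 100000
seconds = timeit.timeit(lambda: reduce_multi_digit(sample), number=calls)
print("%.0f calls per second" % (calls / seconds))
```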

HTH,

Nick

