Groups > comp.lang.python > #36153 > unrolled thread

Need a specific sort of string modification. Can someone help?

Started by	Sia <hossein.asgharian@gmail.com>
First post	2013-01-05 00:35 -0800
Last post	2013-01-06 23:07 -0800
Articles	20 on this page of 25 — 10 participants

Back to article view | Back to comp.lang.python

  Need a specific sort of string modification. Can someone help? Sia <hossein.asgharian@gmail.com> - 2013-01-05 00:35 -0800
    Re: Need a specific sort of string modification. Can someone help? Frank Millman <frank@chagford.com> - 2013-01-05 11:15 +0200
    Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-05 20:27 +1100
      Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 09:30 -0500
        Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-06 01:47 +1100
          Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 10:03 -0500
            Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-06 02:09 +1100
              Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 10:38 -0500
                Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-06 02:57 +1100
                Re: Need a specific sort of string modification. Can someone help? Ian Kelly <ian.g.kelly@gmail.com> - 2013-01-05 13:04 -0700
                Re: Need a specific sort of string modification. Can someone help? Chris Angelico <rosuav@gmail.com> - 2013-01-06 07:32 +1100
                  Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 15:47 -0500
                    Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-06 12:28 -0500
                      Re: Need a specific sort of string modification. Can someone help? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-06 23:19 +0000
    Re: Need a specific sort of string modification. Can someone help? Roy Smith <roy@panix.com> - 2013-01-05 09:12 -0500
    Re: Need a specific sort of string modification. Can someone help? Tim Chase <python.list@tim.thechases.com> - 2013-01-05 11:24 -0600
    Re: Need a specific sort of string modification. Can someone help? Tim Chase <python.list@tim.thechases.com> - 2013-01-05 12:49 -0600
    Re: Need a specific sort of string modification. Can someone help? Mitya Sirenef <msirenef@lightbird.net> - 2013-01-06 01:32 -0500
    Re: Need a specific sort of string modification. Can someone help? Mitya Sirenef <msirenef@lightbird.net> - 2013-01-06 14:53 -0500
    Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 18:48 -0800
    Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 19:40 -0800
      Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 21:28 -0800
    Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 21:30 -0800
    Re: Need a specific sort of string modification. Can someone help? John Ladasky <john_ladasky@sbcglobal.net> - 2013-01-06 21:39 -0800
    Re: Need a specific sort of string modification. Can someone help? Nick Mellor <thebalancepro@gmail.com> - 2013-01-06 23:07 -0800

Page 1 of 2 [1] 2 Next page →

#36153 — Need a specific sort of string modification. Can someone help?

From	Sia <hossein.asgharian@gmail.com>
Date	2013-01-05 00:35 -0800
Subject	Need a specific sort of string modification. Can someone help?
Message-ID	<e480480d-f3b4-4491-969c-7d1843bf9e33@googlegroups.com>

I have strings such as:

tA.-2AG.-2AG,-2ag
or
.+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:

tA..,
and
...

How can I do that?
Thanks.

[toc] | [next] | [standalone]

#36154

From	Frank Millman <frank@chagford.com>
Date	2013-01-05 11:15 +0200
Message-ID	<mailman.108.1357377381.2939.python-list@python.org>
In reply to	#36153

On 05/01/2013 10:35, Sia wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...
>
> How can I do that?
> Thanks.
>

Here is a naive solution (I am sure there are more elegant ones) -

def strip(old_string):
     new_string = ''
     max = len(old_string)
     pos = 0
     while pos < max:
         char = old_string[pos]
         if char in ('-', '+'):
             num_pos = pos+1
             num_str = ''
             while old_string[num_pos].isdigit():
                 num_str += old_string[num_pos]
                 num_pos += 1
             pos = num_pos + int(num_str)
         else:
             new_string += old_string[pos]
             pos += 1
     return new_string

It caters for the possibility that the number following the +/- could be 
greater than 9 - I don't know if you need that.

It works with your examples, except that the second one returns '....', 
which I think is correct - there are 4 dots in the original string.

HTH

Frank Millman

[toc] | [prev] | [next] | [standalone]

#36155

From	Chris Angelico <rosuav@gmail.com>
Date	2013-01-05 20:27 +1100
Message-ID	<mailman.109.1357378077.2939.python-list@python.org>
In reply to	#36153

On Sat, Jan 5, 2013 at 7:35 PM, Sia <hossein.asgharian@gmail.com> wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...
>
> How can I do that?

Interesting. Are you guaranteed that there are no other plus or minus
signs? Is the number after the sign just one digit? Assuming the
answers to both are "yes", here's a one-liner:

s=".+3ACG.+5CAACG.+3ACG.+3ACG"

result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

Split on either - or +, then trim off that many characters from each
(that's the list comp), then join them back into a string.

ChrisA

[toc] | [prev] | [next] | [standalone]

#36175

From	Roy Smith <roy@panix.com>
Date	2013-01-05 09:30 -0500
Message-ID	<roy-564DD8.09304505012013@news.panix.com>
In reply to	#36155

In article <mailman.109.1357378077.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

That's exceedingly clever.  But bordering on line noise.  At the very 
least, I would break it up into a couple of lines to make it easier to 
understand (plus you can print out the intermediate values to see what's 
going on):

chunks = ("0"+s).replace("-","+").split("+")
result = "".join([x[int(x[0])+1:] for x in chunks]

[toc] | [prev] | [next] | [standalone]

#36177

From	Chris Angelico <rosuav@gmail.com>
Date	2013-01-06 01:47 +1100
Message-ID	<mailman.120.1357397255.2939.python-list@python.org>
In reply to	#36175

On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.109.1357378077.2939.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])
>
> That's exceedingly clever.  But bordering on line noise.  At the very
> least, I would break it up into a couple of lines to make it easier to
> understand (plus you can print out the intermediate values to see what's
> going on):
>
> chunks = ("0"+s).replace("-","+").split("+")
> result = "".join([x[int(x[0])+1:] for x in chunks]

Sure. You can always split a one-liner to taste, doesn't much matter
where. You could split it majorly into half a dozen lines if you want,
doesn't make a lot of diff. It'll work the same way :)

ChrisA

[toc] | [prev] | [next] | [standalone]

#36178

From	Roy Smith <roy@panix.com>
Date	2013-01-05 10:03 -0500
Message-ID	<roy-00F3AB.10030605012013@news.panix.com>
In reply to	#36177

In article <mailman.120.1357397255.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <roy@panix.com> wrote:
> > In article <mailman.109.1357378077.2939.python-list@python.org>,
> >  Chris Angelico <rosuav@gmail.com> wrote:
> >
> >> result = "".join([x[int(x[0])+1:] for x in 
> >> ("0"+s).replace("-","+").split("+")])
> >
> > That's exceedingly clever.  But bordering on line noise.  At the very
> > least, I would break it up into a couple of lines to make it easier to
> > understand (plus you can print out the intermediate values to see what's
> > going on):
> >
> > chunks = ("0"+s).replace("-","+").split("+")
> > result = "".join([x[int(x[0])+1:] for x in chunks]
> 
> Sure. You can always split a one-liner to taste, doesn't much matter
> where.

Well, there are better and worse places to split things.  Consider these 
three statements:

result = f1().f2().f3().f4().f5()
result = f1(f2.f3(f4().f5()))
result = f1(f2(3(f4(f5()))))

Same number of function calls, but the middle one is clearly the most 
difficult to understand.  The reason is because the scan direction keeps 
changing.  The first one is strictly left-to-right.  The last one is 
strictly inside-to-outside.  The middle one is all over the place.

That's why I chose to split this where I did.  It was where the scan 
direction changed.

Readability matters.

[toc] | [prev] | [next] | [standalone]

#36179

From	Chris Angelico <rosuav@gmail.com>
Date	2013-01-06 02:09 +1100
Message-ID	<mailman.121.1357398573.2939.python-list@python.org>
In reply to	#36178

On Sun, Jan 6, 2013 at 2:03 AM, Roy Smith <roy@panix.com> wrote:
> That's why I chose to split this where I did.  It was where the scan
> direction changed.

Ah, good point. In any case, this is a fairly simple and clear way of
doing things; it may or may not run faster than the explicit state
machine, but IMHO it's a lot clearer to read a split() than something
that changes state when a particular character is found.

ChrisA

[toc] | [prev] | [next] | [standalone]

#36183

From	Roy Smith <roy@panix.com>
Date	2013-01-05 10:38 -0500
Message-ID	<roy-088CDF.10385405012013@news.panix.com>
In reply to	#36179

In article <mailman.121.1357398573.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> it may or may not run faster than the explicit state machine,

Hmmm, hard to say.  Both are O(n), but yours makes several more copies 
of the data than mine (the string addition, the replace(), the split(), 
the string slicing).  We both make copies as we put fragments into a 
list, and when we do the join().

On the other hand, yours does all that inside the library, which is 
generally faster than random python code.

Let's see:

My way:
real    0m2.097s
user    0m2.078s
sys     0m0.016s

Your way:
real    0m0.649s
user    0m0.622s
sys     0m0.025s

You got me by a factor of 3 or 4.  Not bad.  And not terribly 
surprising.  Once you've got things to the same O() group, pushing as 
much data manipulation as you can down into the library is usually a 
pretty effective optimization technique.

> but IMHO it's a lot clearer to read a split() than something
> that changes state when a particular character is found.

Maybe.  But being familiar with state machines is still a handy skill.  
DNA sequence analysis has lots of problems like "find a start codon 
which is within about 50 bases of a binding site, and then copy 
everything up until you find a stop codon".  Things like that often map 
well to state machines.  Especially if you're trying to do it in 
parallel in all three reading frames.

[toc] | [prev] | [next] | [standalone]

#36189

From	Chris Angelico <rosuav@gmail.com>
Date	2013-01-06 02:57 +1100
Message-ID	<mailman.127.1357401452.2939.python-list@python.org>
In reply to	#36183

On Sun, Jan 6, 2013 at 2:38 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.121.1357398573.2939.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> it may or may not run faster than the explicit state machine,
>
> You got me by a factor of 3 or 4.  Not bad.

You miss my point, though. I went for simple Pythonic code, and never
measured its performance, on the expectation that it's "good enough".
Written in C, the state machine is probably WAY faster than splitting
and then iterating. My C++ MUD client uses code similar to that to
parse TELNET and ANSI codes from a stream of bytes in a socket (and
one of its "states" is that there's no more data available, so wait on
the socket); the rewrite in a high level language divides the string
on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
then afterward splits on "\n" to divide into lines. The code's much
less convoluted, it's easier to test different parts (because I can
simply call the ANSI parser with a block of text), and on a modern
computer, you can't see the performance difference (since you spend
most of your time waiting for socket data anyway).

But it's gratifying to know that the obvious and brief way to do
things is fast too :)

>> but IMHO it's a lot clearer to read a split() than something
>> that changes state when a particular character is found.
>
> Maybe.  But being familiar with state machines is still a handy skill.
> DNA sequence analysis has lots of problems like "find a start codon
> which is within about 50 bases of a binding site, and then copy
> everything up until you find a stop codon".  Things like that often map
> well to state machines.  Especially if you're trying to do it in
> parallel in all three reading frames.

Sure. And if you're working with petabytes of data, these
considerations become fairly important. When that happens, you start
rewriting your algorithms in C, or using Cython, or something; at very
least, you start rewriting clear and simple algorithms into more
complex ones. But all this happens *after* the code has been tested
and proven. All the rewrites can be verified as being identical to
their reference implementations; you can test one piece at a time as
you change them. It's ever so much easier to work that way.

<anecdote>At work, we had one employee whose code was, shall we say,
less than stellar. At one point, he embarked on a months-long rewrite
of one of his modules; meanwhile, I was unable to adequately test code
that called on it. Once the rewrite was finally complete, I discovered
myriad bugs in my own code, ones that would have been found and fixed
instantly if I'd had even a slow version of the code to work against.
Starting with something you can easily debug helps enormously with
that, because debugging doesn't demand mega-TPS throughput.</anecdote>

ChrisA

[toc] | [prev] | [next] | [standalone]

#36205

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2013-01-05 13:04 -0700
Message-ID	<mailman.141.1357416824.2939.python-list@python.org>
In reply to	#36183

On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <rosuav@gmail.com> wrote:
> You miss my point, though. I went for simple Pythonic code, and never
> measured its performance, on the expectation that it's "good enough".
> Written in C, the state machine is probably WAY faster than splitting
> and then iterating. My C++ MUD client uses code similar to that to
> parse TELNET and ANSI codes from a stream of bytes in a socket (and
> one of its "states" is that there's no more data available, so wait on
> the socket); the rewrite in a high level language divides the string
> on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
> then afterward splits on "\n" to divide into lines. The code's much
> less convoluted, it's easier to test different parts (because I can
> simply call the ANSI parser with a block of text), and on a modern
> computer, you can't see the performance difference (since you spend
> most of your time waiting for socket data anyway).

Anecdotally and somewhat off-topic, when I wrote my own MUD client in
Python, I implemented both TELNET and ANSI parsing in Python using a
state machine processing one byte at a time (actually two state
machines - one at the protocol layer and one at the client layer; the
telnet module is a modified version of the twisted.conch.telnet
module).  I was worried that the processing would be too slow in
Python.  When I got it running, it turned out that there was a
noticeable lag between input being received and displayed.  But when I
profiled the issue it actually turned out that the rich text control I
was using, which was written in C++, wasn't able to handle a large
buffer gracefully.  The parsing code in Python did just fine and
didn't contribute to the lag issue at all.

[toc] | [prev] | [next] | [standalone]

#36206

From	Chris Angelico <rosuav@gmail.com>
Date	2013-01-06 07:32 +1100
Message-ID	<mailman.142.1357417943.2939.python-list@python.org>
In reply to	#36183

On Sun, Jan 6, 2013 at 7:04 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <rosuav@gmail.com> wrote:
>> You miss my point, though. I went for simple Pythonic code, and never
>> measured its performance, on the expectation that it's "good enough".
>> Written in C, the state machine is probably WAY faster than splitting
>> and then iterating. My C++ MUD client uses code similar to that to
>> parse TELNET and ANSI codes from a stream of bytes in a socket (and
>> one of its "states" is that there's no more data available, so wait on
>> the socket); the rewrite in a high level language divides the string
>> on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
>> then afterward splits on "\n" to divide into lines. The code's much
>> less convoluted, it's easier to test different parts (because I can
>> simply call the ANSI parser with a block of text), and on a modern
>> computer, you can't see the performance difference (since you spend
>> most of your time waiting for socket data anyway).
>
> Anecdotally and somewhat off-topic, when I wrote my own MUD client in
> Python, I implemented both TELNET and ANSI parsing in Python using a
> state machine processing one byte at a time (actually two state
> machines - one at the protocol layer and one at the client layer; the
> telnet module is a modified version of the twisted.conch.telnet
> module).  I was worried that the processing would be too slow in
> Python.  When I got it running, it turned out that there was a
> noticeable lag between input being received and displayed.  But when I
> profiled the issue it actually turned out that the rich text control I
> was using, which was written in C++, wasn't able to handle a large
> buffer gracefully.  The parsing code in Python did just fine and
> didn't contribute to the lag issue at all.

The opposite decision, the same result. Performance was *good enough*.
With Gypsum, my early performance problems were almost completely
solved by properly handling the expose-event and only painting the
parts that changed (I don't use a rich text control, I use a
GTK2.DrawingArea). Text processing of something that's come over a
network is seldom a problem; if it were, we'd see noticeably slower
downloads over HTTPS than HTTP, and that just doesn't happen. I think
(though I can't prove) that crypto written in C is probably more
expensive than ANSI parsing written in Python.

ChrisA

[toc] | [prev] | [next] | [standalone]

#36209

From	Roy Smith <roy@panix.com>
Date	2013-01-05 15:47 -0500
Message-ID	<roy-103D43.15470305012013@news.panix.com>
In reply to	#36206

In article <mailman.142.1357417943.2939.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> On Sun, Jan 6, 2013 at 7:04 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> > On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <rosuav@gmail.com> wrote:
> >> You miss my point, though. I went for simple Pythonic code, and never
> >> measured its performance, on the expectation that it's "good enough".
> >> Written in C, the state machine is probably WAY faster than splitting
> >> and then iterating. My C++ MUD client uses code similar to that to
> >> parse TELNET and ANSI codes from a stream of bytes in a socket (and
> >> one of its "states" is that there's no more data available, so wait on
> >> the socket); the rewrite in a high level language divides the string
> >> on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
> >> then afterward splits on "\n" to divide into lines. The code's much
> >> less convoluted, it's easier to test different parts (because I can
> >> simply call the ANSI parser with a block of text), and on a modern
> >> computer, you can't see the performance difference (since you spend
> >> most of your time waiting for socket data anyway).
> >
> > Anecdotally and somewhat off-topic, when I wrote my own MUD client in
> > Python, I implemented both TELNET and ANSI parsing in Python using a
> > state machine processing one byte at a time (actually two state
> > machines - one at the protocol layer and one at the client layer; the
> > telnet module is a modified version of the twisted.conch.telnet
> > module).  I was worried that the processing would be too slow in
> > Python.  When I got it running, it turned out that there was a
> > noticeable lag between input being received and displayed.  But when I
> > profiled the issue it actually turned out that the rich text control I
> > was using, which was written in C++, wasn't able to handle a large
> > buffer gracefully.  The parsing code in Python did just fine and
> > didn't contribute to the lag issue at all.
> 
> The opposite decision, the same result. Performance was *good enough*.
> With Gypsum, my early performance problems were almost completely
> solved by properly handling the expose-event and only painting the
> parts that changed (I don't use a rich text control, I use a
> GTK2.DrawingArea). Text processing of something that's come over a
> network is seldom a problem; if it were, we'd see noticeably slower
> downloads over HTTPS than HTTP, and that just doesn't happen. I think
> (though I can't prove) that crypto written in C is probably more
> expensive than ANSI parsing written in Python.
> 
> ChrisA

It's rare to find applications these days that are truly CPU bound.  
Once you've used some reasonable algorithm, i.e. not done anything in 
O(n^2) that could have been done in O(n) or O(n log n), you will more 
often run up against I/O speed, database speed, network latency, memory 
exhaustion, or some such as the reason your code is too slow.

The upshot of this is that for most things, Python (even though it runs 
an order of magnitude slower than C), will always be *good enough*.

I'm sure I've mentioned this before, but the application code for 
Songza.com is 100% Python.  We pretty much can't even measure how much 
CPU time is spent running Python code.  Everything is database, network, 
and (occasionally) disk throughput.

[toc] | [prev] | [next] | [standalone]

#36272

From	Roy Smith <roy@panix.com>
Date	2013-01-06 12:28 -0500
Message-ID	<roy-7E018D.12285506012013@news.panix.com>
In reply to	#36209

In article <roy-103D43.15470305012013@news.panix.com>,
 Roy Smith <roy@panix.com> wrote:

> It's rare to find applications these days that are truly CPU bound.  
> Once you've used some reasonable algorithm, i.e. not done anything in 
> O(n^2) that could have been done in O(n) or O(n log n), you will more 
> often run up against I/O speed, database speed, network latency, memory 
> exhaustion, or some such as the reason your code is too slow.

Well, I just found a counter-example :-)

I've been doing some log analysis.  It's been taking a grovelingly long 
time, so I decided to fire up the profiler and see what's taking so 
long.  I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots 
might be (looking up IP addresses in the geolocation database, or 
producing some pretty pictures using matplotlib).  It was just a matter 
of figuring out which it was.

As with most attempts to out-guess the profiler, I was totally, 
absolutely, and embarrassingly wrong.

It turns out we were spending most of the time parsing timestamps!  
Since there's no convenient way (I don't consider strptime() to be 
convenient) to parse isoformat strings in the standard library, our 
habit has been to use the oh-so-simple parser from the third-party 
dateutil package.  Well, it turns out that's slow as all get-out 
(probably because it's trying to be smart about auto-recognizing 
formats).  For the test I ran (on a few percent of the real data), we 
spent 90 seconds in parse().

OK, so I dragged out the strptime() docs and built the stupid format 
string (%Y-%m-%dT%H:%M:%S+00:00).  That got us down to 25 seconds in 
strptime().

But, I could also see it was spending a significant amount in routines 
that looked like they were computing things like day of the week that we 
didn't need.  For what I was doing, we only really needed the hour and 
minute.  So I tried:

        t_hour = int(date[11:13])
        t_minute = int(date[14:16])

that got us down to 12 seconds overall (including the geolocation and 
pretty pictures).

I think it turns out we never do anything with the hour and minute other 
than print them back out, so just

       t_hour_minute = date[11:16]

would probably be good enough, but I think I'm going to stop where I am 
and declare victory :-)

[toc] | [prev] | [next] | [standalone]

#36302

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-01-06 23:19 +0000
Message-ID	<50ea0699$0$30003$c3e8da3$5496439d@news.astraweb.com>
In reply to	#36272

On Sun, 06 Jan 2013 12:28:55 -0500, Roy Smith wrote:

> I've been doing some log analysis.  It's been taking a grovelingly long
> time, so I decided to fire up the profiler and see what's taking so
> long.  I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
> might be (looking up IP addresses in the geolocation database, or
> producing some pretty pictures using matplotlib).  It was just a matter
> of figuring out which it was.
> 
> As with most attempts to out-guess the profiler, I was totally,
> absolutely, and embarrassingly wrong.

+1 QOTW

Thank you for sharing this with us. 


-- 
Steven

[toc] | [prev] | [next] | [standalone]

#36173

From	Roy Smith <roy@panix.com>
Date	2013-01-05 09:12 -0500
Message-ID	<roy-1F6C9B.09123105012013@news.panix.com>
In reply to	#36153

In article <e480480d-f3b4-4491-969c-7d1843bf9e33@googlegroups.com>,
 Sia <hossein.asgharian@gmail.com> wrote:

> I have strings such as:
> 
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG

Some kind of DNA binding site?

A couple of questions.  Are the numbers always single digits?  How much 
data is there?  Are we talking a few hundred 20-character strings, or 
all of Genbank?

> The plus and minus signs are always followed by a number (say, i). I want 
> python to find each single plus or minus, remove the sign, the number after 
> it and remove i characters after that. So the two strings above become:
> 
> tA..,
> and
> ...

If I follow your description properly, the last output should be "...." 
(4 dots), right?  This looks like it should work.  I'm sure there's more 
efficient ways to do it, but for small inputs, this should be ok.˜

The general pattern here is a state machine.  It's a good pattern to 
learn if you're going to be doing any kind of sequence analysis.  See, 
for example, http://en.wikipedia.org/wiki/State_machine.

# Build up the new string as a list (for efficiency)                                                
new = []

# Keep track of what state we're in.  The three possible states                                     
# are 1) scanning for a region to be deleted, 2) looking for the                                    
# number, and 3) reading past the letters to be deleted.                                            
SCANNING = 1
NUMBER = 2
DROPPING = 3
state = SCANNING

# If we are in state DROPPING, dropcount is the number of                                           
# letters remaining to be dropped.                                                                  
dropcount = 0

old = '.+3ACG.+5CAACG.+3ACG.+3ACG'
for c in old:
    if state == SCANNING:
        if c in '+-':
            state = NUMBER
        else:
            new.append(c)

    elif state == NUMBER:
        # Assume the counts are all single digits.  If not, then
        # we'll need a 4th state for accumulating the digits.  
        dropcount = int(c)
        state = DROPPING

    else:
        assert state == DROPPING
        dropcount -= 1
        if dropcount == 0:
            state = SCANNING

print ''.join(new)

[toc] | [prev] | [next] | [standalone]

#36196

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-01-05 11:24 -0600
Message-ID	<mailman.134.1357406586.2939.python-list@python.org>
In reply to	#36153

On 01/05/13 02:35, Sia wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...

With the same caveat as Frank posted about the second one being 
"...." (4 dots), I don't know how this version times out:

   import re
   r = re.compile(r"[-+](\d+)([^-+]*)")
   def modify(m):
       result = m.group(2)[int(m.group(1)):]
       return result
   for test, expected in (
           ("tA.-2AG.-2AG,-2ag", "tA..,"),
           (".+3ACG.+5CAACG.+3ACG.+3ACG", "...."),
           ):
       s = r.sub(modify, test)
       print "%r -> %r (%r)" % (
           test, s, expected
           )
       assert s == expected, "Nope"

(it passes the tests as modified to "....")

-tkc

[toc] | [prev] | [next] | [standalone]

#36198

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-01-05 12:49 -0600
Message-ID	<mailman.135.1357411673.2939.python-list@python.org>
In reply to	#36153

On 01/05/13 11:24, Tim Chase wrote:
>  I don't know how this version times out:
>
>     import re
>     r = re.compile(r"[-+](\d+)([^-+]*)")
>     def modify(m):
>         result = m.group(2)[int(m.group(1)):]
>         return result

Doh, I intended to change this after testing, making it just

   returm m.group(2)[int(m.group(1)):]

rather than expending an intermediate variable I used for testing

-tkc

[toc] | [prev] | [next] | [standalone]

#36228

From	Mitya Sirenef <msirenef@lightbird.net>
Date	2013-01-06 01:32 -0500
Message-ID	<mailman.156.1357453978.2939.python-list@python.org>
In reply to	#36153

On 01/05/2013 03:35 AM, Sia wrote:
> I have strings such as:
 >
 > tA.-2AG.-2AG,-2ag
 > or
 > .+3ACG.+5CAACG.+3ACG.+3ACG
 >
 > The plus and minus signs are always followed by a number (say, i). I 
want python to find each single plus or minus, remove the sign, the 
number after it and remove i characters after that. So the two strings 
above become:
 >
 > tA..,
 > and
 > ...
 >
 > How can I do that?
 > Thanks.


I think it's a bit cleaner and nicer to do something similar to
itertools.takewhile but takewhile 'eats' a single next value.
I was actually doing some stuff that also needed this. I wonder if
there's a more elegant, robust way to do this?

Here's what I got for now:


class BIterator(object):
     """Iterator with 'buffered' takewhile."""

     def __init__(self, seq):
         self.seq        = iter(seq)
         self.buffer     = []
         self.end_marker = object()
         self.last       = None

     def consume(self, n):
         for _ in range(n): self.next()

     def next(self):
         val = self.buffer.pop() if self.buffer else next(self.seq, 
self.end_marker)
         self.last = val
         return val

     def takewhile(self, test):
         lst = []
         while True:
             val = self.next()
             if val is self.end_marker:
                 return lst
             elif test(val):
                 lst.append(val)
             else:
                 self.buffer.append(val)
                 return lst

     def joined_takewhile(self, test):
         return ''.join(self.takewhile(test))

     def done(self):
         return bool(self.last is self.end_marker)


s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
not_plusminus = lambda x: x not in "+-"
isdigit       = lambda x: x.isdigit()

def process(s):
     lst = []
     s   = BIterator(s)

     while True:
         lst.extend(s.takewhile(not_plusminus))
         if s.done(): break
         s.next()
         n = int(s.joined_takewhile(isdigit))
         s.consume(n)

     return ''.join(lst)


print(process(s))


Obviously it assumes the input is well-formed, but the logic would be
very easy to change to, for example, check for s.done() after each step.

  - mitya



-- 
Lark's Tongue Guide to Python: http://lightbird.net/larks/

[toc] | [prev] | [next] | [standalone]

#36280

From	Mitya Sirenef <msirenef@lightbird.net>
Date	2013-01-06 14:53 -0500
Message-ID	<mailman.180.1357502014.2939.python-list@python.org>
In reply to	#36153

On 01/06/2013 01:32 AM, Mitya Sirenef wrote:
> On 01/05/2013 03:35 AM, Sia wrote:
>> I have strings such as:
> >
> > tA.-2AG.-2AG,-2ag
> > or
> > .+3ACG.+5CAACG.+3ACG.+3ACG
> >
> > The plus and minus signs are always followed by a number (say, i). I 
> want python to find each single plus or minus, remove the sign, the 
> number after it and remove i characters after that. So the two strings 
> above become:
> >
> > tA..,
> > and
> > ...
> >
> > How can I do that?
> > Thanks.
>
>
> I think it's a bit cleaner and nicer to do something similar to
> itertools.takewhile but takewhile 'eats' a single next value.
> I was actually doing some stuff that also needed this. I wonder if
> there's a more elegant, robust way to do this?
>
> Here's what I got for now:
>
>
> class BIterator(object):
>     """Iterator with 'buffered' takewhile."""
>
>     def __init__(self, seq):
>         self.seq        = iter(seq)
>         self.buffer     = []
>         self.end_marker = object()
>         self.last       = None
>
>     def consume(self, n):
>         for _ in range(n): self.next()
>
>     def next(self):
>         val = self.buffer.pop() if self.buffer else next(self.seq, 
> self.end_marker)
>         self.last = val
>         return val
>
>     def takewhile(self, test):
>         lst = []
>         while True:
>             val = self.next()
>             if val is self.end_marker:
>                 return lst
>             elif test(val):
>                 lst.append(val)
>             else:
>                 self.buffer.append(val)
>                 return lst
>
>     def joined_takewhile(self, test):
>         return ''.join(self.takewhile(test))
>
>     def done(self):
>         return bool(self.last is self.end_marker)
>
>
> s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
> not_plusminus = lambda x: x not in "+-"
> isdigit       = lambda x: x.isdigit()
>
> def process(s):
>     lst = []
>     s   = BIterator(s)
>
>     while True:
>         lst.extend(s.takewhile(not_plusminus))
>         if s.done(): break
>         s.next()
>         n = int(s.joined_takewhile(isdigit))
>         s.consume(n)
>
>     return ''.join(lst)
>
>
> print(process(s))
>
>
> Obviously it assumes the input is well-formed, but the logic would be
> very easy to change to, for example, check for s.done() after each step.
>
>  - mitya
>
>
>

I've added some refinements:



class BIterator(object):
     """Iterator with 'buffered' takewhile and takeuntil."""

     def __init__(self, seq):
         self.seq        = iter(seq)
         self.buffer     = []
         self.end_marker = object()
         self.last       = None

     def __bool__(self):
         return self.last is not self.end_marker

     def __next__(self):
         val = self.buffer.pop() if self.buffer else next(self.seq, 
self.end_marker)
         self.last = val
         return val

     def consume(self, n):
         for _ in range(n): next(self)

     def takewhile(self, test):
         lst = []
         while True:
             val = next(self)
             if val is self.end_marker:
                 return lst
             elif test(val):
                 lst.append(val)
             else:
                 self.buffer.append(val)
                 return lst

     def takeuntil(self, test):
         negtest = lambda x: not test(x)
         return self.takewhile(negtest)

     def joined_takewhile(self, test):
         return ''.join(self.takewhile(test))

     def joined_takeuntil(self, test):
         return ''.join(self.takeuntil(test))


def process(s):
     s         = BIterator(s)
     lst       = []
     plusminus = lambda x: x in "+-"
     isdigit   = lambda x: x.isdigit()

     while s:
         lst.extend(s.takeuntil(plusminus))
         next(s)
         n = s.joined_takewhile(isdigit) or 0
         s.consume(int(n))

     return ''.join(lst)


s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
print(process(s))




-- 
Lark's Tongue Guide to Python: http://lightbird.net/larks/

[toc] | [prev] | [next] | [standalone]

#36318

From	Nick Mellor <thebalancepro@gmail.com>
Date	2013-01-06 18:48 -0800
Message-ID	<2ccd1e6e-37aa-46d0-a02a-baaa859c055b@googlegroups.com>
In reply to	#36153

Hi Sia,

Here's another variation. It's within my tolerance for readability :-) and also quick, if that's an issue. It does 100,000 of your longer string in a couple of seconds on my venerable laptop.

It handles only single-digit numbers. For multi-digit, I'd be inclined to have a look at takewhile and/or dropwhile, both in itertools (Python 2.7 and 3.x only.)


from string import maketrans

class redux:

    def __init__(self):
       intab = '+-'
       outtab = '  ' # two spaces
       self.trantab = maketrans(intab, outtab)

    
    def reduce_plusminus(self, s):
        list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
                    for r
                    in s.translate(self.trantab).split()]
        return ''.join(list_form)

if __name__ == "__main__":
    p = redux()
    print p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")
    print p.reduce_plusminus("tA.-2AG.-2AG,-2ag")

    for n in range(100000): p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")

On Saturday, 5 January 2013 19:35:26 UTC+11, Sia  wrote:
> I have strings such as:
> 
> 
> 
> tA.-2AG.-2AG,-2ag
> 
> or
> 
> .+3ACG.+5CAACG.+3ACG.+3ACG
> 
> 
> 
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
> 
> 
> 
> tA..,
> 
> and
> 
> ...
> 
> 
> 
> How can I do that?
> 
> Thanks.

[toc] | [prev] | [next] | [standalone]

Page 1 of 2 [1] 2 Next page →

csiph-web

Need a specific sort of string modification. Can someone help?

Contents

#36153 — Need a specific sort of string modification. Can someone help?

#36154

#36155

#36175

#36177

#36178

#36179

#36183

#36189

#36205

#36206

#36209

#36272

#36302

#36173

#36196

#36198

#36228

#36280

#36318