Groups > comp.lang.python > #35242 > unrolled thread

help with making my code more efficient

Started by	"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
First post	2012-12-20 16:19 -0800
Last post	2012-12-21 02:08 +0000
Articles	3 on this page of 23 — 6 participants

Back to article view | Back to comp.lang.python

  help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:19 -0800
    Re: help with making my code more efficient Chris Angelico <rosuav@gmail.com> - 2012-12-21 11:38 +1100
      Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:43 -0800
        Re: help with making my code more efficient Chris Angelico <rosuav@gmail.com> - 2012-12-21 13:56 +1100
        Re: help with making my code more efficient Roy Smith <roy@panix.com> - 2012-12-20 22:30 -0500
      Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:43 -0800
    Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-20 20:17 -0500
      Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 17:46 -0800
        Re: help with making my code more efficient Mitya Sirenef <msirenef@lightbird.net> - 2012-12-20 21:39 -0500
        Re: help with making my code more efficient Mitya Sirenef <msirenef@lightbird.net> - 2012-12-20 21:49 -0500
        Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-20 22:31 -0500
          Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 09:57 -0800
          Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 09:57 -0800
            Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 12:36 -0800
              Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-21 22:19 -0500
                Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 20:47 -0800
                  Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-22 01:47 -0500
                    Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-24 09:57 -0800
                    Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-24 09:57 -0800
                Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 20:47 -0800
            Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 12:36 -0800
      Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 17:46 -0800
    Re: help with making my code more efficient MRAB <python@mrabarnett.plus.com> - 2012-12-21 02:08 +0000

Page 2 of 2 — ← Prev page 1 [2]

#35316

From	"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date	2012-12-21 12:36 -0800
Message-ID	<mailman.1166.1356122185.29569.python-list@python.org>
In reply to	#35311

On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
> On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> > On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote: 
> > > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: 
> > >> <snip>
 
> > > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. 
 
> > But it seems these are methods in a class, or something, so we're 
> > missing context.  And you use self without it being an argument to the 
> > function.  Like it's a global.

> I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as:
>  
>     def generate_data(self):

> > > <snip>
 
> > > Yes, the code works. I end up with just the rows I want.
  
> > >> Are you only concerned about speed, not fixing features?  
 
> > > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.
 
> > >> As far as I can tell, the logic that includes the time comparison is bogus.  
 
> > > Not at all. 

> > >> You don't do  anything there to worry about the value of tup[2], just whether some
> > >> item has a nearby time.  Of course, I could misunderstand the spec.

> > > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() 
 
> > I thought that tup[1] was the datetime.  In any case, the loop makes no
> > sense to me, so I can't really optimize it, just make suggestions.
 
> Yes, tup[1] is the datetime. I mistyped last night. 

> > >> Are you making a global called 'self' ?  That name is by convention only
> > >> used in methods to designate the instance object.  What's the attribute
> > >> self?

> > > Yes, self is my instance object. self.message contains the string of interest that I need to look for. 

> > >> Can cdata have duplicates, and are they significant? 

> > > No, it will not have duplicates.

> > >> Is the list sorted in any way?
 
> > > Yes, the list is sorted by tool and datetime.

> > >> Chances are your performance bottleneck is the doubly-nested loop.  You

> > >> have a list comprehension at top-level code, and inside it calls a

> > >> function that also loops over the 600,000 items.  So the inner loop gets

> > >> executed 360 billion times.  You can cut this down drastically by some

> > >> judicious sorting, as well as by having a map of lists, where the map is
> > >> keyed by the tool.

> > > Thanks. I will try that.

> > So in your first loop, you could simply split the list into separate 
> > lists, one per tup[0] value, and the lists as dictionary items, keyed by
> > that tool string.

> > Then inside the determine() function, make a local ref to the particular 
> > list for the tool.
> >    recs = messageTimes[tup[0]]

> I made that change ant went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app. 

> > Instead of a for loop over recs, use a binary search to identify the 
> > first item that's >= date_time-tdiff.  Then if it's less than 
> > date_time+tdiff, return True, otherwise False.  Check out the bisect 
> > module.  Function bisect_left() should do what you want in a sorted list.

> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. 

The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. 

Thanks very much for all your help. 




> This was the code I had:
>  
> times = messageTimes[tup[0]]
> 
> le = bisect.bisect_right(times, tup[1])
> 
> ge = bisect.bisect_left(times, tup[1])
> 
> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
> 
> 
> 
> > >>> cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> > 
> 
> > >> As the code exists, there's no need to copy the list.  Just do a simple 
> 
> > >> bind.
> 
> > 
> 
> > > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. 
> 
> > 
> 
> > Every "assignment" to a simple name is really a rebinding of that name.
> 
> > cdata = [tup for tup in cdata if determine(tup)]
> 
> > 
> 
> > will rebind the name to the new object, much quicker than copying.  If
> 
> > this is indeed a top-level line, it should be equivalent.  But if in
> 
> > fact this is inside some other function, it may violate some other
> 
> > assumptions.  In particular, if there are other names for the same
> 
> > object, then you're probably stuck with modifying it in place, using
> 
> > slice notation.
> 
> 
> 
> The slice notation  was left over when when cdata was a tuple. Now that it's a list I don't need that any more. 
> 
> 
> 
> > BTW, a set is generally much more memory efficient than a dict, when you
> 
> > don't use the "value".  But since I think you'll be better off with a
> 
> > dict of lists, it's a moot point.
> 
> 
> 
> I'm going back to square 1 and try and do all from SQL.

[toc] | [prev] | [next] | [standalone]

#35257

From	"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date	2012-12-20 17:46 -0800
Message-ID	<mailman.1128.1356054403.29569.python-list@python.org>
In reply to	#35252

On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
> 
> > I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool. 
> 
> > Here is how I am currently doing this:
> 
> No, it's not.  This is a fragment of code, without enough clues as to
> 
> what else is going.  We can guess, but that's likely to make a mess.

Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. 

> First question is whether this code works exactly correctly?  

Yes, the code works. I end up with just the rows I want. 

> Are you only concerned about speed, not fixing features?  

Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.

> As far as I can tell, the logic that includes the time comparison is bogus.  

Not at all. 

> You don't do  anything there to worry about the value of tup[2], just whether some
> item has a nearby time.  Of course, I could misunderstand the spec.

The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() 

> Are you making a global called 'self' ?  That name is by convention only
> used in methods to designate the instance object.  What's the attribute
> self?

Yes, self is my instance object. self.message contains the string of interest that I need to look for. 

> Can cdata have duplicates, and are they significant? 

No, it will not have duplicates.

> Are you including  the time building that as part of your 2 hour measurement? 

No, the 2 hours is just the time to run the 

cdata[:] = [tup for tup in cdata if determine(tup)]

> Is the list sorted in any way?

Yes, the list is sorted by tool and datetime.

> Chances are your performance bottleneck is the doubly-nested loop.  You
> have a list comprehension at top-level code, and inside it calls a
> function that also loops over the 600,000 items.  So the inner loop gets
> executed 360 billion times.  You can cut this down drastically by some
> judicious sorting, as well as by having a map of lists, where the map is
> keyed by the tool.

Thanks. I will try that.

> > # record time for each message matching the specified message for each tool
> 
> > messageTimes = {}
>
> You're building a dictionary;  are you actually using the value (1), or
>  is only the key relevant?  

Only the keys.

> A set is a dict without a value.  

Yes, I could use a set, but I don't think that would make it measurably faster. 

> But more mportantly, you never look up anything in this dictionary.  So why
> isn't it a list?  For that matter, why don't you just use the
> messageTimes list?

Yes, it could be a list too.
 
> > for row in cdata:   # tool, time, message
> 
> >     if self.message in row[2]:
> 
> >         messageTimes[row[0], row[1]] = 1
> 
> >
> 
> > # now pull out each message that is within the time diff for each matched message
> 
> > # as well as the matched messages themselves
> 
> >
> 
> > def determine(tup):
> 
> >     if self.message in tup[2]: return True      # matched message 
> 
> >
> 
> >     for (tool, date_time) in messageTimes:
> 
> >         if tool == tup[0]:
> 
> >             if abs(date_time-tup[1]) <= tdiff: 
> 
> >                return True
> 
> >
> 
> >     return False
> 
> >         
> 
> > cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> 
> 
> As the code exists, there's no need to copy the list.  Just do a simple
> bind.

This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. 

> 
> 
> 
> >
> 
> > This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run. 
> 
> >
> 
> > Can anyone give me some suggestions on speeding this up?

[toc] | [prev] | [next] | [standalone]

#35260

From	MRAB <python@mrabarnett.plus.com>
Date	2012-12-21 02:08 +0000
Message-ID	<mailman.1130.1356055729.29569.python-list@python.org>
In reply to	#35242

On 2012-12-21 00:19, Larry.Martell@gmail.com wrote:
> I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
>
> Here is how I am currently doing this:
>
> # record time for each message matching the specified message for each tool
> messageTimes = {}
> for row in cdata:   # tool, time, message
>      if self.message in row[2]:
>          messageTimes[row[0], row[1]] = 1
>
It looks like 'messageTimes' is really a set of tool/time pairs.

You could make it a dict of sets of time; in other words, a set of
times for each tool:

messageTimes = defaultdict(set)
for row in cdata:   # tool, time, message
     if self.message in row[2]:
         messageTimes[row[0]].add(row[1])

> # now pull out each message that is within the time diff for each matched message
> # as well as the matched messages themselves
>
> def determine(tup):
>      if self.message in tup[2]: return True      # matched message
>
>      for (tool, date_time) in messageTimes:
>          if tool == tup[0]:
>              if abs(date_time-tup[1]) <= tdiff:
>                 return True
>
>      return False
>
def determine(tup):
      if self.message in tup[2]: return True      # matched message

      # Scan through the times for the tool given by tup[0].
      for date_time in messageTimes[tup[0]]:
          if abs(date_time - tup[1]) <= tdiff:
             return True

      return False

> cdata[:] = [tup for tup in cdata if determine(tup)]
>
> This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
>
> Can anyone give me some suggestions on speeding this up?
>

[toc] | [prev] | [standalone]

Page 2 of 2 — ← Prev page 1 [2]

csiph-web

help with making my code more efficient

Contents

#35316

#35257

#35260