Re: help with making my code more efficient

Path	csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Return-Path	<Larry.Martell@gmail.com>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
X-Spam-Status	OK 0.000
X-Spam-Evidence	'H': 1.00; 'S': 0.00; 'algorithm': 0.03; 'true,': 0.04; 'context': 0.05; 'diff': 0.05; 'intermediate': 0.05; 'memory.': 0.05; '21,': 0.07; 'caller': 0.07; 'function,': 0.07; 'matches': 0.07; 'subject:code': 0.07; 'tool,': 0.07; 'subject:help': 0.07; 'missed': 0.09; 'dict': 0.09; 'friday,': 0.09; 'item,': 0.09; 'keyed': 0.09; 'loop.': 0.09; 'matched': 0.09; 'predecessor': 0.09; 'referenced': 0.09; 'rows': 0.09; 'rows,': 0.09; 'threshold': 0.09; 'to:addr:comp.lang.python': 0.09; 'cc:addr:python-list': 0.10; 'def': 0.10; 'index': 0.13; '80k': 0.16; 'bisect': 0.16; 'fold': 0.16; 'loops': 0.16; 'row': 0.16; 'seconds.': 0.16; 'structure.': 0.16; 'subject:making': 0.16; 'sure.': 0.16; 'tool.': 0.16; 'tup': 0.16; 'typo': 0.16; 'string': 0.17; 'wrote:': 0.17; 'element': 0.17; 'passes': 0.17; 'skip': 0.17; 'string,': 0.17; 'obviously': 0.18; 'memory': 0.18; 'variable': 0.20; 'changes': 0.20; 'sort': 0.21; 'thanks.': 0.21; 'do.': 0.21; 'error.': 0.21; 'supposed': 0.21; 'assignment': 0.22; 'assuming': 0.22; 'delta': 0.22; 'parse': 0.22; "user's": 0.22; 'help.': 0.22; "i'd": 0.22; 'matching': 0.23; 'nearly': 0.23; 'originally': 0.23; 'specified': 0.23; "i've": 0.23; 'cc:21': 0.24; 'machine': 0.24; 'pass': 0.25; 'cc:addr:python.org': 0.25; 'header:In-Reply-To:1': 0.25; 'header:User-Agent:1': 0.26; '(which': 0.26; 'skip:m 30': 0.26; 'checking': 0.27; 'done.': 0.27; 'wonder': 0.27; 'fixed': 0.28; 'all.': 0.28; 'post': 0.28; 'record': 0.28; 'directly,': 0.29; 'equivalent.': 0.29; 'case,': 0.29; 'probably': 0.29; "i'm": 0.29; "we're": 0.30; 'knows': 0.30; 'worked': 0.30; 'query': 0.30; 'returned': 0.30; 'error': 0.30; 'seconds': 0.30; 'up.': 0.31; 'lists': 0.31; 'code': 0.31; 'point': 0.31; 'gets': 0.32; 'december': 0.32; 'help,': 0.32; 'could': 0.32; 'message.': 0.33; 'messages,': 0.33; 'minutes,': 0.33; 'true.': 0.33; 'code:': 0.33; 'equal': 0.33; 'themselves': 0.33; 'that,': 0.34; "can't": 0.34; 'skip:b 20': 0.34; 'received:google.com': 0.34; 'done': 0.34; 'thanks': 0.34; 'list': 0.35; 'needed': 0.35; 'data,': 0.35; 'false': 0.35; 'pm,': 0.35; "won't": 0.35; 'received:209.85': 0.35; 'list.': 0.35; 'really': 0.36; 'tool': 0.36; 'but': 0.36; 'be.': 0.36; 'cc:no real name:21': 0.36; 'compare': 0.36; "didn't": 0.36; 'subject:with': 0.36; 'test': 0.36; 'making': 0.64; 'taking': 0.65; 'total': 0.65; 'improvement.': 0.65; 'unnecessary': 0.65; 'hours': 0.66; 'talking': 0.66; 'records': 0.68; 'wish': 0.70; 'now:': 0.71; '100': 0.78; '30k': 0.84; 'disappointed': 0.84; 'improvement': 0.84; 'itself?': 0.84; 'fragment': 0.91; 'angel': 0.93
Newsgroups	comp.lang.python
Date	Fri, 21 Dec 2012 20:47:28 -0800 (PST)
In-Reply-To	<mailman.1183.1356146412.29569.python-list@python.org>
Complaints-To	groups-abuse@google.com
Injection-Info	glegroupsg2000goo.googlegroups.com; posting-host=68.84.146.219; posting-account=aFD2wgkAAACT3OnBYoNKQGBzyOZ_PB2h
References	<bd5ae6b7-2440-42e4-a93c-eb877feebcfe@googlegroups.com> <mailman.1126.1356052648.29569.python-list@python.org> <d6aaa5b5-7d21-4018-ba9a-ea354b15b6c5@googlegroups.com> <mailman.1134.1356060700.29569.python-list@python.org> <mailman.1163.1356112649.29569.python-list@python.org> <b334fb0c-0e76-400d-90e4-1ecde1cc2090@googlegroups.com> <mailman.1183.1356146412.29569.python-list@python.org>
User-Agent	G2/1.0
X-Google-Web-Client	true
X-Google-IP	68.84.146.219
MIME-Version	1.0
Subject	Re: help with making my code more efficient
From	"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
To	comp.lang.python@googlegroups.com
Content-Type	text/plain; charset=ISO-8859-1
Content-Transfer-Encoding	quoted-printable
Cc	python-list@python.org, d@davea.name
X-BeenThere	python-list@python.org
X-Mailman-Version	2.1.15
Precedence	list
List-Id	General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe	<http://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive	<http://mail.python.org/pipermail/python-list/>
List-Post	<mailto:python-list@python.org>
List-Help	<mailto:python-list-request@python.org?subject=help>
List-Subscribe	<http://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Message-ID	<mailman.1184.1356151656.29569.python-list@python.org> (permalink)
Lines	152
NNTP-Posting-Host	2001:888:2000:d::a6
X-Trace	1356151656 news.xs4all.nl 6921 [2001:888:2000:d::a6]:57426
X-Complaints-To	abuse@xs4all.nl
Xref	csiph.com comp.lang.python:35338

Show key headers only | View raw

On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote:
> 
> > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
> 
> >> <snip>
> 
> >> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. 
> 
> > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. 
> 
> >
> 
> > Thanks very much for all your help. 
> 
> Congratulations. But you're not done. A fourfold improvement isn't
> nearly as much as I'd expect.  When you go from a n-squared algorithm to
> an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold 
> improvement.  Assuming we're talking just about time spent in the cdata 
>  = list-comprehension list.
> 
> It's also probably possible to simplify, and get some speedup from other
> parts of the code other than that one loop.  For example, if the bisect
> is really working right, it's probably unnecessary to even copy the
> original cdata.  You said it was sorted.  So bisect it directly, and
> skip the messageTimes intermediate data structure.  It may not help, but
> i suspect it will.
> >
> 
> >> This was the code I had:
> >> 
> >> times = messageTimes[tup[0]]
> 
> >>
> 
> >> le = bisect.bisect_right(times, tup[1])
> >> ge = bisect.bisect_left(times, tup[1])
> >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
> 
> 
> 
> I'm stymied making further suggestions to this fragment.  Without seeing
> the changes you apparently made to messageTimes, I can't even be sure 
> this is equivalent.  And I wonder if even looking up tup[1] is 
> worthwhile, since your caller already knows the index in cdata.  If you
> used cdata directly, you might not need a sort at all.
>  
> I wish you could post some sample data, and identify which records are 
> intended to be True.  Obviously you could simplify, and just use ints 
> where it's using datetimes here.  But if your rule is simply accept all
> records that come in a cluster with no delta more than a threshold, then
> you don't even need any search at all.  Just pass the index in, and
> compare the datetime with its predecessor and successor.
> 
> Could we see the whole fragment as you originally had it, but with the 
> optimizations you've done so far?  The more I look at that determine() 
> function, the more futile it seems.  I just don't understand what the
> criteria are supposed to be.  Your list-comp loops through all of cdata,
> and for each item, passes that item to determine().  determine() loops
> through a copy of cdata, checking to see if any of them is within tdiff
> of tup.  But won't that always return True, since you'll encounter the
> record and compare it to itself?

I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.

cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.

Here's my original code:

# record time for each message matching the specified message for each tool 
messageTimes = {} 
for row in cdata:   # tool, time, message 
    if self.message in row[2]: 
        messageTimes[row[0], row[1]] = 1 

# now pull out each message that is within the time diff for each matched message 
# as well as the matched messages themselves 

def determine(tup): 
    if self.message in tup[2]: return True      # matched message 

    for (tool, date_time) in messageTimes: 
        if tool == tup[0]: 
            if abs(date_time-tup[1]) <= tdiff: 
               return True 

    return False 
        
cdata[:] = [tup for tup in cdata if determine(tup)] 

Here's the code now:


       # Parse data and make a list of the time for each message matching the specified message for each tool
        messageTimes = defaultdict(list)    # a dict with sorted lists

        for row in cdata:   # tool, time, message
            if self.message in row[2]:
                messageTimes[row[0]].append(row[1])

        # now pull out each message that is within the time context for each matched message
        # as well as the matched messages themselves

        # return true is we should keep this message
        def determine(tup):
            if self.message in tup[2]: return True     # matched message
            if seconds == 0: return False                # no time context specified

            times = messageTimes[tup[0]]              # get the list of matched messages for this tool
           
            le = bisect.bisect_right(times, tup[1])   # find time less than or equal to tup[1]
            ge = bisect.bisect_left(times, tup[1])    # find time greater then to equal to tup[1]
            return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

        cdata = [tup for tup in cdata if determine(tup)]

In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold. 

I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. 

I hope I've explained this better. Thanks again.

Thread

help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:19 -0800
  Re: help with making my code more efficient Chris Angelico <rosuav@gmail.com> - 2012-12-21 11:38 +1100
    Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:43 -0800
      Re: help with making my code more efficient Chris Angelico <rosuav@gmail.com> - 2012-12-21 13:56 +1100
      Re: help with making my code more efficient Roy Smith <roy@panix.com> - 2012-12-20 22:30 -0500
    Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:43 -0800
  Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-20 20:17 -0500
    Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 17:46 -0800
      Re: help with making my code more efficient Mitya Sirenef <msirenef@lightbird.net> - 2012-12-20 21:39 -0500
      Re: help with making my code more efficient Mitya Sirenef <msirenef@lightbird.net> - 2012-12-20 21:49 -0500
      Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-20 22:31 -0500
        Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 09:57 -0800
        Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 09:57 -0800
          Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 12:36 -0800
            Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-21 22:19 -0500
              Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 20:47 -0800
                Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-22 01:47 -0500
                Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-24 09:57 -0800
                Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-24 09:57 -0800
              Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 20:47 -0800
          Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 12:36 -0800
    Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 17:46 -0800
  Re: help with making my code more efficient MRAB <python@mrabarnett.plus.com> - 2012-12-21 02:08 +0000

csiph-web