Path: csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.000 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'algorithm': 0.03; 'true,': 0.04; 'context': 0.05; 'diff': 0.05; 'intermediate': 0.05; 'memory.': 0.05; '21,': 0.07; 'caller': 0.07; 'function,': 0.07; 'matches': 0.07; 'subject:code': 0.07; 'tool,': 0.07; 'subject:help': 0.07; 'missed': 0.09; 'dict': 0.09; 'friday,': 0.09; 'item,': 0.09; 'keyed': 0.09; 'loop.': 0.09; 'matched': 0.09; 'predecessor': 0.09; 'referenced': 0.09; 'rows': 0.09; 'rows,': 0.09; 'threshold': 0.09; 'to:addr:comp.lang.python': 0.09; 'cc:addr:python-list': 0.10; 'def': 0.10; 'index': 0.13; '80k': 0.16; 'bisect': 0.16; 'fold': 0.16; 'loops': 0.16; 'row': 0.16; 'seconds.': 0.16; 'structure.': 0.16; 'subject:making': 0.16; 'sure.': 0.16; 'tool.': 0.16; 'tup': 0.16; 'typo': 0.16; 'string': 0.17; 'wrote:': 0.17; 'element': 0.17; 'passes': 0.17; 'skip': 0.17; 'string,': 0.17; 'obviously': 0.18; 'memory': 0.18; 'variable': 0.20; 'changes': 0.20; 'sort': 0.21; 'thanks.': 0.21; 'do.': 0.21; 'error.': 0.21; 'supposed': 0.21; 'assignment': 0.22; 'assuming': 0.22; 'delta': 0.22; 'parse': 0.22; "user's": 0.22; 'help.': 0.22; "i'd": 0.22; 'matching': 0.23; 'nearly': 0.23; 'originally': 0.23; 'specified': 0.23; "i've": 0.23; 'cc:2**1': 0.24; 'machine': 0.24; 'pass': 0.25; 'cc:addr:python.org': 0.25; 'header:In-Reply-To:1': 0.25; 'header:User-Agent:1': 0.26; '(which': 0.26; 'skip:m 30': 0.26; 'checking': 0.27; 'done.': 0.27; 'wonder': 0.27; 'fixed': 0.28; 'all.': 0.28; 'post': 0.28; 'record': 0.28; 'directly,': 0.29; 'equivalent.': 0.29; 'case,': 0.29; 'probably': 0.29; "i'm": 0.29; "we're": 0.30; 'knows': 0.30; 'worked': 0.30; 'query': 0.30; 'returned': 0.30; 'error': 0.30; 'seconds': 0.30; 'up.': 0.31; 'lists': 0.31; 'code': 0.31; 'point': 0.31; 'gets': 0.32; 'december': 0.32; 'help,': 0.32; 'could': 0.32; 'message.': 0.33; 'messages,': 0.33; 'minutes,': 0.33; 'true.': 0.33; 'code:': 0.33; 'equal': 0.33; 'themselves': 0.33; 'that,': 0.34; "can't": 0.34; 'skip:b 20': 0.34; 'received:google.com': 0.34; 'done': 0.34; 'thanks': 0.34; 'list': 0.35; 'needed': 0.35; 'data,': 0.35; 'false': 0.35; 'pm,': 0.35; "won't": 0.35; 'received:209.85': 0.35; 'list.': 0.35; 'really': 0.36; 'tool': 0.36; 'but': 0.36; 'be.': 0.36; 'cc:no real name:2**1': 0.36; 'compare': 0.36; "didn't": 0.36; 'subject:with': 0.36; 'test': 0.36; 'making': 0.64; 'taking': 0.65; 'total': 0.65; 'improvement.': 0.65; 'unnecessary': 0.65; 'hours': 0.66; 'talking': 0.66; 'records': 0.68; 'wish': 0.70; 'now:': 0.71; '100': 0.78; '30k': 0.84; 'disappointed': 0.84; 'improvement': 0.84; 'itself?': 0.84; 'fragment': 0.91; 'angel': 0.93 Newsgroups: comp.lang.python Date: Fri, 21 Dec 2012 20:47:28 -0800 (PST) In-Reply-To: Complaints-To: groups-abuse@google.com Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=68.84.146.219; posting-account=aFD2wgkAAACT3OnBYoNKQGBzyOZ_PB2h References: User-Agent: G2/1.0 X-Google-Web-Client: true X-Google-IP: 68.84.146.219 MIME-Version: 1.0 Subject: Re: help with making my code more efficient From: "Larry.Martell@gmail.com" To: comp.lang.python@googlegroups.com Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: python-list@python.org, d@davea.name X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Message-ID: Lines: 152 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1356151656 news.xs4all.nl 6921 [2001:888:2000:d::a6]:57426 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:35338 On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote: > On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote: >=20 > > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wro= te: >=20 > >> >=20 > >> Didn't know about bisect. Thanks. I thought it would be my savior for = sure. But unfortunaly when I added that, it blows up with out of memory.=20 >=20 > > The out of memory error had nothing to do with using bisect. I had intr= oduced a typo that I really though would have caused a variable referenced = before assignment error. But it did not do that, and instead somehow caused= all the memory in my machine to get used up. When I fixed that, it worked = really well with bisect. The code that was taking 2 hours was down to 20 mi= nutes, and even better, a query that was taking 40 minutes was down to 8 se= conds.=20 >=20 > > >=20 > > Thanks very much for all your help.=20 >=20 > Congratulations. But you're not done. A fourfold improvement isn't > nearly as much as I'd expect. When you go from a n-squared algorithm to > an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold=20 > improvement. Assuming we're talking just about time spent in the cdata= =20 > =3D list-comprehension list. >=20 > It's also probably possible to simplify, and get some speedup from other > parts of the code other than that one loop. For example, if the bisect > is really working right, it's probably unnecessary to even copy the > original cdata. You said it was sorted. So bisect it directly, and > skip the messageTimes intermediate data structure. It may not help, but > i suspect it will. > > >=20 > >> This was the code I had: > >>=20 > >> times =3D messageTimes[tup[0]] >=20 > >> >=20 > >> le =3D bisect.bisect_right(times, tup[1]) > >> ge =3D bisect.bisect_left(times, tup[1]) > >> return (le and tup[1]-times[le-1] <=3D tdiff) or (ge !=3D len(times) a= nd times[ge]-tup[1] <=3D tdiff) >=20 >=20 >=20 > I'm stymied making further suggestions to this fragment. Without seeing > the changes you apparently made to messageTimes, I can't even be sure=20 > this is equivalent. And I wonder if even looking up tup[1] is=20 > worthwhile, since your caller already knows the index in cdata. If you > used cdata directly, you might not need a sort at all. > =20 > I wish you could post some sample data, and identify which records are=20 > intended to be True. Obviously you could simplify, and just use ints=20 > where it's using datetimes here. But if your rule is simply accept all > records that come in a cluster with no delta more than a threshold, then > you don't even need any search at all. Just pass the index in, and > compare the datetime with its predecessor and successor. >=20 > Could we see the whole fragment as you originally had it, but with the=20 > optimizations you've done so far? The more I look at that determine()=20 > function, the more futile it seems. I just don't understand what the > criteria are supposed to be. Your list-comp loops through all of cdata, > and for each item, passes that item to determine(). determine() loops > through a copy of cdata, checking to see if any of them is within tdiff > of tup. But won't that always return True, since you'll encounter the > record and compare it to itself? I think you're misunderstanding what I need to do. I have a set of rows fro= m the database with tool, time, and message. The user has specified a strin= g and a time threshold. From that set of rows I need to return all the rows= that contain the user's string and all the other rows that match the tool = from the matched rows and have a time within the threshold. cdata has all the rows. messageTimes has the times of all the matched messa= ges, keyed by tool. In determine() I don't look though cdata - it gets one = element from cdata and I see if that should be selected because it either m= atches the user's message, or it's within the threshold of one that did mat= ch. Here's my original code: # record time for each message matching the specified message for each tool= =20 messageTimes =3D {}=20 for row in cdata: # tool, time, message=20 if self.message in row[2]:=20 messageTimes[row[0], row[1]] =3D 1=20 # now pull out each message that is within the time diff for each matched m= essage=20 # as well as the matched messages themselves=20 def determine(tup):=20 if self.message in tup[2]: return True # matched message=20 for (tool, date_time) in messageTimes:=20 if tool =3D=3D tup[0]:=20 if abs(date_time-tup[1]) <=3D tdiff:=20 return True=20 return False=20 =20 cdata[:] =3D [tup for tup in cdata if determine(tup)]=20 Here's the code now: # Parse data and make a list of the time for each message matching t= he specified message for each tool messageTimes =3D defaultdict(list) # a dict with sorted lists for row in cdata: # tool, time, message if self.message in row[2]: messageTimes[row[0]].append(row[1]) # now pull out each message that is within the time context for eac= h matched message # as well as the matched messages themselves # return true is we should keep this message def determine(tup): if self.message in tup[2]: return True # matched message if seconds =3D=3D 0: return False # no time cont= ext specified times =3D messageTimes[tup[0]] # get the list of m= atched messages for this tool =20 le =3D bisect.bisect_right(times, tup[1]) # find time less th= an or equal to tup[1] ge =3D bisect.bisect_left(times, tup[1]) # find time greater= then to equal to tup[1] return (le and tup[1]-times[le-1] <=3D tdiff) or (ge !=3D len(t= imes) and times[ge]-tup[1] <=3D tdiff) cdata =3D [tup for tup in cdata if determine(tup)] In my test case, cdata started with 600k rows, 30k matched the users string= , and a total of 110k needed to be returned (which is what cdata ended up w= ith) - the 30k that matched the string, and 80k that were within the time t= hreshold.=20 I think the point you may have missed is the tool - I only return a row if = it's the same tool as a matched message and within the threshold.=20 I hope I've explained this better. Thanks again.