Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.000 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'attribute': 0.05; 'diff': 0.05; 'executed': 0.07; 'list?': 0.07; 'matches': 0.07; 'subject:code': 0.07; 'tool,': 0.07; 'works.': 0.07; 'subject:help': 0.07; 'python': 0.09; 'dict': 0.09; 'exists,': 0.09; 'keyed': 0.09; 'logic': 0.09; 'loop.': 0.09; 'matched': 0.09; 'rows': 0.09; 'to:addr:comp.lang.python': 0.09; 'cc:addr :python-list': 0.10; 'def': 0.10; 'value.': 0.15; '(1),': 0.16; 'clues': 0.16; 'dictionary.': 0.16; 'keys.': 0.16; 'loops': 0.16; 'parts.': 0.16; 'row': 0.16; 'run.': 0.16; 'set,': 0.16; 'subject:making': 0.16; 'tool.': 0.16; 'top-level': 0.16; 'tup': 0.16; 'string': 0.17; 'wrote:': 0.17; 'instance': 0.17; 'items.': 0.17; 'string,': 0.17; 'typical': 0.17; 'code,': 0.18; 'thanks.': 0.21; 'fixing': 0.22; 'object.': 0.22; 'tuples': 0.22; 'elements': 0.23; 'matching': 0.23; 'specified': 0.23; 'statement': 0.23; 'this:': 0.23; 'cc:2**1': 0.24; 'cc:addr:python.org': 0.25; 'header:In-Reply-To:1': 0.25; 'header:User-Agent:1': 0.26; '(which': 0.26; 'select': 0.26; 'question': 0.27; 'convention': 0.27; 'all.': 0.28; 'run': 0.28; 'record': 0.28; 'comparison': 0.29; 'long.': 0.29; 'way?': 0.29; 'no,': 0.29; "i'm": 0.29; 'that.': 0.30; 'e.g.': 0.30; 'thursday,': 0.30; 'function': 0.30; 'code': 0.31; 'gets': 0.32; 'december': 0.32; 'not.': 0.32; 'could': 0.32; 'message.': 0.33; 'function.': 0.33; 'anyone': 0.33; 'likely': 0.33; 'themselves': 0.33; 'times.': 0.33; 'skip:d 20': 0.34; 'received:google.com': 0.34; 'self': 0.34; 'list': 0.35; 'false': 0.35; 'doing': 0.35; 'pm,': 0.35; 'too.': 0.35; 'received:209.85': 0.35; 'there': 0.35; 'list.': 0.35; 'tool': 0.36; 'but': 0.36; 'cc:no real name:2**1': 0.36; 'anything': 0.36; 'subject:with': 0.36; 'too': 0.36; 'enough': 0.36; 'does': 0.37; 'why': 0.37; 'item': 0.37; 'received:209': 0.37; 'far': 0.37; 'data': 0.37; 'subject:: ': 0.38; 'mean': 0.38; 'some': 0.38; 'performance': 0.39; 'takes': 0.39; 'called': 0.39; 'where': 0.40; 'end': 0.40; 'think': 0.40; 'your': 0.60; 'remove': 0.61; 'map': 0.61; 'first': 0.61; 'interest': 0.62; 'time,': 0.62; 'email addr:gmail.com': 0.63; 'more': 0.63; 'within': 0.64; 'making': 0.64; 'here': 0.65; '20,': 0.65; 'want,': 0.65; 'hours': 0.66; 'hour': 0.69; 'cut': 0.71; 'column.': 0.84; 'correctly?': 0.84; 'judicious': 0.84; 'nearby': 0.84; '360': 0.91; 'faster.': 0.91; 'fragment': 0.91; 'angel': 0.93 Newsgroups: comp.lang.python Date: Thu, 20 Dec 2012 17:46:38 -0800 (PST) In-Reply-To: Complaints-To: groups-abuse@google.com Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=68.84.146.219; posting-account=aFD2wgkAAACT3OnBYoNKQGBzyOZ_PB2h References: User-Agent: G2/1.0 X-Google-Web-Client: true X-Google-IP: 68.84.146.219 MIME-Version: 1.0 Subject: Re: help with making my code more efficient From: "Larry.Martell@gmail.com" To: comp.lang.python@googlegroups.com Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: python-list@python.org, d@davea.name X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Message-ID: Lines: 148 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1356054403 news.xs4all.nl 6892 [2001:888:2000:d::a6]:47986 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:35257 On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: > On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote: >=20 > > I have a list of tuples that contains a tool_id, a time, and a message.= I want to select from this list all the elements where the message matches= some string, and all the other elements where the time is within some diff= of any matching message for that tool.=20 >=20 > > Here is how I am currently doing this: >=20 > No, it's not. This is a fragment of code, without enough clues as to >=20 > what else is going. We can guess, but that's likely to make a mess. Of course it's a fragment - it's part of a large program and I was just sho= wing the relevant parts.=20 > First question is whether this code works exactly correctly? =20 Yes, the code works. I end up with just the rows I want.=20 > Are you only concerned about speed, not fixing features? =20 Don't know what you mean by 'fixing features'. The code does what I want, i= t just takes too long. > As far as I can tell, the logic that includes the time comparison is bogu= s. =20 Not at all.=20 > You don't do anything there to worry about the value of tup[2], just whe= ther some > item has a nearby time. Of course, I could misunderstand the spec. The data comes from a database. tup[2] is a datetime column. tdiff comes fr= om a datetime.timedelta()=20 > Are you making a global called 'self' ? That name is by convention only > used in methods to designate the instance object. What's the attribute > self? Yes, self is my instance object. self.message contains the string of intere= st that I need to look for.=20 > Can cdata have duplicates, and are they significant?=20 No, it will not have duplicates. > Are you including the time building that as part of your 2 hour measurem= ent?=20 No, the 2 hours is just the time to run the=20 cdata[:] =3D [tup for tup in cdata if determine(tup)] > Is the list sorted in any way? Yes, the list is sorted by tool and datetime. > Chances are your performance bottleneck is the doubly-nested loop. You > have a list comprehension at top-level code, and inside it calls a > function that also loops over the 600,000 items. So the inner loop gets > executed 360 billion times. You can cut this down drastically by some > judicious sorting, as well as by having a map of lists, where the map is > keyed by the tool. Thanks. I will try that. > > # record time for each message matching the specified message for each = tool >=20 > > messageTimes =3D {} > > You're building a dictionary; are you actually using the value (1), or > is only the key relevant? =20 Only the keys. > A set is a dict without a value. =20 Yes, I could use a set, but I don't think that would make it measurably fas= ter.=20 > But more mportantly, you never look up anything in this dictionary. So w= hy > isn't it a list? For that matter, why don't you just use the > messageTimes list? Yes, it could be a list too. =20 > > for row in cdata: # tool, time, message >=20 > > if self.message in row[2]: >=20 > > messageTimes[row[0], row[1]] =3D 1 >=20 > > >=20 > > # now pull out each message that is within the time diff for each match= ed message >=20 > > # as well as the matched messages themselves >=20 > > >=20 > > def determine(tup): >=20 > > if self.message in tup[2]: return True # matched message=20 >=20 > > >=20 > > for (tool, date_time) in messageTimes: >=20 > > if tool =3D=3D tup[0]: >=20 > > if abs(date_time-tup[1]) <=3D tdiff:=20 >=20 > > return True >=20 > > >=20 > > return False >=20 > > =20 >=20 > > cdata[:] =3D [tup for tup in cdata if determine(tup)] >=20 >=20 >=20 > As the code exists, there's no need to copy the list. Just do a simple > bind. This statement is to remove the items from cdata that I don't want. I don't= know what you mean by bind. I'm not familiar with that python function.=20 >=20 >=20 >=20 > > >=20 > > This code works, but it takes way too long to run - e.g. when cdata has= 600,000 elements (which is typical for my app) it takes 2 hours for this t= o run.=20 >=20 > > >=20 > > Can anyone give me some suggestions on speeding this up?