Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!eweka.nl!lightspeed.eweka.nl!194.134.4.91.MISMATCH!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.000 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'attribute': 0.05; 'diff': 0.05; 'executed': 0.07; 'list?': 0.07; 'matches': 0.07; 'subject:code': 0.07; 'tool,': 0.07; 'works.': 0.07; 'subject:help': 0.07; 'python': 0.09; '(1,': 0.09; 'dict': 0.09; 'exists,': 0.09; 'keyed': 0.09; 'logic': 0.09; 'loop.': 0.09; 'matched': 0.09; 'optimizing': 0.09; 'python:': 0.09; 'rows': 0.09; 'def': 0.10; 'value.': 0.15; '(1),': 0.16; 'better:': 0.16; 'clues': 0.16; 'dictionary.': 0.16; 'keys.': 0.16; 'loops': 0.16; 'parts.': 0.16; 'received:74.55.86': 0.16; 'received:74.55.86.74': 0.16; 'received:smtp.webfaction.com': 0.16; 'received:webfaction.com': 0.16; 'row': 0.16; 'run.': 0.16; 'set,': 0.16; 'subject:making': 0.16; 'tool.': 0.16; 'top-level': 0.16; 'tup': 0.16; 'string': 0.17; 'wrote:': 0.17; 'instance': 0.17; 'items.': 0.17; 'string,': 0.17; 'typical': 0.17; '>>>': 0.18; 'code,': 0.18; 'thanks.': 0.21; 'fixing': 0.22; 'object.': 0.22; 'simpler': 0.22; 'tuples': 0.22; 'elements': 0.23; 'matching': 0.23; 'specified': 0.23; 'statement': 0.23; 'this:': 0.23; 'header:In-Reply-To:1': 0.25; 'header:User-Agent:1': 0.26; '(which': 0.26; 'select': 0.26; 'question': 0.27; 'convention': 0.27; 'all.': 0.28; 'run': 0.28; 'record': 0.28; '>>>>': 0.29; 'changes:': 0.29; 'comparison': 0.29; 'long.': 0.29; 'way?': 0.29; 'no,': 0.29; 'probably': 0.29; "i'm": 0.29; 'that.': 0.30; 'e.g.': 0.30; 'query': 0.30; 'thursday,': 0.30; 'function': 0.30; 'code': 0.31; 'gets': 0.32; 'december': 0.32; 'not.': 0.32; 'could': 0.32; 'message.': 0.33; 'function.': 0.33; 'anyone': 0.33; 'to:addr :python-list': 0.33; 'likely': 0.33; 'themselves': 0.33; 'times.': 0.33; 'skip:d 20': 0.34; 'version': 0.34; 'self': 0.34; 'list': 0.35; 'false': 0.35; 'faster': 0.35; 'doing': 0.35; 'pm,': 0.35; 'too.': 0.35; 'there': 0.35; 'list.': 0.35; 'really': 0.36; 'tool': 0.36; 'but': 0.36; 'anything': 0.36; 'subject:with': 0.36; 'should': 0.36; 'too': 0.36; 'enough': 0.36; 'does': 0.37; 'why': 0.37; 'item': 0.37; 'far': 0.37; 'data': 0.37; 'subject:: ': 0.38; 'easier': 0.38; 'mean': 0.38; 'some': 0.38; 'performance': 0.39; 'to:addr:python.org': 0.39; 'takes': 0.39; 'received:192': 0.39; 'called': 0.39; 'map': 0.61; 'first': 0.61; 'interest': 0.62; 'time,': 0.62; 'email addr:gmail.com': 0.63; 'more': 0.63; 'within': 0.64; 'making': 0.64; 'here': 0.65; '20,': 0.65; 'want,': 0.65; 'hours': 0.66; 'hour': 0.69; 'cut': 0.71; 'sounds': 0.71; 'actually,': 0.84; 'column.': 0.84; 'correctly?': 0.84; 'experiment': 0.84; 'judicious': 0.84; 'nearby': 0.84; '360': 0.91; 'faster.': 0.91; 'fragment': 0.91; 'angel': 0.93 Date: Thu, 20 Dec 2012 21:49:43 -0500 From: Mitya Sirenef User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121011 Thunderbird/16.0.1 MIME-Version: 1.0 To: python-list@python.org Subject: Re: help with making my code more efficient References: <50D3CBCD.4080203@lightbird.net> In-Reply-To: <50D3CBCD.4080203@lightbird.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 145 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1356058187 news.xs4all.nl 6889 [2001:888:2000:d::a6]:42865 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:35262 On 12/20/2012 09:39 PM, Mitya Sirenef wrote: > On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote: >> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: >>> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote: >>> >>>> I have a list of tuples that contains a tool_id, a time, and a >>>> message. I want to select from this list all the elements where the >>>> message matches some string, and all the other elements where the >>>> time is within some diff of any matching message for that tool. >>>> Here is how I am currently doing this: >>> No, it's not. This is a fragment of code, without enough clues as to >>> >>> what else is going. We can guess, but that's likely to make a mess. >> Of course it's a fragment - it's part of a large program and I was >> just showing the relevant parts. >> >>> First question is whether this code works exactly correctly? >> Yes, the code works. I end up with just the rows I want. >> >>> Are you only concerned about speed, not fixing features? >> Don't know what you mean by 'fixing features'. The code does what I >> want, it just takes too long. >> >>> As far as I can tell, the logic that includes the time comparison is >>> bogus. >> Not at all. >> >>> You don't do anything there to worry about the value of tup[2], >>> just whether some >>> item has a nearby time. Of course, I could misunderstand the spec. >> The data comes from a database. tup[2] is a datetime column. tdiff >> comes from a datetime.timedelta() >> >>> Are you making a global called 'self' ? That name is by convention only >>> used in methods to designate the instance object. What's the attribute >>> self? >> Yes, self is my instance object. self.message contains the string of >> interest that I need to look for. >> >>> Can cdata have duplicates, and are they significant? >> No, it will not have duplicates. >> >>> Are you including the time building that as part of your 2 hour >>> measurement? >> No, the 2 hours is just the time to run the >> >> cdata[:] = [tup for tup in cdata if determine(tup)] >> >>> Is the list sorted in any way? >> Yes, the list is sorted by tool and datetime. >> >>> Chances are your performance bottleneck is the doubly-nested loop. You >>> have a list comprehension at top-level code, and inside it calls a >>> function that also loops over the 600,000 items. So the inner loop >>> gets >>> executed 360 billion times. You can cut this down drastically by some >>> judicious sorting, as well as by having a map of lists, where the >>> map is >>> keyed by the tool. >> Thanks. I will try that. >> >>>> # record time for each message matching the specified message for >>>> each tool >>>> messageTimes = {} >>> You're building a dictionary; are you actually using the value (1), or >>> is only the key relevant? >> Only the keys. >> >>> A set is a dict without a value. >> Yes, I could use a set, but I don't think that would make it >> measurably faster. >> >>> But more mportantly, you never look up anything in this dictionary. >>> So why >>> isn't it a list? For that matter, why don't you just use the >>> messageTimes list? >> Yes, it could be a list too. >>>> for row in cdata: # tool, time, message >>>> if self.message in row[2]: >>>> messageTimes[row[0], row[1]] = 1 >>>> # now pull out each message that is within the time diff for each >>>> matched message >>>> # as well as the matched messages themselves >>>> def determine(tup): >>>> if self.message in tup[2]: return True # matched message >>>> for (tool, date_time) in messageTimes: >>>> if tool == tup[0]: >>>> if abs(date_time-tup[1]) <= tdiff: >>>> return True >>>> return False >>>> cdata[:] = [tup for tup in cdata if determine(tup)] >>> >>> >>> As the code exists, there's no need to copy the list. Just do a simple >>> bind. >> This statement is to remove the items from cdata that I don't want. I >> don't know what you mean by bind. I'm not familiar with that python >> function. >> >>> >>> >>>> This code works, but it takes way too long to run - e.g. when cdata >>>> has 600,000 elements (which is typical for my app) it takes 2 hours >>>> for this to run. >>>> Can anyone give me some suggestions on speeding this up? > > > This code probably is not faster but it's simpler and may be easier > for you to work with > to experiment with speed-improving changes: > > > diffrng = 1 > > L = [ > # id, time, string > (1, 5, "ok"), > (1, 6, "ok"), > (1, 7, "no"), > (1, 8, "no"), > ] > > match_times = [t[1] for t in L if "ok" in t[2]] > > def in_range(timeval): > return bool( min([abs(timeval-v) for v in match_times]) <= diffrng ) > > print([t for t in L if in_range(t[1])]) > > > But it really sounds like you could look into optimizing the db > query and db indexes, etc. > > Actually, it might be slower.. this version of in_range should be better: def in_range(timeval): return any( abs(timeval-v) <= diffrng for v in match_times ) -- Lark's Tongue Guide to Python: http://lightbird.net/larks/