Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #35242 > unrolled thread

help with making my code more efficient

Started by"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
First post2012-12-20 16:19 -0800
Last post2012-12-21 02:08 +0000
Articles 20 on this page of 23 — 6 participants

Back to article view | Back to comp.lang.python


Contents

  help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:19 -0800
    Re: help with making my code more efficient Chris Angelico <rosuav@gmail.com> - 2012-12-21 11:38 +1100
      Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:43 -0800
        Re: help with making my code more efficient Chris Angelico <rosuav@gmail.com> - 2012-12-21 13:56 +1100
        Re: help with making my code more efficient Roy Smith <roy@panix.com> - 2012-12-20 22:30 -0500
      Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 16:43 -0800
    Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-20 20:17 -0500
      Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 17:46 -0800
        Re: help with making my code more efficient Mitya Sirenef <msirenef@lightbird.net> - 2012-12-20 21:39 -0500
        Re: help with making my code more efficient Mitya Sirenef <msirenef@lightbird.net> - 2012-12-20 21:49 -0500
        Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-20 22:31 -0500
          Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 09:57 -0800
          Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 09:57 -0800
            Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 12:36 -0800
              Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-21 22:19 -0500
                Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 20:47 -0800
                  Re: help with making my code more efficient Dave Angel <d@davea.name> - 2012-12-22 01:47 -0500
                    Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-24 09:57 -0800
                    Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-24 09:57 -0800
                Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 20:47 -0800
            Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-21 12:36 -0800
      Re: help with making my code more efficient "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> - 2012-12-20 17:46 -0800
    Re: help with making my code more efficient MRAB <python@mrabarnett.plus.com> - 2012-12-21 02:08 +0000

Page 1 of 2  [1] 2  Next page →


#35242 — help with making my code more efficient

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-20 16:19 -0800
Subjecthelp with making my code more efficient
Message-ID<bd5ae6b7-2440-42e4-a93c-eb877feebcfe@googlegroups.com>
I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool. 

Here is how I am currently doing this:

# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:   # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves

def determine(tup):
    if self.message in tup[2]: return True      # matched message 

    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time-tup[1]) <= tdiff: 
               return True

    return False
        
cdata[:] = [tup for tup in cdata if determine(tup)]

This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run. 

Can anyone give me some suggestions on speeding this up?

TIA!

[toc] | [next] | [standalone]


#35246

FromChris Angelico <rosuav@gmail.com>
Date2012-12-21 11:38 +1100
Message-ID<mailman.1123.1356050292.29569.python-list@python.org>
In reply to#35242
On Fri, Dec 21, 2012 at 11:19 AM, Larry.Martell@gmail.com
<Larry.Martell@gmail.com> wrote:
> This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
>
> Can anyone give me some suggestions on speeding this up?
>

It sounds like you may have enough data to want to not keep it all in
memory. Have you considered switching to a database? You could then
execute SQL queries against it.

ChrisA

[toc] | [prev] | [next] | [standalone]


#35247

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-20 16:43 -0800
Message-ID<cc869959-c568-4490-b45f-7855c6841575@googlegroups.com>
In reply to#35246
On Thursday, December 20, 2012 5:38:03 PM UTC-7, Chris Angelico wrote:
> On Fri, Dec 21, 2012 at 11:19 AM, Larry.Martell@gmail.com
> 
> <Larry.Martell@gmail.com> wrote:
> 
> > This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
> 
> >
> 
> > Can anyone give me some suggestions on speeding this up?
> 
> >
> 
> 
> 
> It sounds like you may have enough data to want to not keep it all in
> 
> memory. Have you considered switching to a database? You could then
> 
> execute SQL queries against it.

It came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python. 

[toc] | [prev] | [next] | [standalone]


#35263

FromChris Angelico <rosuav@gmail.com>
Date2012-12-21 13:56 +1100
Message-ID<mailman.1133.1356058620.29569.python-list@python.org>
In reply to#35247
On Fri, Dec 21, 2012 at 11:43 AM, Larry.Martell@gmail.com
<Larry.Martell@gmail.com> wrote:
> It came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python.

Then the best thing to do is figure out how to solve your problem in
SQL. Any decent database engine will be able to optimize that
beautifully, and without multiple recursive searches. You may need to
create an index, but maybe not even that.

I can't speak for other engines, but PostgreSQL has an excellently
helpful mailing list, if you have problems with that side of it. But
have a shot at writing the SQL; chances are it'll work out easily.

ChrisA

[toc] | [prev] | [next] | [standalone]


#35265

FromRoy Smith <roy@panix.com>
Date2012-12-20 22:30 -0500
Message-ID<roy-46E00A.22303120122012@news.panix.com>
In reply to#35247
In article <cc869959-c568-4490-b45f-7855c6841575@googlegroups.com>,
 "Larry.Martell@gmail.com" <Larry.Martell@gmail.com> wrote:

> On Thursday, December 20, 2012 5:38:03 PM UTC-7, Chris Angelico wrote:
> > On Fri, Dec 21, 2012 at 11:19 AM, Larry.Martell@gmail.com
> > 
> > <Larry.Martell@gmail.com> wrote:
> > 
> > > This code works, but it takes way too long to run - e.g. when cdata has 
> > > 600,000 elements (which is typical for my app) it takes 2 hours for this 
> > > to run.
> > 
> > >
> > 
> > > Can anyone give me some suggestions on speeding this up?
> > 
> > >
> > 
> > 
> > 
> > It sounds like you may have enough data to want to not keep it all in
> > 
> > memory. Have you considered switching to a database? You could then
> > 
> > execute SQL queries against it.
> 
> It came from a database. Originally I was getting just the data I wanted 
> using SQL, but that was taking too long also. I was selecting just the 
> messages I wanted, then for each one of those doing another query to get the 
> data within the time diff of each. That was resulting in tens of thousands of 
> queries. So I changed it to pull all the potential matches at once and then 
> process it in python. 

If you're doing free-text matching, an SQL database may not be the right 
tool.  I suspect you want to be looking at some kind of text search 
engine, such as http://lucene.apache.org/ or http://xapian.org/.

[toc] | [prev] | [next] | [standalone]


#35254

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-20 16:43 -0800
Message-ID<mailman.1127.1356053559.29569.python-list@python.org>
In reply to#35246
On Thursday, December 20, 2012 5:38:03 PM UTC-7, Chris Angelico wrote:
> On Fri, Dec 21, 2012 at 11:19 AM, Larry.Martell@gmail.com
> 
> <Larry.Martell@gmail.com> wrote:
> 
> > This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
> 
> >
> 
> > Can anyone give me some suggestions on speeding this up?
> 
> >
> 
> 
> 
> It sounds like you may have enough data to want to not keep it all in
> 
> memory. Have you considered switching to a database? You could then
> 
> execute SQL queries against it.

It came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python. 

[toc] | [prev] | [next] | [standalone]


#35252

FromDave Angel <d@davea.name>
Date2012-12-20 20:17 -0500
Message-ID<mailman.1126.1356052648.29569.python-list@python.org>
In reply to#35242
On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
> I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool. 
>
> Here is how I am currently doing this:

No, it's not.  This is a fragment of code, without enough clues as to
what else is going.  We can guess, but that's likely to make a mess.

First question is whether this code works exactly correctly?  Are you
only concerned about speed, not fixing features?  As far as I can tell,
the logic that includes the time comparison is bogus.  You don't do
anything there to worry about the value of tup[2], just whether some
item has a nearby time.  Of course, I could misunderstand the spec.

Are you making a global called 'self' ?  That name is by convention only
used in methods to designate the instance object.  What's the attribute
self?

Can cdata have duplicates, and are they significant?  Are you including
the time building that as part of your 2 hour measurement?  Is the list
sorted in any way?

Chances are your performance bottleneck is the doubly-nested loop.  You
have a list comprehension at top-level code, and inside it calls a
function that also loops over the 600,000 items.  So the inner loop gets
executed 360 billion times.  You can cut this down drastically by some
judicious sorting, as well as by having a map of lists, where the map is
keyed by the tool.

>
> # record time for each message matching the specified message for each tool
> messageTimes = {}

You're building a dictionary;  are you actually using the value (1), or
is only the key relevant?  A set is a dict without a value.  But more
importantly, you never look up anything in this dictionary.  So why
isn't it a list?  For that matter, why don't you just use the
messageTimes list?

> for row in cdata:   # tool, time, message
>     if self.message in row[2]:
>         messageTimes[row[0], row[1]] = 1
>
> # now pull out each message that is within the time diff for each matched message
> # as well as the matched messages themselves
>
> def determine(tup):
>     if self.message in tup[2]: return True      # matched message 
>
>     for (tool, date_time) in messageTimes:
>         if tool == tup[0]:
>             if abs(date_time-tup[1]) <= tdiff: 
>                return True
>
>     return False
>         
> cdata[:] = [tup for tup in cdata if determine(tup)]

As the code exists, there's no need to copy the list.  Just do a simple
bind.

>
> This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run. 
>
> Can anyone give me some suggestions on speeding this up?
>
> TIA!


-- 

DaveA

[toc] | [prev] | [next] | [standalone]


#35256

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-20 17:46 -0800
Message-ID<d6aaa5b5-7d21-4018-ba9a-ea354b15b6c5@googlegroups.com>
In reply to#35252
On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
> 
> > I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool. 
> 
> > Here is how I am currently doing this:
> 
> No, it's not.  This is a fragment of code, without enough clues as to
> 
> what else is going.  We can guess, but that's likely to make a mess.

Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. 

> First question is whether this code works exactly correctly?  

Yes, the code works. I end up with just the rows I want. 

> Are you only concerned about speed, not fixing features?  

Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.

> As far as I can tell, the logic that includes the time comparison is bogus.  

Not at all. 

> You don't do  anything there to worry about the value of tup[2], just whether some
> item has a nearby time.  Of course, I could misunderstand the spec.

The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() 

> Are you making a global called 'self' ?  That name is by convention only
> used in methods to designate the instance object.  What's the attribute
> self?

Yes, self is my instance object. self.message contains the string of interest that I need to look for. 

> Can cdata have duplicates, and are they significant? 

No, it will not have duplicates.

> Are you including  the time building that as part of your 2 hour measurement? 

No, the 2 hours is just the time to run the 

cdata[:] = [tup for tup in cdata if determine(tup)]

> Is the list sorted in any way?

Yes, the list is sorted by tool and datetime.

> Chances are your performance bottleneck is the doubly-nested loop.  You
> have a list comprehension at top-level code, and inside it calls a
> function that also loops over the 600,000 items.  So the inner loop gets
> executed 360 billion times.  You can cut this down drastically by some
> judicious sorting, as well as by having a map of lists, where the map is
> keyed by the tool.

Thanks. I will try that.

> > # record time for each message matching the specified message for each tool
> 
> > messageTimes = {}
>
> You're building a dictionary;  are you actually using the value (1), or
>  is only the key relevant?  

Only the keys.

> A set is a dict without a value.  

Yes, I could use a set, but I don't think that would make it measurably faster. 

> But more mportantly, you never look up anything in this dictionary.  So why
> isn't it a list?  For that matter, why don't you just use the
> messageTimes list?

Yes, it could be a list too.
 
> > for row in cdata:   # tool, time, message
> 
> >     if self.message in row[2]:
> 
> >         messageTimes[row[0], row[1]] = 1
> 
> >
> 
> > # now pull out each message that is within the time diff for each matched message
> 
> > # as well as the matched messages themselves
> 
> >
> 
> > def determine(tup):
> 
> >     if self.message in tup[2]: return True      # matched message 
> 
> >
> 
> >     for (tool, date_time) in messageTimes:
> 
> >         if tool == tup[0]:
> 
> >             if abs(date_time-tup[1]) <= tdiff: 
> 
> >                return True
> 
> >
> 
> >     return False
> 
> >         
> 
> > cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> 
> 
> As the code exists, there's no need to copy the list.  Just do a simple
> bind.

This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. 

> 
> 
> 
> >
> 
> > This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run. 
> 
> >
> 
> > Can anyone give me some suggestions on speeding this up?

[toc] | [prev] | [next] | [standalone]


#35261

FromMitya Sirenef <msirenef@lightbird.net>
Date2012-12-20 21:39 -0500
Message-ID<mailman.1131.1356057593.29569.python-list@python.org>
In reply to#35256
On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote:
> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
>> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
>>
>>> I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
>>> Here is how I am currently doing this:
>> No, it's not.  This is a fragment of code, without enough clues as to
>>
>> what else is going.  We can guess, but that's likely to make a mess.
> Of course it's a fragment - it's part of a large program and I was just showing the relevant parts.
>
>> First question is whether this code works exactly correctly?
> Yes, the code works. I end up with just the rows I want.
>
>> Are you only concerned about speed, not fixing features?
> Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.
>
>> As far as I can tell, the logic that includes the time comparison is bogus.
> Not at all.
>
>> You don't do  anything there to worry about the value of tup[2], just whether some
>> item has a nearby time.  Of course, I could misunderstand the spec.
> The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta()
>
>> Are you making a global called 'self' ?  That name is by convention only
>> used in methods to designate the instance object.  What's the attribute
>> self?
> Yes, self is my instance object. self.message contains the string of interest that I need to look for.
>
>> Can cdata have duplicates, and are they significant?
> No, it will not have duplicates.
>
>> Are you including  the time building that as part of your 2 hour measurement?
> No, the 2 hours is just the time to run the
>
> cdata[:] = [tup for tup in cdata if determine(tup)]
>
>> Is the list sorted in any way?
> Yes, the list is sorted by tool and datetime.
>
>> Chances are your performance bottleneck is the doubly-nested loop.  You
>> have a list comprehension at top-level code, and inside it calls a
>> function that also loops over the 600,000 items.  So the inner loop gets
>> executed 360 billion times.  You can cut this down drastically by some
>> judicious sorting, as well as by having a map of lists, where the map is
>> keyed by the tool.
> Thanks. I will try that.
>
>>> # record time for each message matching the specified message for each tool
>>> messageTimes = {}
>> You're building a dictionary;  are you actually using the value (1), or
>>   is only the key relevant?
> Only the keys.
>
>> A set is a dict without a value.
> Yes, I could use a set, but I don't think that would make it measurably faster.
>
>> But more mportantly, you never look up anything in this dictionary.  So why
>> isn't it a list?  For that matter, why don't you just use the
>> messageTimes list?
> Yes, it could be a list too.
>   
>>> for row in cdata:   # tool, time, message
>>>      if self.message in row[2]:
>>>          messageTimes[row[0], row[1]] = 1
>>> # now pull out each message that is within the time diff for each matched message
>>> # as well as the matched messages themselves
>>> def determine(tup):
>>>      if self.message in tup[2]: return True      # matched message
>>>      for (tool, date_time) in messageTimes:
>>>          if tool == tup[0]:
>>>              if abs(date_time-tup[1]) <= tdiff:
>>>                 return True
>>>      return False
>>>          
>>> cdata[:] = [tup for tup in cdata if determine(tup)]
>>
>>
>> As the code exists, there's no need to copy the list.  Just do a simple
>> bind.
> This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function.
>
>>
>>
>>> This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
>>> Can anyone give me some suggestions on speeding this up?


This code probably is not faster but it's simpler and may be easier for 
you to work with
to experiment with speed-improving changes:


diffrng = 1

L = [
      # id, time, string
      (1, 5, "ok"),
      (1, 6, "ok"),
      (1, 7, "no"),
      (1, 8, "no"),
      ]

match_times = [t[1] for t in L if "ok" in t[2]]

def in_range(timeval):
     return bool( min([abs(timeval-v) for v in match_times]) <= diffrng )

print([t for t in L if in_range(t[1])])


But it really sounds like you could look into optimizing the db
query and db indexes, etc.


-- 
Lark's Tongue Guide to Python: http://lightbird.net/larks/

[toc] | [prev] | [next] | [standalone]


#35262

FromMitya Sirenef <msirenef@lightbird.net>
Date2012-12-20 21:49 -0500
Message-ID<mailman.1132.1356058187.29569.python-list@python.org>
In reply to#35256
On 12/20/2012 09:39 PM, Mitya Sirenef wrote:
> On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote:
>> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
>>> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
>>>
>>>> I have a list of tuples that contains a tool_id, a time, and a 
>>>> message. I want to select from this list all the elements where the 
>>>> message matches some string, and all the other elements where the 
>>>> time is within some diff of any matching message for that tool.
>>>> Here is how I am currently doing this:
>>> No, it's not.  This is a fragment of code, without enough clues as to
>>>
>>> what else is going.  We can guess, but that's likely to make a mess.
>> Of course it's a fragment - it's part of a large program and I was 
>> just showing the relevant parts.
>>
>>> First question is whether this code works exactly correctly?
>> Yes, the code works. I end up with just the rows I want.
>>
>>> Are you only concerned about speed, not fixing features?
>> Don't know what you mean by 'fixing features'. The code does what I 
>> want, it just takes too long.
>>
>>> As far as I can tell, the logic that includes the time comparison is 
>>> bogus.
>> Not at all.
>>
>>> You don't do  anything there to worry about the value of tup[2], 
>>> just whether some
>>> item has a nearby time.  Of course, I could misunderstand the spec.
>> The data comes from a database. tup[2] is a datetime column. tdiff 
>> comes from a datetime.timedelta()
>>
>>> Are you making a global called 'self' ? That name is by convention only
>>> used in methods to designate the instance object.  What's the attribute
>>> self?
>> Yes, self is my instance object. self.message contains the string of 
>> interest that I need to look for.
>>
>>> Can cdata have duplicates, and are they significant?
>> No, it will not have duplicates.
>>
>>> Are you including  the time building that as part of your 2 hour 
>>> measurement?
>> No, the 2 hours is just the time to run the
>>
>> cdata[:] = [tup for tup in cdata if determine(tup)]
>>
>>> Is the list sorted in any way?
>> Yes, the list is sorted by tool and datetime.
>>
>>> Chances are your performance bottleneck is the doubly-nested loop.  You
>>> have a list comprehension at top-level code, and inside it calls a
>>> function that also loops over the 600,000 items.  So the inner loop 
>>> gets
>>> executed 360 billion times.  You can cut this down drastically by some
>>> judicious sorting, as well as by having a map of lists, where the 
>>> map is
>>> keyed by the tool.
>> Thanks. I will try that.
>>
>>>> # record time for each message matching the specified message for 
>>>> each tool
>>>> messageTimes = {}
>>> You're building a dictionary;  are you actually using the value (1), or
>>>   is only the key relevant?
>> Only the keys.
>>
>>> A set is a dict without a value.
>> Yes, I could use a set, but I don't think that would make it 
>> measurably faster.
>>
>>> But more mportantly, you never look up anything in this dictionary.  
>>> So why
>>> isn't it a list?  For that matter, why don't you just use the
>>> messageTimes list?
>> Yes, it could be a list too.
>>>> for row in cdata:   # tool, time, message
>>>>      if self.message in row[2]:
>>>>          messageTimes[row[0], row[1]] = 1
>>>> # now pull out each message that is within the time diff for each 
>>>> matched message
>>>> # as well as the matched messages themselves
>>>> def determine(tup):
>>>>      if self.message in tup[2]: return True      # matched message
>>>>      for (tool, date_time) in messageTimes:
>>>>          if tool == tup[0]:
>>>>              if abs(date_time-tup[1]) <= tdiff:
>>>>                 return True
>>>>      return False
>>>>          cdata[:] = [tup for tup in cdata if determine(tup)]
>>>
>>>
>>> As the code exists, there's no need to copy the list.  Just do a simple
>>> bind.
>> This statement is to remove the items from cdata that I don't want. I 
>> don't know what you mean by bind. I'm not familiar with that python 
>> function.
>>
>>>
>>>
>>>> This code works, but it takes way too long to run - e.g. when cdata 
>>>> has 600,000 elements (which is typical for my app) it takes 2 hours 
>>>> for this to run.
>>>> Can anyone give me some suggestions on speeding this up?
>
>
> This code probably is not faster but it's simpler and may be easier 
> for you to work with
> to experiment with speed-improving changes:
>
>
> diffrng = 1
>
> L = [
>      # id, time, string
>      (1, 5, "ok"),
>      (1, 6, "ok"),
>      (1, 7, "no"),
>      (1, 8, "no"),
>      ]
>
> match_times = [t[1] for t in L if "ok" in t[2]]
>
> def in_range(timeval):
>     return bool( min([abs(timeval-v) for v in match_times]) <= diffrng )
>
> print([t for t in L if in_range(t[1])])
>
>
> But it really sounds like you could look into optimizing the db
> query and db indexes, etc.
>
>

Actually, it might be slower.. this version of in_range should be better:

def in_range(timeval):
     return any( abs(timeval-v) <= diffrng for v in match_times )



-- 
Lark's Tongue Guide to Python: http://lightbird.net/larks/

[toc] | [prev] | [next] | [standalone]


#35266

FromDave Angel <d@davea.name>
Date2012-12-20 22:31 -0500
Message-ID<mailman.1134.1356060700.29569.python-list@python.org>
In reply to#35256
On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote:
> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
>> <snip>
> Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. 
But it seems these are methods in a class, or something, so we're
missing context.  And you use self without it being an argument to the
function.  Like it's a global.
> <snip>
> Yes, the code works. I end up with just the rows I want.
>> Are you only concerned about speed, not fixing features?  
> Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.
>
>> As far as I can tell, the logic that includes the time comparison is bogus.  
> Not at all. 
>
>> You don't do  anything there to worry about the value of tup[2], just whether some
>> item has a nearby time.  Of course, I could misunderstand the spec.
> The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() 
I thought that tup[1] was the datetime.  In any case, the loop makes no
sense to me, so I can't really optimize it, just make suggestions.
>
>> Are you making a global called 'self' ?  That name is by convention only
>> used in methods to designate the instance object.  What's the attribute
>> self?
> Yes, self is my instance object. self.message contains the string of interest that I need to look for. 
>
>> Can cdata have duplicates, and are they significant? 
> No, it will not have duplicates.
>
>> Is the list sorted in any way?
> Yes, the list is sorted by tool and datetime.
>
>> Chances are your performance bottleneck is the doubly-nested loop.  You
>> have a list comprehension at top-level code, and inside it calls a
>> function that also loops over the 600,000 items.  So the inner loop gets
>> executed 360 billion times.  You can cut this down drastically by some
>> judicious sorting, as well as by having a map of lists, where the map is
>> keyed by the tool.
> Thanks. I will try that.

So in your first loop, you could simply split the list into separate
lists, one per tup[0] value, and the lists as dictionary items, keyed by
that tool string.

Then inside the determine() function, make a local ref to the particular
list for the tool.
   recs = messageTimes[tup[0]]

Instead of a for loop over recs, use a binary search to identify the
first item that's >= date_time-tdiff.  Then if it's less than
date_time+tdiff, return True, otherwise False.  Check out the bisect
module.  Function bisect_left() should do what you want in a sorted list.


>>> cdata[:] = [tup for tup in cdata if determine(tup)]
>>
>>
>> As the code exists, there's no need to copy the list.  Just do a simple
>> bind.
> This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. 

Every "assignment" to a simple name is really a rebinding of that name.

cdata = [tup for tup in cdata if determine(tup)]

will rebind the name to the new object, much quicker than copying.  If
this is indeed a top-level line, it should be equivalent.  But if in
fact this is inside some other function, it may violate some other
assumptions.  In particular, if there are other names for the same
object, then you're probably stuck with modifying it in place, using
slice notation.

BTW, a set is generally much more memory efficient than a dict, when you
don't use the "value".  But since I think you'll be better off with a
dict of lists, it's a moot point.

-- 

DaveA

[toc] | [prev] | [next] | [standalone]


#35310

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-21 09:57 -0800
Message-ID<0eb7fdc9-a733-4600-8e39-84caed3e37ac@googlegroups.com>
In reply to#35266
On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote:
> 
> > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> 
> >> <snip>
> 
> > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. 
> 
> But it seems these are methods in a class, or something, so we're
> missing context.  And you use self without it being an argument to the
> function.  Like it's a global.

I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as:

    def generate_data(self):

> 
> > <snip>
> 
> > Yes, the code works. I end up with just the rows I want.
> 
> >> Are you only concerned about speed, not fixing features?  
> 
> > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.
> 
> >
> 
> >> As far as I can tell, the logic that includes the time comparison is bogus.  
> 
> > Not at all. 
> 
> >
> 
> >> You don't do  anything there to worry about the value of tup[2], just whether some
> 
> >> item has a nearby time.  Of course, I could misunderstand the spec.
> 
> > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() 
> 
> I thought that tup[1] was the datetime.  In any case, the loop makes no
> sense to me, so I can't really optimize it, just make suggestions.

Yes, tup[1] is the datetime. I mistyped last night. 

> >> Are you making a global called 'self' ?  That name is by convention only
> 
> >> used in methods to designate the instance object.  What's the attribute
> 
> >> self?
> 
> > Yes, self is my instance object. self.message contains the string of interest that I need to look for. 
> 
> >
> 
> >> Can cdata have duplicates, and are they significant? 
> 
> > No, it will not have duplicates.
> 
> >
> 
> >> Is the list sorted in any way?
> 
> > Yes, the list is sorted by tool and datetime.
> 
> >
> 
> >> Chances are your performance bottleneck is the doubly-nested loop.  You
> 
> >> have a list comprehension at top-level code, and inside it calls a
> 
> >> function that also loops over the 600,000 items.  So the inner loop gets
> 
> >> executed 360 billion times.  You can cut this down drastically by some
> 
> >> judicious sorting, as well as by having a map of lists, where the map is
> 
> >> keyed by the tool.
> 
> > Thanks. I will try that.
> 
> 
> 
> So in your first loop, you could simply split the list into separate
> lists, one per tup[0] value, and the lists as dictionary items, keyed by
> that tool string.
> 
> Then inside the determine() function, make a local ref to the particular
> 
> list for the tool.
> 
>    recs = messageTimes[tup[0]]

I made that change ant went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app. 

> Instead of a for loop over recs, use a binary search to identify the
> first item that's >= date_time-tdiff.  Then if it's less than
> date_time+tdiff, return True, otherwise False.  Check out the bisect
> module.  Function bisect_left() should do what you want in a sorted list.

Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. This was the code I had:

times = messageTimes[tup[0]]
le = bisect.bisect_right(times, tup[1])
ge = bisect.bisect_left(times, tup[1])
return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

> >>> cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> >> As the code exists, there's no need to copy the list.  Just do a simple 
> >> bind.
> 
> > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. 
> 
> Every "assignment" to a simple name is really a rebinding of that name.
> cdata = [tup for tup in cdata if determine(tup)]
> 
> will rebind the name to the new object, much quicker than copying.  If
> this is indeed a top-level line, it should be equivalent.  But if in
> fact this is inside some other function, it may violate some other
> assumptions.  In particular, if there are other names for the same
> object, then you're probably stuck with modifying it in place, using
> slice notation.

The slice notation  was left over when when cdata was a tuple. Now that it's a list I don't need that any more. 

> BTW, a set is generally much more memory efficient than a dict, when you
> don't use the "value".  But since I think you'll be better off with a
> dict of lists, it's a moot point.

I'm going back to square 1 and try and do all from SQL.

[toc] | [prev] | [next] | [standalone]


#35311

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-21 09:57 -0800
Message-ID<mailman.1163.1356112649.29569.python-list@python.org>
In reply to#35266
On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote:
> 
> > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> 
> >> <snip>
> 
> > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. 
> 
> But it seems these are methods in a class, or something, so we're
> missing context.  And you use self without it being an argument to the
> function.  Like it's a global.

I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as:

    def generate_data(self):

> 
> > <snip>
> 
> > Yes, the code works. I end up with just the rows I want.
> 
> >> Are you only concerned about speed, not fixing features?  
> 
> > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.
> 
> >
> 
> >> As far as I can tell, the logic that includes the time comparison is bogus.  
> 
> > Not at all. 
> 
> >
> 
> >> You don't do  anything there to worry about the value of tup[2], just whether some
> 
> >> item has a nearby time.  Of course, I could misunderstand the spec.
> 
> > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() 
> 
> I thought that tup[1] was the datetime.  In any case, the loop makes no
> sense to me, so I can't really optimize it, just make suggestions.

Yes, tup[1] is the datetime. I mistyped last night. 

> >> Are you making a global called 'self' ?  That name is by convention only
> 
> >> used in methods to designate the instance object.  What's the attribute
> 
> >> self?
> 
> > Yes, self is my instance object. self.message contains the string of interest that I need to look for. 
> 
> >
> 
> >> Can cdata have duplicates, and are they significant? 
> 
> > No, it will not have duplicates.
> 
> >
> 
> >> Is the list sorted in any way?
> 
> > Yes, the list is sorted by tool and datetime.
> 
> >
> 
> >> Chances are your performance bottleneck is the doubly-nested loop.  You
> 
> >> have a list comprehension at top-level code, and inside it calls a
> 
> >> function that also loops over the 600,000 items.  So the inner loop gets
> 
> >> executed 360 billion times.  You can cut this down drastically by some
> 
> >> judicious sorting, as well as by having a map of lists, where the map is
> 
> >> keyed by the tool.
> 
> > Thanks. I will try that.
> 
> 
> 
> So in your first loop, you could simply split the list into separate
> lists, one per tup[0] value, and the lists as dictionary items, keyed by
> that tool string.
> 
> Then inside the determine() function, make a local ref to the particular
> 
> list for the tool.
> 
>    recs = messageTimes[tup[0]]

I made that change ant went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app. 

> Instead of a for loop over recs, use a binary search to identify the
> first item that's >= date_time-tdiff.  Then if it's less than
> date_time+tdiff, return True, otherwise False.  Check out the bisect
> module.  Function bisect_left() should do what you want in a sorted list.

Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. This was the code I had:

times = messageTimes[tup[0]]
le = bisect.bisect_right(times, tup[1])
ge = bisect.bisect_left(times, tup[1])
return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

> >>> cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> >> As the code exists, there's no need to copy the list.  Just do a simple 
> >> bind.
> 
> > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. 
> 
> Every "assignment" to a simple name is really a rebinding of that name.
> cdata = [tup for tup in cdata if determine(tup)]
> 
> will rebind the name to the new object, much quicker than copying.  If
> this is indeed a top-level line, it should be equivalent.  But if in
> fact this is inside some other function, it may violate some other
> assumptions.  In particular, if there are other names for the same
> object, then you're probably stuck with modifying it in place, using
> slice notation.

The slice notation  was left over when when cdata was a tuple. Now that it's a list I don't need that any more. 

> BTW, a set is generally much more memory efficient than a dict, when you
> don't use the "value".  But since I think you'll be better off with a
> dict of lists, it's a moot point.

I'm going back to square 1 and try and do all from SQL.

[toc] | [prev] | [next] | [standalone]


#35315

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-21 12:36 -0800
Message-ID<b334fb0c-0e76-400d-90e4-1ecde1cc2090@googlegroups.com>
In reply to#35311
On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
> On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> > On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote: 
> > > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: 
> > >> <snip>
 
> > > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. 
 
> > But it seems these are methods in a class, or something, so we're 
> > missing context.  And you use self without it being an argument to the 
> > function.  Like it's a global.

> I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as:
>  
>     def generate_data(self):

> > > <snip>
 
> > > Yes, the code works. I end up with just the rows I want.
  
> > >> Are you only concerned about speed, not fixing features?  
 
> > > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.
 
> > >> As far as I can tell, the logic that includes the time comparison is bogus.  
 
> > > Not at all. 

> > >> You don't do  anything there to worry about the value of tup[2], just whether some
> > >> item has a nearby time.  Of course, I could misunderstand the spec.

> > > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() 
 
> > I thought that tup[1] was the datetime.  In any case, the loop makes no
> > sense to me, so I can't really optimize it, just make suggestions.
 
> Yes, tup[1] is the datetime. I mistyped last night. 

> > >> Are you making a global called 'self' ?  That name is by convention only
> > >> used in methods to designate the instance object.  What's the attribute
> > >> self?

> > > Yes, self is my instance object. self.message contains the string of interest that I need to look for. 

> > >> Can cdata have duplicates, and are they significant? 

> > > No, it will not have duplicates.

> > >> Is the list sorted in any way?
 
> > > Yes, the list is sorted by tool and datetime.

> > >> Chances are your performance bottleneck is the doubly-nested loop.  You

> > >> have a list comprehension at top-level code, and inside it calls a

> > >> function that also loops over the 600,000 items.  So the inner loop gets

> > >> executed 360 billion times.  You can cut this down drastically by some

> > >> judicious sorting, as well as by having a map of lists, where the map is
> > >> keyed by the tool.

> > > Thanks. I will try that.

> > So in your first loop, you could simply split the list into separate 
> > lists, one per tup[0] value, and the lists as dictionary items, keyed by
> > that tool string.

> > Then inside the determine() function, make a local ref to the particular 
> > list for the tool.
> >    recs = messageTimes[tup[0]]

> I made that change ant went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app. 

> > Instead of a for loop over recs, use a binary search to identify the 
> > first item that's >= date_time-tdiff.  Then if it's less than 
> > date_time+tdiff, return True, otherwise False.  Check out the bisect 
> > module.  Function bisect_left() should do what you want in a sorted list.

> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. 

The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. 

Thanks very much for all your help. 




> This was the code I had:
>  
> times = messageTimes[tup[0]]
> 
> le = bisect.bisect_right(times, tup[1])
> 
> ge = bisect.bisect_left(times, tup[1])
> 
> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
> 
> 
> 
> > >>> cdata[:] = [tup for tup in cdata if determine(tup)]
> 
> > 
> 
> > >> As the code exists, there's no need to copy the list.  Just do a simple 
> 
> > >> bind.
> 
> > 
> 
> > > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. 
> 
> > 
> 
> > Every "assignment" to a simple name is really a rebinding of that name.
> 
> > cdata = [tup for tup in cdata if determine(tup)]
> 
> > 
> 
> > will rebind the name to the new object, much quicker than copying.  If
> 
> > this is indeed a top-level line, it should be equivalent.  But if in
> 
> > fact this is inside some other function, it may violate some other
> 
> > assumptions.  In particular, if there are other names for the same
> 
> > object, then you're probably stuck with modifying it in place, using
> 
> > slice notation.
> 
> 
> 
> The slice notation  was left over when when cdata was a tuple. Now that it's a list I don't need that any more. 
> 
> 
> 
> > BTW, a set is generally much more memory efficient than a dict, when you
> 
> > don't use the "value".  But since I think you'll be better off with a
> 
> > dict of lists, it's a moot point.
> 
> 
> 
> I'm going back to square 1 and try and do all from SQL.

[toc] | [prev] | [next] | [standalone]


#35336

FromDave Angel <d@davea.name>
Date2012-12-21 22:19 -0500
Message-ID<mailman.1183.1356146412.29569.python-list@python.org>
In reply to#35315
On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote:
> On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
>> <snip>
>> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. 
> The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. 
>
> Thanks very much for all your help. 
>
>

Congratulations. But you're not done. A fourfold improvement isn't
nearly as much as I'd expect.  When you go from a n-squared algorithm to
an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold
improvement.  Assuming we're talking just about time spent in the cdata
 = list-comprehension list.

It's also probably possible to simplify, and get some speedup from other
parts of the code other than that one loop.  For example, if the bisect
is really working right, it's probably unnecessary to even copy the
original cdata.  You said it was sorted.  So bisect it directly, and
skip the messageTimes intermediate data structure.  It may not help, but
i suspect it will.



>
>> This was the code I had:
>>  
>> times = messageTimes[tup[0]]
>>
>> le = bisect.bisect_right(times, tup[1])
>>
>> ge = bisect.bisect_left(times, tup[1])
>>
>> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

I'm stymied making further suggestions to this fragment.  Without seeing
the changes you apparently made to messageTimes, I can't even be sure
this is equivalent.  And I wonder if even looking up tup[1] is
worthwhile, since your caller already knows the index in cdata.  If you
used cdata directly, you might not need a sort at all.

I wish you could post some sample data, and identify which records are
intended to be True.  Obviously you could simplify, and just use ints
where it's using datetimes here.  But if your rule is simply accept all
records that come in a cluster with no delta more than a threshold, then
you don't even need any search at all.  Just pass the index in, and
compare the datetime with its predecessor and successor.

Could we see the whole fragment as you originally had it, but with the
optimizations you've done so far?  The more I look at that determine()
function, the more futile it seems.  I just don't understand what the
criteria are supposed to be.  Your list-comp loops through all of cdata,
and for each item, passes that item to determine().  determine() loops
through a copy of cdata, checking to see if any of them is within tdiff
of tup.  But won't that always return True, since you'll encounter the
record and compare it to itself?


-- 

DaveA

[toc] | [prev] | [next] | [standalone]


#35337

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-21 20:47 -0800
Message-ID<81a9e131-3fab-438e-8799-5aa5bd0e1559@googlegroups.com>
In reply to#35336
On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote:
> 
> > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
> 
> >> <snip>
> 
> >> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. 
> 
> > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. 
> 
> >
> 
> > Thanks very much for all your help. 
> 
> Congratulations. But you're not done. A fourfold improvement isn't
> nearly as much as I'd expect.  When you go from a n-squared algorithm to
> an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold 
> improvement.  Assuming we're talking just about time spent in the cdata 
>  = list-comprehension list.
> 
> It's also probably possible to simplify, and get some speedup from other
> parts of the code other than that one loop.  For example, if the bisect
> is really working right, it's probably unnecessary to even copy the
> original cdata.  You said it was sorted.  So bisect it directly, and
> skip the messageTimes intermediate data structure.  It may not help, but
> i suspect it will.
> >
> 
> >> This was the code I had:
> >> 
> >> times = messageTimes[tup[0]]
> 
> >>
> 
> >> le = bisect.bisect_right(times, tup[1])
> >> ge = bisect.bisect_left(times, tup[1])
> >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
> 
> 
> 
> I'm stymied making further suggestions to this fragment.  Without seeing
> the changes you apparently made to messageTimes, I can't even be sure 
> this is equivalent.  And I wonder if even looking up tup[1] is 
> worthwhile, since your caller already knows the index in cdata.  If you
> used cdata directly, you might not need a sort at all.
>  
> I wish you could post some sample data, and identify which records are 
> intended to be True.  Obviously you could simplify, and just use ints 
> where it's using datetimes here.  But if your rule is simply accept all
> records that come in a cluster with no delta more than a threshold, then
> you don't even need any search at all.  Just pass the index in, and
> compare the datetime with its predecessor and successor.
> 
> Could we see the whole fragment as you originally had it, but with the 
> optimizations you've done so far?  The more I look at that determine() 
> function, the more futile it seems.  I just don't understand what the
> criteria are supposed to be.  Your list-comp loops through all of cdata,
> and for each item, passes that item to determine().  determine() loops
> through a copy of cdata, checking to see if any of them is within tdiff
> of tup.  But won't that always return True, since you'll encounter the
> record and compare it to itself?

I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.

cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.

Here's my original code:

# record time for each message matching the specified message for each tool 
messageTimes = {} 
for row in cdata:   # tool, time, message 
    if self.message in row[2]: 
        messageTimes[row[0], row[1]] = 1 

# now pull out each message that is within the time diff for each matched message 
# as well as the matched messages themselves 

def determine(tup): 
    if self.message in tup[2]: return True      # matched message 

    for (tool, date_time) in messageTimes: 
        if tool == tup[0]: 
            if abs(date_time-tup[1]) <= tdiff: 
               return True 

    return False 
        
cdata[:] = [tup for tup in cdata if determine(tup)] 

Here's the code now:


       # Parse data and make a list of the time for each message matching the specified message for each tool
        messageTimes = defaultdict(list)    # a dict with sorted lists

        for row in cdata:   # tool, time, message
            if self.message in row[2]:
                messageTimes[row[0]].append(row[1])

        # now pull out each message that is within the time context for each matched message
        # as well as the matched messages themselves

        # return true is we should keep this message
        def determine(tup):
            if self.message in tup[2]: return True     # matched message
            if seconds == 0: return False                # no time context specified

            times = messageTimes[tup[0]]              # get the list of matched messages for this tool
           
            le = bisect.bisect_right(times, tup[1])   # find time less than or equal to tup[1]
            ge = bisect.bisect_left(times, tup[1])    # find time greater then to equal to tup[1]
            return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

        cdata = [tup for tup in cdata if determine(tup)]

In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold. 

I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. 

I hope I've explained this better. Thanks again. 

[toc] | [prev] | [next] | [standalone]


#35339

FromDave Angel <d@davea.name>
Date2012-12-22 01:47 -0500
Message-ID<mailman.1185.1356158853.29569.python-list@python.org>
In reply to#35337
On 12/21/2012 11:47 PM, Larry.Martell@gmail.com wrote:
> On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
>> On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote:
>>
>>>> <snip>
> I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.
>
> cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.
>
> Here's my original code:
>
> # record time for each message matching the specified message for each tool 
> messageTimes = {} 
> for row in cdata:   # tool, time, message 
>     if self.message in row[2]: 
>         messageTimes[row[0], row[1]] = 1 
>
> # now pull out each message that is within the time diff for each matched message 
> # as well as the matched messages themselves 
>
> def determine(tup): 
>     if self.message in tup[2]: return True      # matched message 
>
>     for (tool, date_time) in messageTimes: 
>         if tool == tup[0]: 
>             if abs(date_time-tup[1]) <= tdiff: 
>                return True 
>
>     return False 
>         
> cdata[:] = [tup for tup in cdata if determine(tup)] 
>
> Here's the code now:
>
>
>        # Parse data and make a list of the time for each message matching the specified message for each tool
>         messageTimes = defaultdict(list)    # a dict with sorted lists
>
>         for row in cdata:   # tool, time, message
>             if self.message in row[2]:
>                 messageTimes[row[0]].append(row[1])
>
>         # now pull out each message that is within the time context for each matched message
>         # as well as the matched messages themselves
>
>         # return true is we should keep this message
>         def determine(tup):
>             if self.message in tup[2]: return True     # matched message
>             if seconds == 0: return False                # no time context specified
>
>             times = messageTimes[tup[0]]              # get the list of matched messages for this tool
>            
>             le = bisect.bisect_right(times, tup[1])   # find time less than or equal to tup[1]
>             ge = bisect.bisect_left(times, tup[1])    # find time greater then to equal to tup[1]
>             return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
>
>         cdata = [tup for tup in cdata if determine(tup)]
>
> In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold. 
>
> I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. 
>
> I hope I've explained this better. Thanks again. 

That is better, and the point I missed noticing before is that
messageTimes is substantially smaller than cdata;  it's already been
filtered down by looking for self.message in its row[2].  The code was
there, but I didn't relate.  Remember I was bothered that you didn't
look at tup[2] when you were comparing for time-similarity.  Well, you
did that implicitly, since messageTimes was already filtered.  Sorry
about that.

That also lowers my expectations for improvement ratio, since instead of
600,000 * 600,000, we're talking "only" 600,000 * 30,000, 5% as much. So
now my expectations are only 4:1 to 10:1.

Still, there's room for improvement.  (1) You should only need one
bisect in determine, and (2) if you remember the last result for each
tool, you could speed that one up some.

(1) Instead of getting both and le and a ge, get just one, by searching
for tup[1]-tdiff.  Then by comparing that row's value against
tup[1]+tdiff, you can return immediately the boolean, the expression
being about half of the one you've now got.

(2) Make a dict of ints, keyed by the tool, and initialized to zero.
Call that dict "found."  Each time you do a bisect, specify a range
starting at found[tool].  Similarly, store the result of the bisect in
found[tool].  This should gradually restrict the range searched by
bisect, which COULD save some time. It works because everything's sorted.

Naturally, make these changes independently, and time the changes.  In
particular, it's possible that #2 will degrade performance instead of
improving it.  But #1 should double performance, pretty close.

I hope these were clear enough.  I don't want to write the changes
myself, since with no test data, I could easily make a mess of it.
....

I can think of other changes which are less likely to make substantial
improvement, and which might degrade things.  For one drastic example,
instead of any messageTimes storage, you could have ("flags") a list of
bool, same size as ctimes, which tracked the state of each line.  You
loop through cdata, identifying and marking any line whose tup[2]
matches.  And for each match, you scan backwards and forwards till the
time gets out of range (in one direction stopping at time-tdiff or 0 or
tool change, and in the other direction at time+tdiff or len, or
toolchange), marking each one within tdiff.

Then after one pass through cdata, you can have a very simple list
comprehension, something like:

cdata = [tup for index, tup in enumerate(cdata) if flags[index]]

Will it be faster?  Depends on the number of expected matches (eg.
30,000 out of 300,000 is 10%), and how much data forward and backwards
you need to scan.
....

A very simple difference?  Perhaps using implicit unpacking instead of
using tup[0] etc. will be faster.  It'll certainly be easier to read.

for tool, time, text in cdata:
   if self.message in text:
       messageTimes[tool]. append(time)

def determine(tool, time, text):

and call it with    determine(*tup)


-- 

DaveA

[toc] | [prev] | [next] | [standalone]


#35475

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-24 09:57 -0800
Message-ID<9ca91fe0-006c-4ad8-8a3c-0f4d5c10b32b@googlegroups.com>
In reply to#35339
On Friday, December 21, 2012 11:47:10 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 11:47 PM, Larry.Martell@gmail.com wrote: 
> > On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> >> On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote:
> >>
> >>>> <snip 
> > I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold. 
> >
> > cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.
> > 
> > Here's my original code: 
> >
> > # record time for each message matching the specified message for each tool 
> > messageTimes = {} 
> > for row in cdata:   # tool, time, message  
> >     if self.message in row[2]:  
> >         messageTimes[row[0], row[1]] = 1 
> > 
> > # now pull out each message that is within the time diff for each matched message  
> > # as well as the matched messages themselves  
> >
> > def determine(tup):  
> >     if self.message in tup[2]: return True      # matched message  
> > 
> >     for (tool, date_time) in messageTimes:  
> >         if tool == tup[0]:  
> >             if abs(date_time-tup[1]) <= tdiff:  
> >                return True 
> > 
> >     return False  
> >          
> > cdata[:] = [tup for tup in cdata if determine(tup)] 
> >
> > Here's the code now: 
> > 
> > 
> >        # Parse data and make a list of the time for each message matching the specified message for each tool 
> >         messageTimes = defaultdict(list)    # a dict with sorted lists 
> > 
> >         for row in cdata:   # tool, time, message 
> >             if self.message in row[2]: 
> >                 messageTimes[row[0]].append(row[1]) 
> > 
> >         # now pull out each message that is within the time context for each matched message 
> >         # as well as the matched messages themselves 
> > 
> >         # return true is we should keep this message 
> >         def determine(tup): 
> >             if self.message in tup[2]: return True     # matched message
> >             if seconds == 0: return False                # no time context specified 
> > 
> >             times = messageTimes[tup[0]]              # get the list of matched messages for this tool 
> >             
> >             le = bisect.bisect_right(times, tup[1])   # find time less than or equal to tup[1] 
> >             ge = bisect.bisect_left(times, tup[1])    # find time greater then to equal to tup[1] 
> >             return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) 
> > 
> >         cdata = [tup for tup in cdata if determine(tup)] 
> > 
> > In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold.  
> >
> > I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. 
> > 
> > I hope I've explained this better. Thanks again.   
> 
> That is better, and the point I missed noticing before is that 
> messageTimes is substantially smaller than cdata;  it's already been 
> filtered down by looking for self.message in its row[2].  The code was 
> there, but I didn't relate.  Remember I was bothered that you didn't 
> look at tup[2] when you were comparing for time-similarity.  Well, you 
> did that implicitly, since messageTimes was already filtered.  Sorry 
> about that. 
> 
> That also lowers my expectations for improvement ratio, since instead of 
> 600,000 * 600,000, we're talking "only" 600,000 * 30,000, 5% as much. So
> now my expectations are only 4:1 to 10:1. 
>
> Still, there's room for improvement.  (1) You should only need one
> bisect in determine, and (2) if you remember the last result for each 
> tool, you could speed that one up some. 
>
> (1) Instead of getting both and le and a ge, get just one, by searching 
> for tup[1]-tdiff.  Then by comparing that row's value against 
> tup[1]+tdiff, you can return immediately the boolean, the expression 
> being about half of the one you've now got.

Dave, I cannot thank you enough. With this change it went from 20 minutes to 10.

> (2) Make a dict of ints, keyed by the tool, and initialized to zero. 
> Call that dict "found."  Each time you do a bisect, specify a range 
> starting at found[tool].  Similarly, store the result of the bisect in
> found[tool].  This should gradually restrict the range searched by 
> bisect, which COULD save some time. It works because everything's sorted.

And with this change it took 3 minutes. WOW! 

Merry Christmas and Happy New Year!!!

 
> Naturally, make these changes independently, and time the changes.  In
> 
> particular, it's possible that #2 will degrade performance instead of
> 
> improving it.  But #1 should double performance, pretty close.
> 
> 
> 
> I hope these were clear enough.  I don't want to write the changes
> 
> myself, since with no test data, I could easily make a mess of it.
> 
> ....
> 
> 
> 
> I can think of other changes which are less likely to make substantial
> 
> improvement, and which might degrade things.  For one drastic example,
> 
> instead of any messageTimes storage, you could have ("flags") a list of
> 
> bool, same size as ctimes, which tracked the state of each line.  You
> 
> loop through cdata, identifying and marking any line whose tup[2]
> 
> matches.  And for each match, you scan backwards and forwards till the
> 
> time gets out of range (in one direction stopping at time-tdiff or 0 or
> 
> tool change, and in the other direction at time+tdiff or len, or
> 
> toolchange), marking each one within tdiff.
> 
> 
> 
> Then after one pass through cdata, you can have a very simple list
> 
> comprehension, something like:
> 
> 
> 
> cdata = [tup for index, tup in enumerate(cdata) if flags[index]]
> 
> 
> 
> Will it be faster?  Depends on the number of expected matches (eg.
> 
> 30,000 out of 300,000 is 10%), and how much data forward and backwards
> 
> you need to scan.
> 
> ....
> 
> 
> 
> A very simple difference?  Perhaps using implicit unpacking instead of
> 
> using tup[0] etc. will be faster.  It'll certainly be easier to read.
> 
> 
> 
> for tool, time, text in cdata:
> 
>    if self.message in text:
> 
>        messageTimes[tool]. append(time)
> 
> 
> 
> def determine(tool, time, text):
> 
> 
> 
> and call it with    determine(*tup)

[toc] | [prev] | [next] | [standalone]


#35476

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-24 09:57 -0800
Message-ID<mailman.1261.1356371863.29569.python-list@python.org>
In reply to#35339
On Friday, December 21, 2012 11:47:10 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 11:47 PM, Larry.Martell@gmail.com wrote: 
> > On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> >> On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote:
> >>
> >>>> <snip 
> > I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold. 
> >
> > cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.
> > 
> > Here's my original code: 
> >
> > # record time for each message matching the specified message for each tool 
> > messageTimes = {} 
> > for row in cdata:   # tool, time, message  
> >     if self.message in row[2]:  
> >         messageTimes[row[0], row[1]] = 1 
> > 
> > # now pull out each message that is within the time diff for each matched message  
> > # as well as the matched messages themselves  
> >
> > def determine(tup):  
> >     if self.message in tup[2]: return True      # matched message  
> > 
> >     for (tool, date_time) in messageTimes:  
> >         if tool == tup[0]:  
> >             if abs(date_time-tup[1]) <= tdiff:  
> >                return True 
> > 
> >     return False  
> >          
> > cdata[:] = [tup for tup in cdata if determine(tup)] 
> >
> > Here's the code now: 
> > 
> > 
> >        # Parse data and make a list of the time for each message matching the specified message for each tool 
> >         messageTimes = defaultdict(list)    # a dict with sorted lists 
> > 
> >         for row in cdata:   # tool, time, message 
> >             if self.message in row[2]: 
> >                 messageTimes[row[0]].append(row[1]) 
> > 
> >         # now pull out each message that is within the time context for each matched message 
> >         # as well as the matched messages themselves 
> > 
> >         # return true is we should keep this message 
> >         def determine(tup): 
> >             if self.message in tup[2]: return True     # matched message
> >             if seconds == 0: return False                # no time context specified 
> > 
> >             times = messageTimes[tup[0]]              # get the list of matched messages for this tool 
> >             
> >             le = bisect.bisect_right(times, tup[1])   # find time less than or equal to tup[1] 
> >             ge = bisect.bisect_left(times, tup[1])    # find time greater then to equal to tup[1] 
> >             return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) 
> > 
> >         cdata = [tup for tup in cdata if determine(tup)] 
> > 
> > In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold.  
> >
> > I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. 
> > 
> > I hope I've explained this better. Thanks again.   
> 
> That is better, and the point I missed noticing before is that 
> messageTimes is substantially smaller than cdata;  it's already been 
> filtered down by looking for self.message in its row[2].  The code was 
> there, but I didn't relate.  Remember I was bothered that you didn't 
> look at tup[2] when you were comparing for time-similarity.  Well, you 
> did that implicitly, since messageTimes was already filtered.  Sorry 
> about that. 
> 
> That also lowers my expectations for improvement ratio, since instead of 
> 600,000 * 600,000, we're talking "only" 600,000 * 30,000, 5% as much. So
> now my expectations are only 4:1 to 10:1. 
>
> Still, there's room for improvement.  (1) You should only need one
> bisect in determine, and (2) if you remember the last result for each 
> tool, you could speed that one up some. 
>
> (1) Instead of getting both and le and a ge, get just one, by searching 
> for tup[1]-tdiff.  Then by comparing that row's value against 
> tup[1]+tdiff, you can return immediately the boolean, the expression 
> being about half of the one you've now got.

Dave, I cannot thank you enough. With this change it went from 20 minutes to 10.

> (2) Make a dict of ints, keyed by the tool, and initialized to zero. 
> Call that dict "found."  Each time you do a bisect, specify a range 
> starting at found[tool].  Similarly, store the result of the bisect in
> found[tool].  This should gradually restrict the range searched by 
> bisect, which COULD save some time. It works because everything's sorted.

And with this change it took 3 minutes. WOW! 

Merry Christmas and Happy New Year!!!

 
> Naturally, make these changes independently, and time the changes.  In
> 
> particular, it's possible that #2 will degrade performance instead of
> 
> improving it.  But #1 should double performance, pretty close.
> 
> 
> 
> I hope these were clear enough.  I don't want to write the changes
> 
> myself, since with no test data, I could easily make a mess of it.
> 
> ....
> 
> 
> 
> I can think of other changes which are less likely to make substantial
> 
> improvement, and which might degrade things.  For one drastic example,
> 
> instead of any messageTimes storage, you could have ("flags") a list of
> 
> bool, same size as ctimes, which tracked the state of each line.  You
> 
> loop through cdata, identifying and marking any line whose tup[2]
> 
> matches.  And for each match, you scan backwards and forwards till the
> 
> time gets out of range (in one direction stopping at time-tdiff or 0 or
> 
> tool change, and in the other direction at time+tdiff or len, or
> 
> toolchange), marking each one within tdiff.
> 
> 
> 
> Then after one pass through cdata, you can have a very simple list
> 
> comprehension, something like:
> 
> 
> 
> cdata = [tup for index, tup in enumerate(cdata) if flags[index]]
> 
> 
> 
> Will it be faster?  Depends on the number of expected matches (eg.
> 
> 30,000 out of 300,000 is 10%), and how much data forward and backwards
> 
> you need to scan.
> 
> ....
> 
> 
> 
> A very simple difference?  Perhaps using implicit unpacking instead of
> 
> using tup[0] etc. will be faster.  It'll certainly be easier to read.
> 
> 
> 
> for tool, time, text in cdata:
> 
>    if self.message in text:
> 
>        messageTimes[tool]. append(time)
> 
> 
> 
> def determine(tool, time, text):
> 
> 
> 
> and call it with    determine(*tup)

[toc] | [prev] | [next] | [standalone]


#35338

From"Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
Date2012-12-21 20:47 -0800
Message-ID<mailman.1184.1356151656.29569.python-list@python.org>
In reply to#35336
On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote:
> 
> > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
> 
> >> <snip>
> 
> >> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. 
> 
> > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. 
> 
> >
> 
> > Thanks very much for all your help. 
> 
> Congratulations. But you're not done. A fourfold improvement isn't
> nearly as much as I'd expect.  When you go from a n-squared algorithm to
> an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold 
> improvement.  Assuming we're talking just about time spent in the cdata 
>  = list-comprehension list.
> 
> It's also probably possible to simplify, and get some speedup from other
> parts of the code other than that one loop.  For example, if the bisect
> is really working right, it's probably unnecessary to even copy the
> original cdata.  You said it was sorted.  So bisect it directly, and
> skip the messageTimes intermediate data structure.  It may not help, but
> i suspect it will.
> >
> 
> >> This was the code I had:
> >> 
> >> times = messageTimes[tup[0]]
> 
> >>
> 
> >> le = bisect.bisect_right(times, tup[1])
> >> ge = bisect.bisect_left(times, tup[1])
> >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
> 
> 
> 
> I'm stymied making further suggestions to this fragment.  Without seeing
> the changes you apparently made to messageTimes, I can't even be sure 
> this is equivalent.  And I wonder if even looking up tup[1] is 
> worthwhile, since your caller already knows the index in cdata.  If you
> used cdata directly, you might not need a sort at all.
>  
> I wish you could post some sample data, and identify which records are 
> intended to be True.  Obviously you could simplify, and just use ints 
> where it's using datetimes here.  But if your rule is simply accept all
> records that come in a cluster with no delta more than a threshold, then
> you don't even need any search at all.  Just pass the index in, and
> compare the datetime with its predecessor and successor.
> 
> Could we see the whole fragment as you originally had it, but with the 
> optimizations you've done so far?  The more I look at that determine() 
> function, the more futile it seems.  I just don't understand what the
> criteria are supposed to be.  Your list-comp loops through all of cdata,
> and for each item, passes that item to determine().  determine() loops
> through a copy of cdata, checking to see if any of them is within tdiff
> of tup.  But won't that always return True, since you'll encounter the
> record and compare it to itself?

I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.

cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.

Here's my original code:

# record time for each message matching the specified message for each tool 
messageTimes = {} 
for row in cdata:   # tool, time, message 
    if self.message in row[2]: 
        messageTimes[row[0], row[1]] = 1 

# now pull out each message that is within the time diff for each matched message 
# as well as the matched messages themselves 

def determine(tup): 
    if self.message in tup[2]: return True      # matched message 

    for (tool, date_time) in messageTimes: 
        if tool == tup[0]: 
            if abs(date_time-tup[1]) <= tdiff: 
               return True 

    return False 
        
cdata[:] = [tup for tup in cdata if determine(tup)] 

Here's the code now:


       # Parse data and make a list of the time for each message matching the specified message for each tool
        messageTimes = defaultdict(list)    # a dict with sorted lists

        for row in cdata:   # tool, time, message
            if self.message in row[2]:
                messageTimes[row[0]].append(row[1])

        # now pull out each message that is within the time context for each matched message
        # as well as the matched messages themselves

        # return true is we should keep this message
        def determine(tup):
            if self.message in tup[2]: return True     # matched message
            if seconds == 0: return False                # no time context specified

            times = messageTimes[tup[0]]              # get the list of matched messages for this tool
           
            le = bisect.bisect_right(times, tup[1])   # find time less than or equal to tup[1]
            ge = bisect.bisect_left(times, tup[1])    # find time greater then to equal to tup[1]
            return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

        cdata = [tup for tup in cdata if determine(tup)]

In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold. 

I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. 

I hope I've explained this better. Thanks again. 

[toc] | [prev] | [next] | [standalone]


Page 1 of 2  [1] 2  Next page →

Back to top | Article view | comp.lang.python


csiph-web