Path: csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Newsgroups: comp.lang.python
Date: Fri, 21 Dec 2012 20:47:28 -0800 (PST)
In-Reply-To: <mailman.1183.1356146412.29569.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=68.84.146.219; posting-account=aFD2wgkAAACT3OnBYoNKQGBzyOZ_PB2h
References: <bd5ae6b7-2440-42e4-a93c-eb877feebcfe@googlegroups.com> <mailman.1126.1356052648.29569.python-list@python.org> <d6aaa5b5-7d21-4018-ba9a-ea354b15b6c5@googlegroups.com> <mailman.1134.1356060700.29569.python-list@python.org> <mailman.1163.1356112649.29569.python-list@python.org> <b334fb0c-0e76-400d-90e4-1ecde1cc2090@googlegroups.com> <mailman.1183.1356146412.29569.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: help with making my code more efficient
From: "Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: python-list@python.org, d@davea.name
Precedence: list
Message-ID: <mailman.1184.1356151656.29569.python-list@python.org>
Lines: 152
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:35338

On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, Larry.Martell@gmail.com wrote:
>=20
> > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wro=
te:
>=20
> >> <snip>
>=20
> >> Didn't know about bisect. Thanks. I thought it would be my savior for =
sure. But unfortunaly when I added that, it blows up with out of memory.=20
>=20
> > The out of memory error had nothing to do with using bisect. I had intr=
oduced a typo that I really though would have caused a variable referenced =
before assignment error. But it did not do that, and instead somehow caused=
 all the memory in my machine to get used up. When I fixed that, it worked =
really well with bisect. The code that was taking 2 hours was down to 20 mi=
nutes, and even better, a query that was taking 40 minutes was down to 8 se=
conds.=20
>=20
> >
>=20
> > Thanks very much for all your help.=20
>=20
> Congratulations. But you're not done. A fourfold improvement isn't
> nearly as much as I'd expect.  When you go from a n-squared algorithm to
> an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold=20
> improvement.  Assuming we're talking just about time spent in the cdata=
=20
>  =3D list-comprehension list.
>=20
> It's also probably possible to simplify, and get some speedup from other
> parts of the code other than that one loop.  For example, if the bisect
> is really working right, it's probably unnecessary to even copy the
> original cdata.  You said it was sorted.  So bisect it directly, and
> skip the messageTimes intermediate data structure.  It may not help, but
> i suspect it will.
> >
>=20
> >> This was the code I had:
> >>=20
> >> times =3D messageTimes[tup[0]]
>=20
> >>
>=20
> >> le =3D bisect.bisect_right(times, tup[1])
> >> ge =3D bisect.bisect_left(times, tup[1])
> >> return (le and tup[1]-times[le-1] <=3D tdiff) or (ge !=3D len(times) a=
nd times[ge]-tup[1] <=3D tdiff)
>=20
>=20
>=20
> I'm stymied making further suggestions to this fragment.  Without seeing
> the changes you apparently made to messageTimes, I can't even be sure=20
> this is equivalent.  And I wonder if even looking up tup[1] is=20
> worthwhile, since your caller already knows the index in cdata.  If you
> used cdata directly, you might not need a sort at all.
> =20
> I wish you could post some sample data, and identify which records are=20
> intended to be True.  Obviously you could simplify, and just use ints=20
> where it's using datetimes here.  But if your rule is simply accept all
> records that come in a cluster with no delta more than a threshold, then
> you don't even need any search at all.  Just pass the index in, and
> compare the datetime with its predecessor and successor.
>=20
> Could we see the whole fragment as you originally had it, but with the=20
> optimizations you've done so far?  The more I look at that determine()=20
> function, the more futile it seems.  I just don't understand what the
> criteria are supposed to be.  Your list-comp loops through all of cdata,
> and for each item, passes that item to determine().  determine() loops
> through a copy of cdata, checking to see if any of them is within tdiff
> of tup.  But won't that always return True, since you'll encounter the
> record and compare it to itself?

I think you're misunderstanding what I need to do. I have a set of rows fro=
m the database with tool, time, and message. The user has specified a strin=
g and a time threshold. From that set of rows I need to return all the rows=
 that contain the user's string and all the other rows that match the tool =
from the matched rows and have a time within the threshold.

cdata has all the rows. messageTimes has the times of all the matched messa=
ges, keyed by tool. In determine() I don't look though cdata - it gets one =
element from cdata and I see if that should be selected because it either m=
atches the user's message, or it's within the threshold of one that did mat=
ch.

Here's my original code:

# record time for each message matching the specified message for each tool=
=20
messageTimes =3D {}=20
for row in cdata:   # tool, time, message=20
    if self.message in row[2]:=20
        messageTimes[row[0], row[1]] =3D 1=20

# now pull out each message that is within the time diff for each matched m=
essage=20
# as well as the matched messages themselves=20

def determine(tup):=20
    if self.message in tup[2]: return True      # matched message=20

    for (tool, date_time) in messageTimes:=20
        if tool =3D=3D tup[0]:=20
            if abs(date_time-tup[1]) <=3D tdiff:=20
               return True=20

    return False=20
       =20
cdata[:] =3D [tup for tup in cdata if determine(tup)]=20

Here's the code now:


       # Parse data and make a list of the time for each message matching t=
he specified message for each tool
        messageTimes =3D defaultdict(list)    # a dict with sorted lists

        for row in cdata:   # tool, time, message
            if self.message in row[2]:
                messageTimes[row[0]].append(row[1])

        # now pull out each message that is within the time context for eac=
h matched message
        # as well as the matched messages themselves

        # return true is we should keep this message
        def determine(tup):
            if self.message in tup[2]: return True     # matched message
            if seconds =3D=3D 0: return False                # no time cont=
ext specified

            times =3D messageTimes[tup[0]]              # get the list of m=
atched messages for this tool
          =20
            le =3D bisect.bisect_right(times, tup[1])   # find time less th=
an or equal to tup[1]
            ge =3D bisect.bisect_left(times, tup[1])    # find time greater=
 then to equal to tup[1]
            return (le and tup[1]-times[le-1] <=3D tdiff) or (ge !=3D len(t=
imes) and times[ge]-tup[1] <=3D tdiff)

        cdata =3D [tup for tup in cdata if determine(tup)]

In my test case, cdata started with 600k rows, 30k matched the users string=
, and a total of 110k needed to be returned (which is what cdata ended up w=
ith) - the 30k that matched the string, and 80k that were within the time t=
hreshold.=20

I think the point you may have missed is the tool - I only return a row if =
it's the same tool as a matched message and within the threshold.=20

I hope I've explained this better. Thanks again.