Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Newsgroups: comp.lang.python
Date: Thu, 20 Dec 2012 17:46:38 -0800 (PST)
In-Reply-To: <mailman.1126.1356052648.29569.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=68.84.146.219; posting-account=aFD2wgkAAACT3OnBYoNKQGBzyOZ_PB2h
References: <bd5ae6b7-2440-42e4-a93c-eb877feebcfe@googlegroups.com> <mailman.1126.1356052648.29569.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: help with making my code more efficient
From: "Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: python-list@python.org, d@davea.name
Precedence: list
Message-ID: <mailman.1128.1356054403.29569.python-list@python.org>
Lines: 148
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:35257

On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
>=20
> > I have a list of tuples that contains a tool_id, a time, and a message.=
 I want to select from this list all the elements where the message matches=
 some string, and all the other elements where the time is within some diff=
 of any matching message for that tool.=20
>=20
> > Here is how I am currently doing this:
>=20
> No, it's not.  This is a fragment of code, without enough clues as to
>=20
> what else is going.  We can guess, but that's likely to make a mess.

Of course it's a fragment - it's part of a large program and I was just sho=
wing the relevant parts.=20

> First question is whether this code works exactly correctly? =20

Yes, the code works. I end up with just the rows I want.=20

> Are you only concerned about speed, not fixing features? =20

Don't know what you mean by 'fixing features'. The code does what I want, i=
t just takes too long.

> As far as I can tell, the logic that includes the time comparison is bogu=
s. =20

Not at all.=20

> You don't do  anything there to worry about the value of tup[2], just whe=
ther some
> item has a nearby time.  Of course, I could misunderstand the spec.

The data comes from a database. tup[2] is a datetime column. tdiff comes fr=
om a datetime.timedelta()=20

> Are you making a global called 'self' ?  That name is by convention only
> used in methods to designate the instance object.  What's the attribute
> self?

Yes, self is my instance object. self.message contains the string of intere=
st that I need to look for.=20

> Can cdata have duplicates, and are they significant?=20

No, it will not have duplicates.

> Are you including  the time building that as part of your 2 hour measurem=
ent?=20

No, the 2 hours is just the time to run the=20

cdata[:] =3D [tup for tup in cdata if determine(tup)]

> Is the list sorted in any way?

Yes, the list is sorted by tool and datetime.

> Chances are your performance bottleneck is the doubly-nested loop.  You
> have a list comprehension at top-level code, and inside it calls a
> function that also loops over the 600,000 items.  So the inner loop gets
> executed 360 billion times.  You can cut this down drastically by some
> judicious sorting, as well as by having a map of lists, where the map is
> keyed by the tool.

Thanks. I will try that.

> > # record time for each message matching the specified message for each =
tool
>=20
> > messageTimes =3D {}
>
> You're building a dictionary;  are you actually using the value (1), or
>  is only the key relevant? =20

Only the keys.

> A set is a dict without a value. =20

Yes, I could use a set, but I don't think that would make it measurably fas=
ter.=20

> But more mportantly, you never look up anything in this dictionary.  So w=
hy
> isn't it a list?  For that matter, why don't you just use the
> messageTimes list?

Yes, it could be a list too.
=20
> > for row in cdata:   # tool, time, message
>=20
> >     if self.message in row[2]:
>=20
> >         messageTimes[row[0], row[1]] =3D 1
>=20
> >
>=20
> > # now pull out each message that is within the time diff for each match=
ed message
>=20
> > # as well as the matched messages themselves
>=20
> >
>=20
> > def determine(tup):
>=20
> >     if self.message in tup[2]: return True      # matched message=20
>=20
> >
>=20
> >     for (tool, date_time) in messageTimes:
>=20
> >         if tool =3D=3D tup[0]:
>=20
> >             if abs(date_time-tup[1]) <=3D tdiff:=20
>=20
> >                return True
>=20
> >
>=20
> >     return False
>=20
> >        =20
>=20
> > cdata[:] =3D [tup for tup in cdata if determine(tup)]
>=20
>=20
>=20
> As the code exists, there's no need to copy the list.  Just do a simple
> bind.

This statement is to remove the items from cdata that I don't want. I don't=
 know what you mean by bind. I'm not familiar with that python function.=20

>=20
>=20
>=20
> >
>=20
> > This code works, but it takes way too long to run - e.g. when cdata has=
 600,000 elements (which is typical for my app) it takes 2 hours for this t=
o run.=20
>=20
> >
>=20
> > Can anyone give me some suggestions on speeding this up?