Path: csiph.com!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder2.enfer-du-nord.net!newsfeed.eweka.nl!eweka.nl!feeder3.eweka.nl!newsfeed.xs4all.nl!newsfeed4.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Newsgroups: comp.lang.python
Date: Fri, 21 Dec 2012 12:36:16 -0800 (PST)
In-Reply-To: <mailman.1163.1356112649.29569.python-list@python.org>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=68.84.146.219; posting-account=aFD2wgkAAACT3OnBYoNKQGBzyOZ_PB2h
References: <bd5ae6b7-2440-42e4-a93c-eb877feebcfe@googlegroups.com> <mailman.1126.1356052648.29569.python-list@python.org> <d6aaa5b5-7d21-4018-ba9a-ea354b15b6c5@googlegroups.com> <mailman.1134.1356060700.29569.python-list@python.org> <mailman.1163.1356112649.29569.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Subject: Re: help with making my code more efficient
From: "Larry.Martell@gmail.com" <Larry.Martell@gmail.com>
To: comp.lang.python@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: python-list@python.org, d@davea.name
Precedence: list
Message-ID: <mailman.1166.1356122185.29569.python-list@python.org>
Lines: 180
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:35316

On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
> On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> > On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote:
> > > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> > >> <snip>

> > > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts.

> > But it seems these are methods in a class, or something, so we're
> > missing context.  And you use self without it being an argument to the
> > function.  Like it's a global.

> I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as:
>
>     def generate_data(self):

> > > <snip>

> > > Yes, the code works. I end up with just the rows I want.

> > >> Are you only concerned about speed, not fixing features?

> > > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.

> > >> As far as I can tell, the logic that includes the time comparison is bogus.

> > > Not at all.

> > >> You don't do anything there to worry about the value of tup[2], just whether some
> > >> item has a nearby time.  Of course, I could misunderstand the spec.

> > > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta()

> > I thought that tup[1] was the datetime.  In any case, the loop makes no
> > sense to me, so I can't really optimize it, just make suggestions.

> Yes, tup[1] is the datetime. I mistyped last night.

> > >> Are you making a global called 'self' ?  That name is by convention only
> > >> used in methods to designate the instance object.  What's the attribute
> > >> self?

> > > Yes, self is my instance object. self.message contains the string of interest that I need to look for.

> > >> Can cdata have duplicates, and are they significant?

> > > No, it will not have duplicates.

> > >> Is the list sorted in any way?

> > > Yes, the list is sorted by tool and datetime.

> > >> Chances are your performance bottleneck is the doubly-nested loop.  You
> > >> have a list comprehension at top-level code, and inside it calls a
> > >> function that also loops over the 600,000 items.  So the inner loop gets
> > >> executed 360 billion times.  You can cut this down drastically by some
> > >> judicious sorting, as well as by having a map of lists, where the map is
> > >> keyed by the tool.

> > > Thanks. I will try that.
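
For a sense of scale, here is a rough back-of-the-envelope sketch of the numbers quoted above. It assumes a single sorted list of 600,000 rows and one binary search per row; with one list per tool the searches would be shorter still. The variable names are only illustrative.

```python
import math

n = 600_000  # number of rows mentioned in the thread

# Doubly-nested loop: the inner body runs once for every pair of rows.
nested_steps = n * n

# One binary search per row: at most ceil(log2(n)) comparisons each
# (a rough upper bound, assuming one big sorted list).
bisect_steps = n * math.ceil(math.log2(n))

print(nested_steps)   # 360000000000, i.e. 360 billion
print(bisect_steps)   # 12000000, i.e. 12 million
```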

> > So in your first loop, you could simply split the list into separate
> > lists, one per tup[0] value, and store the lists as dictionary items, keyed by
> > that tool string.

> > Then inside the determine() function, make a local ref to the particular
> > list for the tool.
> >    recs = messageTimes[tup[0]]

> I made that change and went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app.
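
For anyone following along, the per-tool split might look roughly like this. It is only a sketch: it assumes each row is a (tool, datetime) pair, whereas the real rows have more columns, and group_times_by_tool and the sample rows are made up for illustration. Only messageTimes and recs are names from the thread.

```python
from collections import defaultdict
from datetime import datetime

def group_times_by_tool(rows):
    """Map each tool name to the list of its datetimes, in input order."""
    times_by_tool = defaultdict(list)
    for tool, when in rows:
        times_by_tool[tool].append(when)
    return dict(times_by_tool)

# Hypothetical sample rows, already sorted by tool and datetime.
rows = [
    ("toolA", datetime(2012, 12, 20, 8, 0, 0)),
    ("toolA", datetime(2012, 12, 20, 8, 0, 5)),
    ("toolB", datetime(2012, 12, 20, 9, 30, 0)),
]

messageTimes = group_times_by_tool(rows)
recs = messageTimes["toolA"]  # only this tool's times need to be searched
```

Because the input is sorted by tool and datetime, each per-tool list comes out sorted by datetime as well, which is what a later binary search needs.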

> > Instead of a for loop over recs, use a binary search to identify the
> > first item that's >= date_time-tdiff.  Then if it's less than
> > date_time+tdiff, return True, otherwise False.  Check out the bisect
> > module.  Function bisect_left() should do what you want in a sorted list.

> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunately when I added that, it blew up with an out-of-memory error.

The out-of-memory error had nothing to do with using bisect. I had introduced a typo that I really thought would have caused a "variable referenced before assignment" error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds.
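
A minimal sketch of the bisect_left() approach Dave described, not my actual method: has_time_within and the sample data are hypothetical, it assumes the list of datetimes is sorted, and it treats both ends of the window as inclusive.

```python
import bisect
from datetime import datetime, timedelta

def has_time_within(times, when, tdiff):
    """Return True if the sorted list `times` holds a value within tdiff of `when`."""
    # Index of the first entry that is >= when - tdiff.
    i = bisect.bisect_left(times, when - tdiff)
    # That entry is in the window only if it is also <= when + tdiff.
    return i < len(times) and times[i] <= when + tdiff

# Hypothetical sorted times for one tool.
times = [datetime(2012, 12, 20, 8, 0, 0), datetime(2012, 12, 20, 8, 10, 0)]
tdiff = timedelta(seconds=30)

near = has_time_within(times, datetime(2012, 12, 20, 8, 0, 20), tdiff)  # 20 s after the first entry
far = has_time_within(times, datetime(2012, 12, 20, 8, 5, 0), tdiff)    # minutes from either entry
```

One binary search per row replaces a full scan of the tool's list.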

Thanks very much for all your help.




> This was the code I had:
>
> times = messageTimes[tup[0]]
> le = bisect.bisect_right(times, tup[1])
> ge = bisect.bisect_left(times, tup[1])
> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
>
> > >>> cdata[:] = [tup for tup in cdata if determine(tup)]
>
> > >> As the code exists, there's no need to copy the list.  Just do a simple
> > >> bind.
>
> > > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function.
>
> > Every "assignment" to a simple name is really a rebinding of that name.
> > cdata = [tup for tup in cdata if determine(tup)]
> >
> > will rebind the name to the new object, much quicker than copying.  If
> > this is indeed a top-level line, it should be equivalent.  But if in
> > fact this is inside some other function, it may violate some other
> > assumptions.  In particular, if there are other names for the same
> > object, then you're probably stuck with modifying it in place, using
> > slice notation.
>
> The slice notation was left over from when cdata was a tuple. Now that it's a list I don't need that any more.
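
To make the rebinding point concrete, here is a small sketch of the difference when a second name refers to the same list. The keep() filter and the integer data are made up for illustration; they stand in for determine() and the real rows.

```python
def keep(n):
    """Hypothetical stand-in for determine(): keep even numbers only."""
    return n % 2 == 0

# Rebinding: the name cdata now points at a new list; the alias is untouched.
cdata = [1, 2, 3, 4]
alias = cdata
cdata = [n for n in cdata if keep(n)]
rebinding_result = (cdata, alias)            # ([2, 4], [1, 2, 3, 4])

# Slice assignment: the existing list object is modified in place,
# so every name bound to it sees the filtered contents.
cdata2 = [1, 2, 3, 4]
alias2 = cdata2
cdata2[:] = [n for n in cdata2 if keep(n)]
slice_result = (cdata2, alias2)              # ([2, 4], [2, 4])
```

If nothing else holds a reference to the original list, the two forms end up with the same contents and plain rebinding is enough.
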
>
> > BTW, a set is generally much more memory efficient than a dict, when you
> > don't use the "value".  But since I think you'll be better off with a
> > dict of lists, it's a moot point.
>
> I'm going back to square one to try to do it all in SQL.