Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #22542
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Subject | Re: Best way to structure data for efficient searching |
| Date | 2012-04-02 23:32 +0200 |
| Organization | None |
| References | <e1863f1f-f4f6-4821-8ffc-ee1ffea10416@k4g2000yqa.googlegroups.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1235.1333402382.3037.python-list@python.org> (permalink) |
Larry.Martell@gmail.com wrote:
> I have the following use case:
>
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
>
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
>
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
>
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
>
> Is there a better, more efficient way I could have done this?
Build a lookup table that maps (K3, K4) to the matching records in the
second table. Then you can walk through the first table, look up the
matching records in the second and filter by the timestamp constraint:
from collections import defaultdict, namedtuple
# simulate a database
One = namedtuple("One", "K1 K2 T")
Two = namedtuple("Two", "K3 K4 K5 T")
one = [
["alpha", "beta", 10],
["gamma", "delta", 20],
["gamma", "delta", 25],
["gamma", "delta", 40],
["kappa", "lambda", 40],
]
one = (One(*row) for row in one)
two = [
["alpha", "beta", "epsilon", 10],
["gamma", "delta", "zeta", 20],
["gamma", "delta", "eta", 60],
]
two = (Two(*row) for row in two)
# build the lookup table
lookup = defaultdict(list)
for rtwo in two:
lookup[rtwo.K3, rtwo.K4].append(rtwo)
# show the matches
for rone in one:
for rtwo in lookup.get((rone.K1, rone.K2), ()):
if abs(rone.T-rtwo.T) <= 20:
print rone, "-->", rtwo
(Personally I'd go for the SQL approach proposed by Jon; I find the argument
why you can't do it unconvincing)
Back to comp.lang.python | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 11:39 -0700
Re: Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 13:05 -0700
Re: Best way to structure data for efficient searching Asen Bozhilov <asen.bozhilov@gmail.com> - 2012-03-29 07:04 -0700
Re: Best way to structure data for efficient searching Jon Clements <joncle@googlemail.com> - 2012-03-28 12:52 -0700
Re: Best way to structure data for efficient searching Peter Otten <__peter__@web.de> - 2012-04-02 23:32 +0200
Re: Best way to structure data for efficient searching John Nagle <nagle@animats.com> - 2012-04-03 09:38 -0700
Re: Best way to structure data for efficient searching Roy Smith <roy@panix.com> - 2012-04-03 21:45 -0400
csiph-web