Re: Best way to structure data for efficient searching

From	Peter Otten <__peter__@web.de>
Subject	Re: Best way to structure data for efficient searching
Date	2012-04-02 23:32 +0200
Organization	None
References	<e1863f1f-f4f6-4821-8ffc-ee1ffea10416@k4g2000yqa.googlegroups.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.1235.1333402382.3037.python-list@python.org> (permalink)

Show all headers | View raw

Larry.Martell@gmail.com wrote:

> I have the following use case:
> 
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
> 
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
> 
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
> 
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
> 
> Is there a better, more efficient way I could have done this?

Build a lookup table that maps (K3, K4) to the matching records in the 
second table. Then you can walk through the first table, look up the 
matching records in the second and filter by the timestamp constraint:

from collections import defaultdict, namedtuple

# simulate a database
One = namedtuple("One", "K1 K2 T")
Two = namedtuple("Two", "K3 K4 K5 T")
one = [
 ["alpha", "beta", 10],
 ["gamma", "delta", 20],
 ["gamma", "delta", 25],
 ["gamma", "delta", 40],
 ["kappa", "lambda", 40],
]
one = (One(*row) for row in one)

two = [
 ["alpha", "beta", "epsilon", 10],
 ["gamma", "delta", "zeta", 20],
 ["gamma", "delta", "eta", 60],
]
two = (Two(*row) for row in two)

# build the lookup table
lookup = defaultdict(list)
for rtwo in two:
    lookup[rtwo.K3, rtwo.K4].append(rtwo)

# show the matches
for rone in one:
    for rtwo in lookup.get((rone.K1, rone.K2), ()):
        if abs(rone.T-rtwo.T) <= 20:
            print rone, "-->", rtwo

(Personally I'd go for the SQL approach proposed by Jon; I find the argument 
why you can't do it unconvincing)

Thread

Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 11:39 -0700
  Re: Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 13:05 -0700
    Re: Best way to structure data for efficient searching Asen Bozhilov <asen.bozhilov@gmail.com> - 2012-03-29 07:04 -0700
  Re: Best way to structure data for efficient searching Jon Clements <joncle@googlemail.com> - 2012-03-28 12:52 -0700
  Re: Best way to structure data for efficient searching Peter Otten <__peter__@web.de> - 2012-04-02 23:32 +0200
  Re: Best way to structure data for efficient searching John Nagle <nagle@animats.com> - 2012-04-03 09:38 -0700
    Re: Best way to structure data for efficient searching Roy Smith <roy@panix.com> - 2012-04-03 21:45 -0400

csiph-web