Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #22542

Re: Best way to structure data for efficient searching

From Peter Otten <__peter__@web.de>
Subject Re: Best way to structure data for efficient searching
Date 2012-04-02 23:32 +0200
Organization None
References <e1863f1f-f4f6-4821-8ffc-ee1ffea10416@k4g2000yqa.googlegroups.com>
Newsgroups comp.lang.python
Message-ID <mailman.1235.1333402382.3037.python-list@python.org> (permalink)

Show all headers | View raw


Larry.Martell@gmail.com wrote:

> I have the following use case:
> 
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
> 
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
> 
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
> 
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
> 
> Is there a better, more efficient way I could have done this?

Build a lookup table that maps (K3, K4) to the matching records in the 
second table. Then you can walk through the first table, look up the 
matching records in the second and filter by the timestamp constraint:

from collections import defaultdict, namedtuple

# simulate a database
One = namedtuple("One", "K1 K2 T")
Two = namedtuple("Two", "K3 K4 K5 T")
one = [
 ["alpha", "beta", 10],
 ["gamma", "delta", 20],
 ["gamma", "delta", 25],
 ["gamma", "delta", 40],
 ["kappa", "lambda", 40],
]
one = (One(*row) for row in one)

two = [
 ["alpha", "beta", "epsilon", 10],
 ["gamma", "delta", "zeta", 20],
 ["gamma", "delta", "eta", 60],
]
two = (Two(*row) for row in two)

# build the lookup table
lookup = defaultdict(list)
for rtwo in two:
    lookup[rtwo.K3, rtwo.K4].append(rtwo)

# show the matches
for rone in one:
    for rtwo in lookup.get((rone.K1, rone.K2), ()):
        if abs(rone.T-rtwo.T) <= 20:
            print rone, "-->", rtwo

(Personally I'd go for the SQL approach proposed by Jon; I find the argument 
why you can't do it unconvincing)

Back to comp.lang.python | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 11:39 -0700
  Re: Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 13:05 -0700
    Re: Best way to structure data for efficient searching Asen Bozhilov <asen.bozhilov@gmail.com> - 2012-03-29 07:04 -0700
  Re: Best way to structure data for efficient searching Jon Clements <joncle@googlemail.com> - 2012-03-28 12:52 -0700
  Re: Best way to structure data for efficient searching Peter Otten <__peter__@web.de> - 2012-04-02 23:32 +0200
  Re: Best way to structure data for efficient searching John Nagle <nagle@animats.com> - 2012-04-03 09:38 -0700
    Re: Best way to structure data for efficient searching Roy Smith <roy@panix.com> - 2012-04-03 21:45 -0400

csiph-web