Groups > comp.lang.python > #22485 > unrolled thread

Best way to structure data for efficient searching

Started by	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
First post	2012-03-28 11:39 -0700
Last post	2012-04-03 21:45 -0400
Articles	7 — 6 participants

Back to article view | Back to comp.lang.python

  Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 11:39 -0700
    Re: Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 13:05 -0700
      Re: Best way to structure data for efficient searching Asen Bozhilov <asen.bozhilov@gmail.com> - 2012-03-29 07:04 -0700
    Re: Best way to structure data for efficient searching Jon Clements <joncle@googlemail.com> - 2012-03-28 12:52 -0700
    Re: Best way to structure data for efficient searching Peter Otten <__peter__@web.de> - 2012-04-02 23:32 +0200
    Re: Best way to structure data for efficient searching John Nagle <nagle@animats.com> - 2012-04-03 09:38 -0700
      Re: Best way to structure data for efficient searching Roy Smith <roy@panix.com> - 2012-04-03 21:45 -0400

#22485 — Best way to structure data for efficient searching

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-03-28 11:39 -0700
Subject	Best way to structure data for efficient searching
Message-ID	<e1863f1f-f4f6-4821-8ffc-ee1ffea10416@k4g2000yqa.googlegroups.com>

I have the following use case:

I have a set of data that is contains 3 fields, K1, K2 and a
timestamp. There are duplicates in the data set, and they all have to
processed.

Then I have another set of data with 4 fields: K3, K4, K5, and a
timestamp. There are also duplicates in that data set, and they also
all have to be processed.

I need to find all the items in the second data set where K1==K3 and
K2==K4 and the 2 timestamps are within 20 seconds of each other.

I have this working, but the way I did it seems very inefficient - I
simply put the data in 2 arrays (as tuples) and then walked through
the entire second data set once for each item in the first data set,
looking for matches.

Is there a better, more efficient way I could have done this?

[toc] | [next] | [standalone]

#22486

From	"Larry.Martell@gmail.com" <larry.martell@gmail.com>
Date	2012-03-28 13:05 -0700
Message-ID	<90d08585-361d-428e-a82d-c5005a6c4b5e@k24g2000yqe.googlegroups.com>
In reply to	#22485

On Mar 28, 1:52 pm, Jon Clements <jon...@googlemail.com> wrote:
> On Wednesday, 28 March 2012 19:39:54 UTC+1, Larry....@gmail.com  wrote:
> > I have the following use case:
>
> > I have a set of data that is contains 3 fields, K1, K2 and a
> > timestamp. There are duplicates in the data set, and they all have to
> > processed.
>
> > Then I have another set of data with 4 fields: K3, K4, K5, and a
> > timestamp. There are also duplicates in that data set, and they also
> > all have to be processed.
>
> > I need to find all the items in the second data set where K1==K3 and
> > K2==K4 and the 2 timestamps are within 20 seconds of each other.
>
> > I have this working, but the way I did it seems very inefficient - I
> > simply put the data in 2 arrays (as tuples) and then walked through
> > the entire second data set once for each item in the first data set,
> > looking for matches.
>
> > Is there a better, more efficient way I could have done this?
>
> It might not be more *efficient* but others might find it more readable, and it'd be easier to change later. Try an in-memory SQL DB (such as sqlite3) and query as (untested)
>
> select t2.* from t1 join t2 on k1=k3 and k2=k4 where abs(t1.timestamp - t2.timestamp) < 20

This is part of django app, and the data came from mysql. Through a
mechanism I can't change at this time (it would take a lot of code
changes and this new functionality is needed ASAP) I get all the data
at once and have to winnow it down.


> Failing that, two (default)dicts with a tuple as the pair, then use that as your base.

Since there are duplicates, I can't use a dict. And if I have any
extraneous data in the keys (i.e. something to make them unique) then
I still have to walk through the entire dict to find the matches.

[toc] | [prev] | [next] | [standalone]

#22494

From	Asen Bozhilov <asen.bozhilov@gmail.com>
Date	2012-03-29 07:04 -0700
Message-ID	<a893aaf7-9554-4797-a337-a01e9dbf6b81@do4g2000vbb.googlegroups.com>
In reply to	#22486

Larry.Mart wrote:

> Since there are duplicates, I can't use a dict. And if I have any
> extraneous data in the keys (i.e. something to make them unique) then
> I still have to walk through the entire dict to find the matches.

You can use slightly different approach. With double mapping you could
simplify the lookup. What I mean?
Get the first set and build lookup map as:

MAP := {
    KEY1-VALUE : {
        KEY2-VALUE : [SET, SET, n-th duplicate SET]
    }
}

In the worst case you would have quadratic complexity of the
algorithm. Otherwise the lookup would be really fast.

[toc] | [prev] | [next] | [standalone]

#22487

From	Jon Clements <joncle@googlemail.com>
Date	2012-03-28 12:52 -0700
Message-ID	<26352843.1421.1332964332608.JavaMail.geo-discussion-forums@vbue17>
In reply to	#22485

On Wednesday, 28 March 2012 19:39:54 UTC+1, Larry....@gmail.com  wrote:
> I have the following use case:
> 
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
> 
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
> 
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
> 
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
> 
> Is there a better, more efficient way I could have done this?

It might not be more *efficient* but others might find it more readable, and it'd be easier to change later. Try an in-memory SQL DB (such as sqlite3) and query as (untested)

select t2.* from t1 join t2 on k1=k3 and k2=k4 where abs(t1.timestamp - t2.timestamp) < 20

Failing that, two (default)dicts with a tuple as the pair, then use that as your base.

Jon.

[toc] | [prev] | [next] | [standalone]

#22542

From	Peter Otten <__peter__@web.de>
Date	2012-04-02 23:32 +0200
Message-ID	<mailman.1235.1333402382.3037.python-list@python.org>
In reply to	#22485

Larry.Martell@gmail.com wrote:

> I have the following use case:
> 
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
> 
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
> 
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
> 
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
> 
> Is there a better, more efficient way I could have done this?

Build a lookup table that maps (K3, K4) to the matching records in the 
second table. Then you can walk through the first table, look up the 
matching records in the second and filter by the timestamp constraint:

from collections import defaultdict, namedtuple

# simulate a database
One = namedtuple("One", "K1 K2 T")
Two = namedtuple("Two", "K3 K4 K5 T")
one = [
 ["alpha", "beta", 10],
 ["gamma", "delta", 20],
 ["gamma", "delta", 25],
 ["gamma", "delta", 40],
 ["kappa", "lambda", 40],
]
one = (One(*row) for row in one)

two = [
 ["alpha", "beta", "epsilon", 10],
 ["gamma", "delta", "zeta", 20],
 ["gamma", "delta", "eta", 60],
]
two = (Two(*row) for row in two)

# build the lookup table
lookup = defaultdict(list)
for rtwo in two:
    lookup[rtwo.K3, rtwo.K4].append(rtwo)

# show the matches
for rone in one:
    for rtwo in lookup.get((rone.K1, rone.K2), ()):
        if abs(rone.T-rtwo.T) <= 20:
            print rone, "-->", rtwo

(Personally I'd go for the SQL approach proposed by Jon; I find the argument 
why you can't do it unconvincing)

[toc] | [prev] | [next] | [standalone]

#22617

From	John Nagle <nagle@animats.com>
Date	2012-04-03 09:38 -0700
Message-ID	<jlf92p$8dv$1@dont-email.me>
In reply to	#22485

On 3/28/2012 11:39 AM, Larry.Martell@gmail.com wrote:
> I have the following use case:
>
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
>
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
>
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
>
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
>
> Is there a better, more efficient way I could have done this?

    How big are the data sets?  Millions of entries?  Billions?
Trillions?  Will all the data fit in memory, or will this need
files or a database.

    In-memory, it's not hard.  First, decide which data set is smaller.
That one gets a dictionary keyed by K1 or K3, with each entry being
a list of tuples.  Then go through the other data set linearly.

    You can also sort one database by K1, the other by K3, and
match.  Then take the matches, sort by K2 and K4, and match again.
Sort the remaining matches by timestamp and pull the ones within
the threshold.

    Or you can load all the data into a database with a query
optimizer, like MySQL, and let it figure out, based on the
index sizes, how to do the join.

    All of these approaches are roughly O(N log N), which
beats the O(N^2) approach you have now.

				John Nagle

[toc] | [prev] | [next] | [standalone]

#22647

From	Roy Smith <roy@panix.com>
Date	2012-04-03 21:45 -0400
Message-ID	<roy-E75C32.21451803042012@news.panix.com>
In reply to	#22617

> On 3/28/2012 11:39 AM, Larry.Martell@gmail.com wrote:
> > I have a set of data that is contains 3 fields, K1, K2 and a
> > timestamp. There are duplicates in the data set, and they all have to
> > processed.
> >
> > Then I have another set of data with 4 fields: K3, K4, K5, and a
> > timestamp. There are also duplicates in that data set, and they also
> > all have to be processed.
> >
> > I need to find all the items in the second data set where K1==K3 and
> > K2==K4 and the 2 timestamps are within 20 seconds of each other.

In article <jlf92p$8dv$1@dont-email.me>, John Nagle <nagle@animats.com> 
wrote:
> [some good ideas]
>    All of these approaches are roughly O(N log N), which
> beats the O(N^2) approach you have now.

If the timestamps are sparse enough, I can think of a way that's O(N), 
or pretty close to it.

[toc] | [prev] | [standalone]

csiph-web

Best way to structure data for efficient searching

Contents

#22485 — Best way to structure data for efficient searching

#22486

#22494

#22487

#22542

#22617

#22647