Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #22485 > unrolled thread
| Started by | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| First post | 2012-03-28 11:39 -0700 |
| Last post | 2012-04-03 21:45 -0400 |
| Articles | 7 — 6 participants |
Back to article view | Back to comp.lang.python
Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 11:39 -0700
Re: Best way to structure data for efficient searching "Larry.Martell@gmail.com" <larry.martell@gmail.com> - 2012-03-28 13:05 -0700
Re: Best way to structure data for efficient searching Asen Bozhilov <asen.bozhilov@gmail.com> - 2012-03-29 07:04 -0700
Re: Best way to structure data for efficient searching Jon Clements <joncle@googlemail.com> - 2012-03-28 12:52 -0700
Re: Best way to structure data for efficient searching Peter Otten <__peter__@web.de> - 2012-04-02 23:32 +0200
Re: Best way to structure data for efficient searching John Nagle <nagle@animats.com> - 2012-04-03 09:38 -0700
Re: Best way to structure data for efficient searching Roy Smith <roy@panix.com> - 2012-04-03 21:45 -0400
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-03-28 11:39 -0700 |
| Subject | Best way to structure data for efficient searching |
| Message-ID | <e1863f1f-f4f6-4821-8ffc-ee1ffea10416@k4g2000yqa.googlegroups.com> |
I have the following use case: I have a set of data that is contains 3 fields, K1, K2 and a timestamp. There are duplicates in the data set, and they all have to processed. Then I have another set of data with 4 fields: K3, K4, K5, and a timestamp. There are also duplicates in that data set, and they also all have to be processed. I need to find all the items in the second data set where K1==K3 and K2==K4 and the 2 timestamps are within 20 seconds of each other. I have this working, but the way I did it seems very inefficient - I simply put the data in 2 arrays (as tuples) and then walked through the entire second data set once for each item in the first data set, looking for matches. Is there a better, more efficient way I could have done this?
[toc] | [next] | [standalone]
| From | "Larry.Martell@gmail.com" <larry.martell@gmail.com> |
|---|---|
| Date | 2012-03-28 13:05 -0700 |
| Message-ID | <90d08585-361d-428e-a82d-c5005a6c4b5e@k24g2000yqe.googlegroups.com> |
| In reply to | #22485 |
On Mar 28, 1:52 pm, Jon Clements <jon...@googlemail.com> wrote: > On Wednesday, 28 March 2012 19:39:54 UTC+1, Larry....@gmail.com wrote: > > I have the following use case: > > > I have a set of data that is contains 3 fields, K1, K2 and a > > timestamp. There are duplicates in the data set, and they all have to > > processed. > > > Then I have another set of data with 4 fields: K3, K4, K5, and a > > timestamp. There are also duplicates in that data set, and they also > > all have to be processed. > > > I need to find all the items in the second data set where K1==K3 and > > K2==K4 and the 2 timestamps are within 20 seconds of each other. > > > I have this working, but the way I did it seems very inefficient - I > > simply put the data in 2 arrays (as tuples) and then walked through > > the entire second data set once for each item in the first data set, > > looking for matches. > > > Is there a better, more efficient way I could have done this? > > It might not be more *efficient* but others might find it more readable, and it'd be easier to change later. Try an in-memory SQL DB (such as sqlite3) and query as (untested) > > select t2.* from t1 join t2 on k1=k3 and k2=k4 where abs(t1.timestamp - t2.timestamp) < 20 This is part of django app, and the data came from mysql. Through a mechanism I can't change at this time (it would take a lot of code changes and this new functionality is needed ASAP) I get all the data at once and have to winnow it down. > Failing that, two (default)dicts with a tuple as the pair, then use that as your base. Since there are duplicates, I can't use a dict. And if I have any extraneous data in the keys (i.e. something to make them unique) then I still have to walk through the entire dict to find the matches.
[toc] | [prev] | [next] | [standalone]
| From | Asen Bozhilov <asen.bozhilov@gmail.com> |
|---|---|
| Date | 2012-03-29 07:04 -0700 |
| Message-ID | <a893aaf7-9554-4797-a337-a01e9dbf6b81@do4g2000vbb.googlegroups.com> |
| In reply to | #22486 |
Larry.Mart wrote:
> Since there are duplicates, I can't use a dict. And if I have any
> extraneous data in the keys (i.e. something to make them unique) then
> I still have to walk through the entire dict to find the matches.
You can use slightly different approach. With double mapping you could
simplify the lookup. What I mean?
Get the first set and build lookup map as:
MAP := {
KEY1-VALUE : {
KEY2-VALUE : [SET, SET, n-th duplicate SET]
}
}
In the worst case you would have quadratic complexity of the
algorithm. Otherwise the lookup would be really fast.
[toc] | [prev] | [next] | [standalone]
| From | Jon Clements <joncle@googlemail.com> |
|---|---|
| Date | 2012-03-28 12:52 -0700 |
| Message-ID | <26352843.1421.1332964332608.JavaMail.geo-discussion-forums@vbue17> |
| In reply to | #22485 |
On Wednesday, 28 March 2012 19:39:54 UTC+1, Larry....@gmail.com wrote: > I have the following use case: > > I have a set of data that is contains 3 fields, K1, K2 and a > timestamp. There are duplicates in the data set, and they all have to > processed. > > Then I have another set of data with 4 fields: K3, K4, K5, and a > timestamp. There are also duplicates in that data set, and they also > all have to be processed. > > I need to find all the items in the second data set where K1==K3 and > K2==K4 and the 2 timestamps are within 20 seconds of each other. > > I have this working, but the way I did it seems very inefficient - I > simply put the data in 2 arrays (as tuples) and then walked through > the entire second data set once for each item in the first data set, > looking for matches. > > Is there a better, more efficient way I could have done this? It might not be more *efficient* but others might find it more readable, and it'd be easier to change later. Try an in-memory SQL DB (such as sqlite3) and query as (untested) select t2.* from t1 join t2 on k1=k3 and k2=k4 where abs(t1.timestamp - t2.timestamp) < 20 Failing that, two (default)dicts with a tuple as the pair, then use that as your base. Jon.
[toc] | [prev] | [next] | [standalone]
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-04-02 23:32 +0200 |
| Message-ID | <mailman.1235.1333402382.3037.python-list@python.org> |
| In reply to | #22485 |
Larry.Martell@gmail.com wrote:
> I have the following use case:
>
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
>
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
>
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
>
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
>
> Is there a better, more efficient way I could have done this?
Build a lookup table that maps (K3, K4) to the matching records in the
second table. Then you can walk through the first table, look up the
matching records in the second and filter by the timestamp constraint:
from collections import defaultdict, namedtuple
# simulate a database
One = namedtuple("One", "K1 K2 T")
Two = namedtuple("Two", "K3 K4 K5 T")
one = [
["alpha", "beta", 10],
["gamma", "delta", 20],
["gamma", "delta", 25],
["gamma", "delta", 40],
["kappa", "lambda", 40],
]
one = (One(*row) for row in one)
two = [
["alpha", "beta", "epsilon", 10],
["gamma", "delta", "zeta", 20],
["gamma", "delta", "eta", 60],
]
two = (Two(*row) for row in two)
# build the lookup table
lookup = defaultdict(list)
for rtwo in two:
lookup[rtwo.K3, rtwo.K4].append(rtwo)
# show the matches
for rone in one:
for rtwo in lookup.get((rone.K1, rone.K2), ()):
if abs(rone.T-rtwo.T) <= 20:
print rone, "-->", rtwo
(Personally I'd go for the SQL approach proposed by Jon; I find the argument
why you can't do it unconvincing)
[toc] | [prev] | [next] | [standalone]
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2012-04-03 09:38 -0700 |
| Message-ID | <jlf92p$8dv$1@dont-email.me> |
| In reply to | #22485 |
On 3/28/2012 11:39 AM, Larry.Martell@gmail.com wrote:
> I have the following use case:
>
> I have a set of data that is contains 3 fields, K1, K2 and a
> timestamp. There are duplicates in the data set, and they all have to
> processed.
>
> Then I have another set of data with 4 fields: K3, K4, K5, and a
> timestamp. There are also duplicates in that data set, and they also
> all have to be processed.
>
> I need to find all the items in the second data set where K1==K3 and
> K2==K4 and the 2 timestamps are within 20 seconds of each other.
>
> I have this working, but the way I did it seems very inefficient - I
> simply put the data in 2 arrays (as tuples) and then walked through
> the entire second data set once for each item in the first data set,
> looking for matches.
>
> Is there a better, more efficient way I could have done this?
How big are the data sets? Millions of entries? Billions?
Trillions? Will all the data fit in memory, or will this need
files or a database.
In-memory, it's not hard. First, decide which data set is smaller.
That one gets a dictionary keyed by K1 or K3, with each entry being
a list of tuples. Then go through the other data set linearly.
You can also sort one database by K1, the other by K3, and
match. Then take the matches, sort by K2 and K4, and match again.
Sort the remaining matches by timestamp and pull the ones within
the threshold.
Or you can load all the data into a database with a query
optimizer, like MySQL, and let it figure out, based on the
index sizes, how to do the join.
All of these approaches are roughly O(N log N), which
beats the O(N^2) approach you have now.
John Nagle
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-04-03 21:45 -0400 |
| Message-ID | <roy-E75C32.21451803042012@news.panix.com> |
| In reply to | #22617 |
> On 3/28/2012 11:39 AM, Larry.Martell@gmail.com wrote: > > I have a set of data that is contains 3 fields, K1, K2 and a > > timestamp. There are duplicates in the data set, and they all have to > > processed. > > > > Then I have another set of data with 4 fields: K3, K4, K5, and a > > timestamp. There are also duplicates in that data set, and they also > > all have to be processed. > > > > I need to find all the items in the second data set where K1==K3 and > > K2==K4 and the 2 timestamps are within 20 seconds of each other. In article <jlf92p$8dv$1@dont-email.me>, John Nagle <nagle@animats.com> wrote: > [some good ideas] > All of these approaches are roughly O(N log N), which > beats the O(N^2) approach you have now. If the timestamps are sparse enough, I can think of a way that's O(N), or pretty close to it.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web