Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #63797 > unrolled thread

efficient way to process data

Started byLarry Martell <larry.martell@gmail.com>
First post2014-01-12 14:23 -0500
Last post2014-01-12 14:23 -0500
Articles 1 — 1 participant

Back to article view | Back to comp.lang.python


Contents

  efficient way to process data Larry Martell <larry.martell@gmail.com> - 2014-01-12 14:23 -0500

#63797 — efficient way to process data

FromLarry Martell <larry.martell@gmail.com>
Date2014-01-12 14:23 -0500
Subjectefficient way to process data
Message-ID<mailman.5383.1389555067.18130.python-list@python.org>
I have an python app that queries a MySQL DB. The query has this form:

SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
FROM t
GROUP BY a, b, c, d, f

x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
10053.490, 2542.094).

The business issue is that if either x or y in 2 rows that are in the
same a, b, c, d group are within 1 of each other then they should be
grouped together. And to make it more complicated, the tolerance is
applied as a rolling continuum. For example, if the x and y in a set
of grouped rows are:

row 1: 1.5, 9.5
row 2: 2.4, 20.8
row 3: 3.3, 40.6
row 4: 4.2, 2.5
row 5: 5.1, 10.1
row 6: 6.0, 7.9
row 7: 8.0, 21.0
row 8: 100, 200

1 through 6 get combined because all their X values are within the
tolerance of some other X in the set that's been combined. 7's Y value
is within the tolerance of 2's Y, so that should be combined as well.
8 is not combined because neither the X or Y value is within the
tolerance of any X or Y in the set that was combined.

AFAIK, there is no way to do this in SQL. In python I can easily parse
the data and identify the rows that need to be combined, but then I've
lost the ability to calculate the average and std across the combined
data set. The only way I can think of to do this is to remove the
grouping from the SQL and do all the grouping and aggregating myself.
But this query often returns 20k to 30k rows after grouping. It could
easily be 80k to 100k rows or more that I have to process if I remove
the grouping and I think that will end up being very slow.

Anyone have any ideas how I can efficiently do this?

Thanks!
-larry

[toc] | [standalone]


Back to top | Article view | comp.lang.python


csiph-web