Re: efficient way to process data

Path	csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.tele.dk!feed118.news.tele.dk!news.tele.dk!small.news.tele.dk!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Return-Path	<rosuav@gmail.com>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
X-Spam-Status	OK 0.001
X-Spam-Evidence	'H': 1.00; 'S': 0.00; 'aggregate': 0.07; 'column': 0.07; 'think,': 0.07; 'ascii': 0.09; 'clause': 0.09; 'derived': 0.09; 'part,': 0.09; 'pretend': 0.09; 'rows': 0.09; 'sql,': 0.09; 'subject:process': 0.09; 'type,': 0.09; 'cc:addr :python-list': 0.11; 'python': 0.11; 'jan': 0.12; 'stored': 0.12; '(either': 0.16; '1.5,': 0.16; '3.3,': 0.16; '6.0,': 0.16; 'columns': 0.16; 'combined.': 0.16; 'complicated,': 0.16; 'emit': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'grouped': 0.16; 'magic': 0.16; 'other),': 0.16; 'pgsql': 0.16; 'somewhere.': 0.16; 'ignore': 0.16; 'wrote:': 0.18; 'bit': 0.19; 'app': 0.19; 'trying': 0.19; 'select': 0.22; 'cc:addr:python.org': 0.22; 'creating': 0.23; 'form:': 0.24; 'merge': 0.24; 'stick': 0.24; 'fairly': 0.24; 'mon,': 0.24; 'together.': 0.24; 'cc:2**0': 0.24; 'script': 0.25; 'query': 0.26; 'this:': 0.26; 'values': 0.27; 'gets': 0.27; 'header:In-Reply-To:1': 0.27; 'function': 0.29; '[1]': 0.29; 'am,': 0.29; 'points': 0.29; 'database,': 0.30; 'message-id:@mail.gmail.com': 0.30; "i'm": 0.30; 'url:wiki': 0.31; '13,': 0.31; 'grouping': 0.31; 'larry': 0.31; 'probably': 0.32; 'figure': 0.32; 'run': 0.32; 'quite': 0.32; 'says': 0.33; 'running': 0.33; 'are:': 0.33; 'table': 0.34; "i'd": 0.34; "can't": 0.35; 'but': 0.35; 'received:google.com': 0.35; 'building': 0.35; 'there': 0.35; 'subject:data': 0.36; "didn't": 0.36; 'possible': 0.36; 'url:org': 0.36; 'should': 0.36; 'example,': 0.37; 'so,': 0.37; 'two': 0.37; 'list': 0.37; 'window': 0.38; 'whatever': 0.38; 'issue': 0.38; 'previous': 0.38; 'that,': 0.38; 'does': 0.39; 'sure': 0.39; 'either': 0.39; 'how': 0.40; 'analyze': 0.60; 'numbers': 0.61; "you're": 0.61; 'further': 0.61; 'back': 0.62; 'such': 0.63; 'more': 0.64; 'within': 0.65; 'effectively': 0.66; 'worth': 0.66; 'results': 0.69; 'business': 0.70; 'groups.': 0.74; 'analysis': 0.75; '(ie': 0.84; '10.1': 0.84; '9.5': 0.84; 'how.': 0.84; 'tolerance': 0.84; 'tricky': 0.84; 'thing,': 0.91; 'to:none': 0.92; 'this!': 0.93
DKIM-Signature	v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=kG1fnvmpABleSh3MsOZK8yBbj44xjhqg2gNszLpbquU=; b=y6TZTO1XZ/EL9jhSpmpx7fH398xHVyvduZnfkriRRY88//Rf18UkiptTA5ULqQGo2x RN7l1RKDtMxRUVFaf1oKYk60Gy3Bf/93InTvTUxGEF4MpimfIae+PdVtWr7EvhyrndRq HvKlmtFsLBR0ldV/AWRUW2xe7ZgPLs7IPvnk/6qZpbDXBe7vRFHuo3I8+rUAceIWZJmq tBJmGfiRfCDGp3xRvsN2VjKlfmINSnfbWtQCODQeiNK/UQ2rgWZjbWp+FwAs8lXZ7dtS rvijSv5W/yrrJ5nbO0A491OZ3gu0rElscgdFvvnVjP2OI+lnJ3kXj6plIn/yUbcsP96P arSA==
MIME-Version	1.0
X-Received	by 10.68.183.164 with SMTP id en4mr16874pbc.169.1389569222184; Sun, 12 Jan 2014 15:27:02 -0800 (PST)
In-Reply-To	<CACwCsY6KBDVkS5jCMh9GyvhHyVgqcAH3YAYnGpMQvfBwexaTcw@mail.gmail.com>
References	<CACwCsY6KBDVkS5jCMh9GyvhHyVgqcAH3YAYnGpMQvfBwexaTcw@mail.gmail.com>
Date	Mon, 13 Jan 2014 10:27:02 +1100
Subject	Re: efficient way to process data
From	Chris Angelico <rosuav@gmail.com>
Cc	"python-list@python.org" <python-list@python.org>
Content-Type	text/plain; charset=UTF-8
X-BeenThere	python-list@python.org
X-Mailman-Version	2.1.15
Precedence	list
List-Id	General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe	<https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive	<http://mail.python.org/pipermail/python-list/>
List-Post	<mailto:python-list@python.org>
List-Help	<mailto:python-list-request@python.org?subject=help>
List-Subscribe	<https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Newsgroups	comp.lang.python
Message-ID	<mailman.5390.1389569231.18130.python-list@python.org> (permalink)
Lines	80
NNTP-Posting-Host	2001:888:2000:d::a6
X-Trace	1389569231 news.xs4all.nl 2830 [2001:888:2000:d::a6]:37619
X-Complaints-To	abuse@xs4all.nl
Xref	csiph.com comp.lang.python:63806

Show key headers only | View raw

On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry.martell@gmail.com> wrote:
> I have an python app that queries a MySQL DB. The query has this form:
>
> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
> FROM t
> GROUP BY a, b, c, d, f
>
> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
> 10053.490, 2542.094).
>
> The business issue is that if either x or y in 2 rows that are in the
> same a, b, c, d group are within 1 of each other then they should be
> grouped together. And to make it more complicated, the tolerance is
> applied as a rolling continuum. For example, if the x and y in a set
> of grouped rows are:
>
> row 1: 1.5, 9.5
> row 2: 2.4, 20.8
> row 3: 3.3, 40.6
> row 4: 4.2, 2.5
> row 5: 5.1, 10.1
> row 6: 6.0, 7.9
> row 7: 8.0, 21.0
> row 8: 100, 200
>
> 1 through 6 get combined because all their X values are within the
> tolerance of some other X in the set that's been combined. 7's Y value
> is within the tolerance of 2's Y, so that should be combined as well.
> 8 is not combined because neither the X or Y value is within the
> tolerance of any X or Y in the set that was combined.

Trying to get my head around this a bit more. Are columns a/b/c/d
treated as a big category (eg type, brand, category, model), such that
nothing will ever be grouped that has any difference in those four
columns? If so, we can effectively ignore them and pretend we have a
table with exactly one set (eg stick a WHERE clause onto the query
that stipulates their values). Then what you have is this:

* Aggregate based on proximity of x and y
* Emit results derived from e

Is that correct?

So here's my way of writing it.

* Subselect: List all values for x, in order, and figure out which
ones are less than the previous value plus one
* Subselect: Ditto, for y.
* Outer select: Somehow do an either-or group. I'm not quite sure how
to do that part, actually!

A PGSQL window function would cover the two subselects - at least, I'm
fairly sure it would. I can't quite get the whole thing, though; I can
get a true/false flag that says whether it's near to the previous one
(that's easy), and creating a grouping column value should be possible
from that but I'm not sure how.

But an either-or grouping is a bit trickier. The best I can think of
is to collect all the y values for each group of x values, and then if
any two groups 'overlap' (ie have points within 1.0 of each other),
merge the groups. That's going to be seriously tricky to do in SQL, I
think, so you may have to go back to Python on that one.

My analysis suggests that, whatever happens, you're going to need
every single y value somewhere. So it's probably not worth trying to
do any grouping/aggregation in SQL, since you need to further analyze
all the individual data points. I can't think of any way better than
just leafing through the whole table (either in Python or in a stored
procedure - if you can run your script on the same computer that's
running the database, I'd do that, otherwise consider a stored
procedure to reduce network transfers) and building up mappings.

Of course, "I can't think of a way" does not equate to "There is no
way". There may be some magic trick that I didn't think of, or some
arcane incantation that gets what you want. Who knows? If you can
produce an ASCII art Mandelbrot set [1] in pure SQL, why not this!

ChrisA

[1] http://wiki.postgresql.org/wiki/Mandelbrot_set

Back to comp.lang.python | Previous | Next | Find similar | Unroll thread

Thread

Re: efficient way to process data Chris Angelico <rosuav@gmail.com> - 2014-01-13 10:27 +1100

csiph-web