Path: csiph.com!usenet.pasdenom.info!gegeweb.org!usenet-fr.net!nerim.net!novso.com!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.000 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'python.': 0.02; 'aggregate': 0.07; 'column': 0.07; 'none,': 0.07; 'think,': 0.07; 'ascii': 0.09; 'clause': 0.09; 'derived': 0.09; 'part,': 0.09; 'pretend': 0.09; 'rows': 0.09; 'sql,': 0.09; 'subject:process': 0.09; 'type,': 0.09; 'cc:addr:python-list': 0.11; 'python': 0.11; 'jan': 0.12; 'stored': 0.12; 'question.': 0.14; '(either': 0.16; '1.5,': 0.16; '3.3,': 0.16; '6.0,': 0.16; 'clause.': 0.16; 'columns': 0.16; 'combined.': 0.16; 'complicated,': 0.16; 'emit': 0.16; 'grouped': 0.16; 'magic': 0.16; 'other),': 0.16; 'pgsql': 0.16; 'simplified': 0.16; 'somewhere.': 0.16; 'ignore': 0.16; 'wrote:': 0.18; 'bit': 0.19; 'app': 0.19; 'trying': 0.19; 'any,': 0.19; 'select': 0.22; 'cc:addr:python.org': 0.22; 'creating': 0.23; 'form:': 0.24; 'merge': 0.24; 'stick': 0.24; 'fairly': 0.24; 'mon,': 0.24; 'together.': 0.24; 'cc:2**0': 0.24; 'script': 0.25; 'purposes': 0.26; 'query': 0.26; 'this:': 0.26; 'values': 0.27; 'gets': 0.27; 'header:In-Reply-To:1': 0.27; 'function': 0.29; 'chris': 0.29; '[1]': 0.29; 'am,': 0.29; 'points': 0.29; "doesn't": 0.30; 'database,': 0.30; 'message-id:@mail.gmail.com': 0.30; "i'm": 0.30; 'posting': 0.31; 'reply.': 0.31; 'url:wiki': 0.31; '13,': 0.31; 'grouping': 0.31; 'larry': 0.31; 'probably': 0.32; 'figure': 0.32; 'run': 0.32; 'quite': 0.32; 'says': 0.33; 'running': 0.33; 'are:': 0.33; 'table': 0.34; "i'd": 0.34; 'could': 0.34; "can't": 0.35; 'but': 0.35; 'received:google.com': 0.35; 'building': 0.35; 'there': 0.35; 'subject:data': 0.36; 'doing': 0.36; "didn't": 0.36; 'thanks': 0.36; 'possible': 0.36; 'url:org': 0.36; 'should': 0.36; 'example,': 0.37; 'so,': 0.37; 'too': 0.37; 'two': 0.37; 'list': 0.37; 'window': 0.38; 'whatever': 0.38; 'issue': 0.38; 'pm,': 0.38; 'previous': 0.38; 'that,': 0.38; 'does': 0.39; '12,': 0.39; 'sure': 0.39; 'either': 0.39; 'how': 0.40; 'analyze': 0.60; 'removing': 0.60; 'numbers': 0.61; "you're": 0.61; 'further': 0.61; 'back': 0.62; 'such': 0.63; 'more': 0.64; 'to:addr:gmail.com': 0.65; 'within': 0.65; 'effectively': 0.66; 'worth': 0.66; 'results': 0.69; 'business': 0.70; 'groups.': 0.74; 'analysis': 0.75; '(ie': 0.84; '10.1': 0.84; '9.5': 0.84; 'how.': 0.84; 'tolerance': 0.84; 'tricky': 0.84; 'thing,': 0.91; 'this!': 0.93 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=NtnZm2sDOaEC1eplIdVVipPHTh/pkH0vfCsrmBM2YeI=; b=WVr5iX+ujsjLLCLjub31C/kSapDp2esliqKF836r2FxHGXHmUjHk2ZTPRMcA5fs5Ot 59oxLMH2hgxmDeRlxggE4KoZbn0Hhs0TOAGQnfSARZxpuuEuPe0GkiOqyufX2RVOoLfw kLIBHGqfQgwz56thCaKQeUBuen7Ovi+HueYikwpBOrwiP9cBM6BvQBcvARPYlez88vmv HvAWt1q8aAHSrlK0OKadb2flHf19EiH4tImbl3v6yVWGD31ASlenOvLXsLkpjlx0u911 zSp9KBTfYWFecjYSUQFcFai7H/s2Hc7Euv5QBulLfrdMbAC2O4McLahJ2sqUKEGOXmqQ 8Oww== MIME-Version: 1.0 X-Received: by 10.194.123.8 with SMTP id lw8mr19571597wjb.40.1389584147400; Sun, 12 Jan 2014 19:35:47 -0800 (PST) In-Reply-To: References: Date: Sun, 12 Jan 2014 22:35:47 -0500 Subject: Re: efficient way to process data From: Larry Martell To: Chris Angelico Content-Type: text/plain; charset=UTF-8 Cc: "python-list@python.org" X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 90 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1389584155 news.xs4all.nl 2854 [2001:888:2000:d::a6]:54974 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:63811 On Sun, Jan 12, 2014 at 6:27 PM, Chris Angelico wrote: > On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell wrote: >> I have an python app that queries a MySQL DB. The query has this form: >> >> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f >> FROM t >> GROUP BY a, b, c, d, f >> >> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or >> 10053.490, 2542.094). >> >> The business issue is that if either x or y in 2 rows that are in the >> same a, b, c, d group are within 1 of each other then they should be >> grouped together. And to make it more complicated, the tolerance is >> applied as a rolling continuum. For example, if the x and y in a set >> of grouped rows are: >> >> row 1: 1.5, 9.5 >> row 2: 2.4, 20.8 >> row 3: 3.3, 40.6 >> row 4: 4.2, 2.5 >> row 5: 5.1, 10.1 >> row 6: 6.0, 7.9 >> row 7: 8.0, 21.0 >> row 8: 100, 200 >> >> 1 through 6 get combined because all their X values are within the >> tolerance of some other X in the set that's been combined. 7's Y value >> is within the tolerance of 2's Y, so that should be combined as well. >> 8 is not combined because neither the X or Y value is within the >> tolerance of any X or Y in the set that was combined. > > Trying to get my head around this a bit more. Are columns a/b/c/d > treated as a big category (eg type, brand, category, model), such that > nothing will ever be grouped that has any difference in those four > columns? If so, we can effectively ignore them and pretend we have a > table with exactly one set (eg stick a WHERE clause onto the query > that stipulates their values). Then what you have is this: > > * Aggregate based on proximity of x and y > * Emit results derived from e > > Is that correct? There will be multiple groups of a/b/c/d. I simplified the query for the purposes of posting my question. There is a where clause with values that come from user input. None, any, or all of a, b, c, or d could be in the where clause. > So here's my way of writing it. > > * Subselect: List all values for x, in order, and figure out which > ones are less than the previous value plus one > * Subselect: Ditto, for y. > * Outer select: Somehow do an either-or group. I'm not quite sure how > to do that part, actually! > > A PGSQL window function would cover the two subselects - at least, I'm > fairly sure it would. I can't quite get the whole thing, though; I can > get a true/false flag that says whether it's near to the previous one > (that's easy), and creating a grouping column value should be possible > from that but I'm not sure how. > > But an either-or grouping is a bit trickier. The best I can think of > is to collect all the y values for each group of x values, and then if > any two groups 'overlap' (ie have points within 1.0 of each other), > merge the groups. That's going to be seriously tricky to do in SQL, I > think, so you may have to go back to Python on that one. > > My analysis suggests that, whatever happens, you're going to need > every single y value somewhere. So it's probably not worth trying to > do any grouping/aggregation in SQL, since you need to further analyze > all the individual data points. I can't think of any way better than > just leafing through the whole table (either in Python or in a stored > procedure - if you can run your script on the same computer that's > running the database, I'd do that, otherwise consider a stored > procedure to reduce network transfers) and building up mappings. > > Of course, "I can't think of a way" does not equate to "There is no > way". There may be some magic trick that I didn't think of, or some > arcane incantation that gets what you want. Who knows? If you can > produce an ASCII art Mandelbrot set [1] in pure SQL, why not this! > > ChrisA > > [1] http://wiki.postgresql.org/wiki/Mandelbrot_set Thanks for the reply. I'm going to take a stab at removing the group by and doing it all in python. It doesn't look too hard, but I don't know how it will perform.