Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #63806
| Path | csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.tele.dk!feed118.news.tele.dk!news.tele.dk!small.news.tele.dk!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <rosuav@gmail.com> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| X-Spam-Status | OK 0.001 |
| X-Spam-Evidence | '*H*': 1.00; '*S*': 0.00; 'aggregate': 0.07; 'column': 0.07; 'think,': 0.07; 'ascii': 0.09; 'clause': 0.09; 'derived': 0.09; 'part,': 0.09; 'pretend': 0.09; 'rows': 0.09; 'sql,': 0.09; 'subject:process': 0.09; 'type,': 0.09; 'cc:addr :python-list': 0.11; 'python': 0.11; 'jan': 0.12; 'stored': 0.12; '(either': 0.16; '1.5,': 0.16; '3.3,': 0.16; '6.0,': 0.16; 'columns': 0.16; 'combined.': 0.16; 'complicated,': 0.16; 'emit': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'grouped': 0.16; 'magic': 0.16; 'other),': 0.16; 'pgsql': 0.16; 'somewhere.': 0.16; 'ignore': 0.16; 'wrote:': 0.18; 'bit': 0.19; 'app': 0.19; 'trying': 0.19; 'select': 0.22; 'cc:addr:python.org': 0.22; 'creating': 0.23; 'form:': 0.24; 'merge': 0.24; 'stick': 0.24; 'fairly': 0.24; 'mon,': 0.24; 'together.': 0.24; 'cc:2**0': 0.24; 'script': 0.25; 'query': 0.26; 'this:': 0.26; 'values': 0.27; 'gets': 0.27; 'header:In-Reply-To:1': 0.27; 'function': 0.29; '[1]': 0.29; 'am,': 0.29; 'points': 0.29; 'database,': 0.30; 'message-id:@mail.gmail.com': 0.30; "i'm": 0.30; 'url:wiki': 0.31; '13,': 0.31; 'grouping': 0.31; 'larry': 0.31; 'probably': 0.32; 'figure': 0.32; 'run': 0.32; 'quite': 0.32; 'says': 0.33; 'running': 0.33; 'are:': 0.33; 'table': 0.34; "i'd": 0.34; "can't": 0.35; 'but': 0.35; 'received:google.com': 0.35; 'building': 0.35; 'there': 0.35; 'subject:data': 0.36; "didn't": 0.36; 'possible': 0.36; 'url:org': 0.36; 'should': 0.36; 'example,': 0.37; 'so,': 0.37; 'two': 0.37; 'list': 0.37; 'window': 0.38; 'whatever': 0.38; 'issue': 0.38; 'previous': 0.38; 'that,': 0.38; 'does': 0.39; 'sure': 0.39; 'either': 0.39; 'how': 0.40; 'analyze': 0.60; 'numbers': 0.61; "you're": 0.61; 'further': 0.61; 'back': 0.62; 'such': 0.63; 'more': 0.64; 'within': 0.65; 'effectively': 0.66; 'worth': 0.66; 'results': 0.69; 'business': 0.70; 'groups.': 0.74; 'analysis': 0.75; '(ie': 0.84; '10.1': 0.84; '9.5': 0.84; 'how.': 0.84; 'tolerance': 0.84; 'tricky': 0.84; 'thing,': 0.91; 'to:none': 0.92; 'this!': 0.93 |
| DKIM-Signature | v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=kG1fnvmpABleSh3MsOZK8yBbj44xjhqg2gNszLpbquU=; b=y6TZTO1XZ/EL9jhSpmpx7fH398xHVyvduZnfkriRRY88//Rf18UkiptTA5ULqQGo2x RN7l1RKDtMxRUVFaf1oKYk60Gy3Bf/93InTvTUxGEF4MpimfIae+PdVtWr7EvhyrndRq HvKlmtFsLBR0ldV/AWRUW2xe7ZgPLs7IPvnk/6qZpbDXBe7vRFHuo3I8+rUAceIWZJmq tBJmGfiRfCDGp3xRvsN2VjKlfmINSnfbWtQCODQeiNK/UQ2rgWZjbWp+FwAs8lXZ7dtS rvijSv5W/yrrJ5nbO0A491OZ3gu0rElscgdFvvnVjP2OI+lnJ3kXj6plIn/yUbcsP96P arSA== |
| MIME-Version | 1.0 |
| X-Received | by 10.68.183.164 with SMTP id en4mr16874pbc.169.1389569222184; Sun, 12 Jan 2014 15:27:02 -0800 (PST) |
| In-Reply-To | <CACwCsY6KBDVkS5jCMh9GyvhHyVgqcAH3YAYnGpMQvfBwexaTcw@mail.gmail.com> |
| References | <CACwCsY6KBDVkS5jCMh9GyvhHyVgqcAH3YAYnGpMQvfBwexaTcw@mail.gmail.com> |
| Date | Mon, 13 Jan 2014 10:27:02 +1100 |
| Subject | Re: efficient way to process data |
| From | Chris Angelico <rosuav@gmail.com> |
| Cc | "python-list@python.org" <python-list@python.org> |
| Content-Type | text/plain; charset=UTF-8 |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.15 |
| Precedence | list |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list/> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.5390.1389569231.18130.python-list@python.org> (permalink) |
| Lines | 80 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1389569231 news.xs4all.nl 2830 [2001:888:2000:d::a6]:37619 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | csiph.com comp.lang.python:63806 |
Show key headers only | View raw
On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry.martell@gmail.com> wrote: > I have an python app that queries a MySQL DB. The query has this form: > > SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f > FROM t > GROUP BY a, b, c, d, f > > x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or > 10053.490, 2542.094). > > The business issue is that if either x or y in 2 rows that are in the > same a, b, c, d group are within 1 of each other then they should be > grouped together. And to make it more complicated, the tolerance is > applied as a rolling continuum. For example, if the x and y in a set > of grouped rows are: > > row 1: 1.5, 9.5 > row 2: 2.4, 20.8 > row 3: 3.3, 40.6 > row 4: 4.2, 2.5 > row 5: 5.1, 10.1 > row 6: 6.0, 7.9 > row 7: 8.0, 21.0 > row 8: 100, 200 > > 1 through 6 get combined because all their X values are within the > tolerance of some other X in the set that's been combined. 7's Y value > is within the tolerance of 2's Y, so that should be combined as well. > 8 is not combined because neither the X or Y value is within the > tolerance of any X or Y in the set that was combined. Trying to get my head around this a bit more. Are columns a/b/c/d treated as a big category (eg type, brand, category, model), such that nothing will ever be grouped that has any difference in those four columns? If so, we can effectively ignore them and pretend we have a table with exactly one set (eg stick a WHERE clause onto the query that stipulates their values). Then what you have is this: * Aggregate based on proximity of x and y * Emit results derived from e Is that correct? So here's my way of writing it. * Subselect: List all values for x, in order, and figure out which ones are less than the previous value plus one * Subselect: Ditto, for y. * Outer select: Somehow do an either-or group. I'm not quite sure how to do that part, actually! A PGSQL window function would cover the two subselects - at least, I'm fairly sure it would. I can't quite get the whole thing, though; I can get a true/false flag that says whether it's near to the previous one (that's easy), and creating a grouping column value should be possible from that but I'm not sure how. But an either-or grouping is a bit trickier. The best I can think of is to collect all the y values for each group of x values, and then if any two groups 'overlap' (ie have points within 1.0 of each other), merge the groups. That's going to be seriously tricky to do in SQL, I think, so you may have to go back to Python on that one. My analysis suggests that, whatever happens, you're going to need every single y value somewhere. So it's probably not worth trying to do any grouping/aggregation in SQL, since you need to further analyze all the individual data points. I can't think of any way better than just leafing through the whole table (either in Python or in a stored procedure - if you can run your script on the same computer that's running the database, I'd do that, otherwise consider a stored procedure to reduce network transfers) and building up mappings. Of course, "I can't think of a way" does not equate to "There is no way". There may be some magic trick that I didn't think of, or some arcane incantation that gets what you want. Who knows? If you can produce an ASCII art Mandelbrot set [1] in pure SQL, why not this! ChrisA [1] http://wiki.postgresql.org/wiki/Mandelbrot_set
Back to comp.lang.python | Previous | Next | Find similar | Unroll thread
Re: efficient way to process data Chris Angelico <rosuav@gmail.com> - 2014-01-13 10:27 +1100
csiph-web