Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #63812

Re: efficient way to process data

Path csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.tele.dk!feed118.news.tele.dk!news.tele.dk!small.news.tele.dk!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Return-Path <rosuav@gmail.com>
X-Original-To python-list@python.org
Delivered-To python-list@mail.python.org
X-Spam-Status OK 0.002
X-Spam-Evidence '*H*': 1.00; '*S*': 0.00; 'python.': 0.02; 'else:': 0.03; 'aggregate': 0.07; 'postgresql': 0.07; '[];': 0.09; 'indexes': 0.09; 'linear': 0.09; 'lst': 0.09; 'subject:process': 0.09; 'cc:addr:python-list': 0.11; 'python': 0.11; 'jan': 0.12; '-999': 0.16; 'algorithmic': 0.16; 'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16; 'merged': 0.16; 'messy': 0.16; 'such,': 0.16; 'ignore': 0.16; 'wrote:': 0.18; 'code,': 0.22; 'cc:addr:python.org': 0.22; 'filtering': 0.24; 'mon,': 0.24; 'cc:2**0': 0.24; 'switch': 0.26; 'header:In-Reply-To:1': 0.27; 'idea': 0.28; "doesn't": 0.30; 'database,': 0.30; 'message- id:@mail.gmail.com': 0.30; "i'm": 0.30; 'reply.': 0.31; 'too.': 0.31; '13,': 0.31; 'larry': 0.31; 'option.': 0.31; 'remotely': 0.31; "they'll": 0.31; 'lists': 0.32; "can't": 0.35; 'but': 0.35; 'received:google.com': 0.35; 'there': 0.35; 'subject:data': 0.36; 'doing': 0.36; 'thanks': 0.36; 'should': 0.36; 'too': 0.37; 'list': 0.37; 'list.': 0.37; 'being': 0.38; 'handle': 0.38; 'whatever': 0.38; 'pm,': 0.38; 'anything': 0.39; 'aspects': 0.39; 'though,': 0.39; 'sure': 0.39; 'how': 0.40; 'even': 0.60; 'removing': 0.60; "you're": 0.61; 'such': 0.63; 'group,': 0.63; 'complexity': 0.84; 'overall,': 0.84; 'partial': 0.84; 'start.': 0.84; 'x):': 0.84; 'to:none': 0.92
DKIM-Signature v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:cc :content-type; bh=7/s/mE2ojA1rU41GOmtwH0dvXxuvSxC6gq4HSZD1ENU=; b=cczk91SsO4d53OPJRKZV+VALkLqryeHOag2MsP5kyaxep/iSNFK0Jjeaj8jY2oEzz4 Y6aUnj91wLxmCi4vgnc4nkGKdsfL77lpEzzWTqOs3oOjp+1zl40tGZBJcS9pSIfw68TD /VSJ3ZRXayEoxunlHlB4ihvfOhGATDEQSpl0hPKCfxvpiShwHKJGk+dDJHTHKYl4H4V1 yW440VbkRat9oRMBO9K76gxj458MtFCVAUmAUrX3uPHj8x9rfeyhp47Iq5ARbIZjiSpz 9q/wxtfz3EHg2PXpNGdZOBtab2lSeNMv81FQrKvzL/WgTrMWZ1TSvt0lKXcz4CBsYB20 HYpQ==
MIME-Version 1.0
X-Received by 10.68.247.6 with SMTP id ya6mr27404027pbc.45.1389593399741; Sun, 12 Jan 2014 22:09:59 -0800 (PST)
In-Reply-To <CACwCsY6W3x+8QzkJ8GR7qeck9s4xkwkhUbRt4k2iBAGcFuUb2g@mail.gmail.com>
References <CACwCsY6KBDVkS5jCMh9GyvhHyVgqcAH3YAYnGpMQvfBwexaTcw@mail.gmail.com> <CAPTjJmpGKzCbB4ZMZG=UAhh5hq8JowUOuerS1dk-rt3T6_qCyw@mail.gmail.com> <CACwCsY6W3x+8QzkJ8GR7qeck9s4xkwkhUbRt4k2iBAGcFuUb2g@mail.gmail.com>
Date Mon, 13 Jan 2014 17:09:59 +1100
Subject Re: efficient way to process data
From Chris Angelico <rosuav@gmail.com>
Cc "python-list@python.org" <python-list@python.org>
Content-Type text/plain; charset=UTF-8
X-BeenThere python-list@python.org
X-Mailman-Version 2.1.15
Precedence list
List-Id General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive <http://mail.python.org/pipermail/python-list/>
List-Post <mailto:python-list@python.org>
List-Help <mailto:python-list-request@python.org?subject=help>
List-Subscribe <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Newsgroups comp.lang.python
Message-ID <mailman.5395.1389593409.18130.python-list@python.org> (permalink)
Lines 56
NNTP-Posting-Host 2001:888:2000:d::a6
X-Trace 1389593409 news.xs4all.nl 2858 [2001:888:2000:d::a6]:53526
X-Complaints-To abuse@xs4all.nl
Xref csiph.com comp.lang.python:63812

Show key headers only | View raw


On Mon, Jan 13, 2014 at 2:35 PM, Larry Martell <larry.martell@gmail.com> wrote:
> Thanks for the reply. I'm going to take a stab at removing the group
> by and doing it all in python. It doesn't look too hard, but I don't
> know how it will perform.

Well, if you can't switch to PostgreSQL or such, then doing it in
Python is your only option. There are such things as GiST and GIN
indexes that might be able to do some of this magic, but I don't think
MySQL has anything even remotely like what you're looking for.

So ultimately, you're going to have to do your filtering on the
database, and then all the aggregation in Python. And it's going to be
somewhat complicated code, too. Best I can think of is this, as
partial pseudo-code:

last_x = -999
x_map = []; y_map = {}
merge_me = []
for x,y,e in (SELECT x,y,e FROM t WHERE whatever ORDER BY x):
    if x<last_x+1:
        x_map[-1].append((y,e))
    else:
        x_map.append([(y,e)])
    last_x=x
    if y in y_map:
        merge_me.append((y_map[y], x_map[-1]))
    y_map[y]=x_map[-1]

# At this point, you have x_map which is a list of lists, each one
# being one group, and y_map which maps a y value to its x_map list.

last_y = -999
for y in sorted(y_map.keys()):
    if y<last_y+1:
        merge_me.append((y_map[y], last_x_map))
    last_y=y
    last_x_map=y_map[y]

for merge1,merge2 in merge_me:
    merge1.extend(merge2)
    merge2[:]=[] # Empty out the list

for lst in x_map:
    if not lst: continue # been emptied out, ignore it
    do aggregate stats, get sum(lst) and whatever else

I think this should be linear complexity overall, but there may be a
few aspects of it that are quadratic. It's a tad messy though, and
completely untested. But that's an algorithmic start. The idea is that
lists get collected based on x proximity, and then lists get merged
based on y proximity. That is, if you have (1.0, 10.1), (1.5, 2.3),
(3.0, 11.0), (3.2, 15.2), they'll all be treated as a single
aggregation unit. If that's not what you want, I'm not sure how to
handle it.

ChrisA

Back to comp.lang.python | Previous | Next | Find similar | Unroll thread


Thread

Re: efficient way to process data Chris Angelico <rosuav@gmail.com> - 2014-01-13 17:09 +1100

csiph-web