Re: efficient way to process data

Date	Mon, 13 Jan 2014 13:27:37 -0500
Subject	Re: efficient way to process data
From	Larry Martell <larry.martell@gmail.com>
To	Chris Angelico <rosuav@gmail.com>
Newsgroups	comp.lang.python
Message-ID	<mailman.5422.1389637659.18130.python-list@python.org>

On Mon, Jan 13, 2014 at 1:09 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, Jan 13, 2014 at 2:35 PM, Larry Martell <larry.martell@gmail.com> wrote:
>> Thanks for the reply. I'm going to take a stab at removing the GROUP
>> BY and doing it all in Python. It doesn't look too hard, but I don't
>> know how it will perform.
>
> Well, if you can't switch to PostgreSQL or such, then doing it in
> Python is your only option. There are such things as GiST and GIN
> indexes that might be able to do some of this magic, but I don't think
> MySQL has anything even remotely like what you're looking for.
>
> So ultimately, you're going to have to do your filtering on the
> database, and then all the aggregation in Python. And it's going to be
> somewhat complicated code, too. Best I can think of is this, as
> partial pseudo-code:
>
> last_x = -999
> x_map = []; y_map = {}
> merge_me = []
> for x,y,e in (SELECT x,y,e FROM t WHERE whatever ORDER BY x):
>     if x<last_x+1:
>         x_map[-1].append((y,e))
>     else:
>         x_map.append([(y,e)])
>     last_x=x
>     if y in y_map:
>         merge_me.append((y_map[y], x_map[-1]))
>     y_map[y]=x_map[-1]
>
> # At this point, you have x_map which is a list of lists, each one
> # being one group, and y_map which maps a y value to its x_map list.
>
> last_y = -999
> for y in sorted(y_map.keys()):
>     if y<last_y+1:
>         merge_me.append((y_map[y], last_x_map))
>     last_y=y
>     last_x_map=y_map[y]
>
> for merge1,merge2 in merge_me:
>     merge1.extend(merge2)
>     merge2[:]=[] # Empty out the list
>
> for lst in x_map:
>     if not lst: continue # been emptied out, ignore it
>     do aggregate stats, get sum(lst) and whatever else
>
> I think this should be linear complexity overall, but there may be a
> few aspects of it that are quadratic. It's a tad messy though, and
> completely untested. But that's an algorithmic start. The idea is that
> lists get collected based on x proximity, and then lists get merged
> based on y proximity. That is, if you have (1.0, 10.1), (1.5, 2.3),
> (3.0, 11.0), (3.2, 15.2), they'll all be treated as a single
> aggregation unit. If that's not what you want, I'm not sure how to
> handle it.
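
For reference, the grouping idea in the quoted pseudo-code (chain rows together when neighbouring x values, or neighbouring y values, are less than 1 apart) can be sketched as runnable Python. This is only a sketch under assumptions, not the code from the thread: it swaps the list-merging step for a small union-find, it assumes the rows have already been fetched from the database as (x, y, e) tuples, and the names `group_by_proximity`, `gap` and the sample `e` values are made up for illustration. Because of the two sorts it is O(n log n) rather than strictly linear.

```python
def group_by_proximity(rows, gap=1.0):
    """Group (x, y, e) rows whose x or y values chain together within `gap`.

    Hypothetical helper: `rows` is assumed to be a list of tuples already
    fetched from the database; `gap` mirrors the `+1` in the pseudo-code.
    """
    parent = list(range(len(rows)))

    def find(i):
        # Walk up to the root of i, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Link neighbours in x order (axis 0), then in y order (axis 1).
    for axis in (0, 1):
        order = sorted(range(len(rows)), key=lambda i: rows[i][axis])
        for prev, cur in zip(order, order[1:]):
            if rows[cur][axis] - rows[prev][axis] < gap:
                union(prev, cur)

    # Collect rows by the root of their set; each set is one aggregation unit.
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(find(i), []).append(row)
    return list(groups.values())


# The four points from the message above, plus one far-away row.
# The e values are invented for the example.
rows = [(1.0, 10.1, 5), (1.5, 2.3, 7), (3.0, 11.0, 1), (3.2, 15.2, 2),
        (9.0, 40.0, 4)]
for group in group_by_proximity(rows):
    # Aggregate per group: row count and sum of e.
    print(len(group), sum(e for _, _, e in group))
```

With the sample data, the first four rows end up in one group (linked through x proximity and the 10.1/11.0 y pair) and the last row stays on its own. Using union-find avoids the case where a list would be merged into itself and then emptied.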

Thanks. Unfortunately this has been made a low priority task and I've
been put on to something else (I hate when they do that). I'll revive
this thread when I'm allowed to get back on this.
