Re: efficient way to process data

Path	csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Return-Path	<larry.martell@gmail.com>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
X-Spam-Status	OK 0.001
X-Spam-Evidence	'H': 1.00; 'S': 0.00; 'output': 0.05; 'correct.': 0.07; 'string': 0.09; 'converted': 0.09; 'performs': 0.09; 'rows': 0.09; 'subject:process': 0.09; 'cc:addr:python-list': 0.11; 'python': 0.11; 'jan': 0.12; '"a"': 0.16; '"x"': 0.16; '(x,': 0.16; 'comma': 0.16; 'complicated,': 0.16; 'decimal.': 0.16; 'delimiter': 0.16; 'emit': 0.16; 'equal.': 0.16; 'germane': 0.16; 'grouped': 0.16; 'latter,': 0.16; 'partitioning': 0.16; '(you': 0.16; 'wrote:': 0.18; 'looked': 0.18; 'app': 0.19; 'basically': 0.19; 'working.': 0.19; 'not,': 0.20; 'input': 0.22; 'cc:addr:python.org': 0.22; 'char': 0.24; 'form:': 0.24; 'specify': 0.24; 'together.': 0.24; 'file.': 0.24; 'cc:2**0': 0.24; 'query': 0.26; 'primary': 0.26; 'header:In-Reply-To:1': 0.27; 'record': 0.27; 'correct': 0.29; 'message- id:@mail.gmail.com': 0.30; "i'm": 0.30; 'comparison': 0.31; 'decimal': 0.31; 'larry': 0.31; 'probably': 0.32; 'option': 0.32; 'are:': 0.33; 'not.': 0.33; 'maybe': 0.34; 'could': 0.34; 'problem': 0.35; 'something': 0.35; 'operate': 0.35; 'requirement': 0.35; 'but': 0.35; 'received:google.com': 0.35; 'joined': 0.36; 'subject:data': 0.36; 'doing': 0.36; "i'll": 0.36; 'should': 0.36; 'example,': 0.37; 'two': 0.37; 'being': 0.38; 'issue': 0.38; 'pm,': 0.38; '12,': 0.39; 'sure': 0.39; 'either': 0.39; 'how': 0.40; 'even': 0.60; 'numbers': 0.61; 'matter': 0.61; 'strictly': 0.61; 'first': 0.61; 'more': 0.64; 'within': 0.65; 'business': 0.70; 'yourself': 0.78; 'costly': 0.84; 'performs.': 0.84; 'tolerance': 0.84
DKIM-Signature	v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=+LpY0m+qacWVHLU5Kh5bjSBBuVa8iw3DBOjkbire2b0=; b=QDMWy7tsYhgEHvv0vXL93Aje7hhsmSE2Z/I4ZeeVDL9NwuKUzDCa49bTNsHlvM908g sbayuj3sdxE1SFwGMUvPSxz4VAiA22bCngTKeKVvNY8konaZl52l6LnY6gjSKEGaCsbS 0VvqW4O1i/o8jqnGHMjX4zjScFE+LV6TUYymjyN8IQWaLWtSJS4G4hNWVhHUBsYXtjB2 A2WMHdXgMINNcSSpzAYaYafJIYouvZqTqB+Xv5vAXbVzg/pLE2WIzMmxGyIyfX9xvLz9 U90ayl0bWIvzE1oGhJ+kHZ77oOhOuhAkIKOeH4oYvW0siCZNbsnY1WcOE/UZz+h8FKYR W5AQ==
MIME-Version	1.0
X-Received	by 10.194.175.225 with SMTP id cd1mr33282wjc.93.1389583557975; Sun, 12 Jan 2014 19:25:57 -0800 (PST)
In-Reply-To	<rm56d9htr0u383gur3moj552d56nvivg66@4ax.com>
References	<CACwCsY6KBDVkS5jCMh9GyvhHyVgqcAH3YAYnGpMQvfBwexaTcw@mail.gmail.com> <rm56d9htr0u383gur3moj552d56nvivg66@4ax.com>
Date	Sun, 12 Jan 2014 22:25:57 -0500
Subject	Re: efficient way to process data
From	Larry Martell <larry.martell@gmail.com>
To	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Content-Type	text/plain; charset=UTF-8
Cc	"python-list@python.org" <python-list@python.org>
X-BeenThere	python-list@python.org
X-Mailman-Version	2.1.15
Precedence	list
List-Id	General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe	<https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive	<http://mail.python.org/pipermail/python-list/>
List-Post	<mailto:python-list@python.org>
List-Help	<mailto:python-list-request@python.org?subject=help>
List-Subscribe	<https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Newsgroups	comp.lang.python
Message-ID	<mailman.5393.1389583565.18130.python-list@python.org> (permalink)
Lines	63
NNTP-Posting-Host	2001:888:2000:d::a6
X-Trace	1389583565 news.xs4all.nl 2829 [2001:888:2000:d::a6]:51608
X-Complaints-To	abuse@xs4all.nl
Xref	csiph.com comp.lang.python:63810

Show key headers only | View raw

On Sun, Jan 12, 2014 at 5:43 PM, Dennis Lee Bieber
<wlfraed@ix.netcom.com> wrote:
> On Sun, 12 Jan 2014 14:23:17 -0500, Larry Martell <larry.martell@gmail.com>
> declaimed the following:
>
>>I have an python app that queries a MySQL DB. The query has this form:
>>
>>SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>>FROM t
>>GROUP BY a, b, c, d, f
>>
>>x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>>10053.490, 2542.094).
>>
>
>         Decimal (Numeric) or floating/real. If the latter, the internal storage
> may not be exact (378.1811111111 and 378.179999999 may both "display" as
> 378.18, but will not match for grouping).

In the database they are decimal. They are being converted to char by
the CONCAT(x, ',', y).

>>The business issue is that if either x or y in 2 rows that are in the
>>same a, b, c, d group are within 1 of each other then they should be
>>grouped together. And to make it more complicated, the tolerance is
>>applied as a rolling continuum. For example, if the x and y in a set
>>of grouped rows are:
>>
>         As I understand group by, it will first group by "a", WITHIN the "a"
> groups it will then group by "b"... Probably not a matter germane to the
> problem as you are concerning yourself with the STRING representation of
> "x" and "y" with a comma delimiter -- which is only looked at if the
> "a,b,c,d" are equal... Thing is, a string comparison is going to operate
> strictly left to right -- it won't even see your "y" value unless all the
> "x" value is equal.

Yes, that is correct. The original requirement was to group by (X, Y),
so the CONCAT(x, ',', y) was correct and working. Then the requirement
was change to apply the tolerance as I described.

>
>         You may need to operate using subselects... So that you can specify
> something like
>
>         where   abs(s1.x -s2.x) < tolerance or abs(s1.y-s2.y) < tolerance
>                 and (s1.a = s2.a ... s1.d = s2.d)
>
> s1/s1 are the subselects (you may need a primary key <> primary key to
> avoid having it output a record where the two subselects are for the SAME
> record -- or maybe not, since you /do/ want that record also output). Going
> to be a costly query since you are basically doing
>
>         foreach r1 in s1
>                 foreach r2 in s2
>                         emit r2 when...

Speed is an issue here, and while the current query performs well, in
my experience subqueries and self joins do not. I'm going to try and
do it all in python and see how it performs. The other option is to
pre-process the data on the way into the database. Doing that will
eliminate some of the data partitioning as all of the data that could
be joined will be in the same input file. I'm just not sure if it will
OK to actually munge the data. I'll find that out tomorrow.

Back to comp.lang.python | Previous | Next | Find similar | Unroll thread

Thread

Re: efficient way to process data Larry Martell <larry.martell@gmail.com> - 2014-01-12 22:25 -0500

csiph-web