Python Data Analysis Recommendations

Started by	Rob Gaddi <rgaddi@highlandtechnology.invalid>
First post	2015-12-31 17:15 +0000
Last post	2016-01-01 23:25 +0530
Articles	4 — 4 participants

  Python Data Analysis Recommendations Rob Gaddi <rgaddi@highlandtechnology.invalid> - 2015-12-31 17:15 +0000
    Re: Python Data Analysis Recommendations Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-01-01 21:24 +0000
      Re: Python Data Analysis Recommendations Ravi Narasimhan <backscatter@rettacs.org> - 2016-01-01 14:16 -0800
    Re: Python Data Analysis Recommendations Sameer Grover <sameer.grover.1@gmail.com> - 2016-01-01 23:25 +0530

#101066 — Python Data Analysis Recommendations

From	Rob Gaddi <rgaddi@highlandtechnology.invalid>
Date	2015-12-31 17:15 +0000
Subject	Python Data Analysis Recommendations
Message-ID	<n63nrs$5c0$1@dont-email.me>

I'm looking for some advice on handling data collection/analysis in
Python.  I do a lot of big, time-consuming experiments: I run a long
data collection (a day or a weekend) that sweeps a bunch of variables,
then come back offline and try to cut the data into something that
makes sense.

For example, my last data collection looked (neglecting all the actual
equipment control code in each loop) like:

for t in temperatures:
  for r in voltage_ranges:
    for v in test_voltages[r]:
      for c in channels:
        for n in range(100):
          record_data()

I've been using SQLite (through peewee) as the data backend, setting up
a couple of tables with a basically hierarchical relationship, and then
handling analysis with a rough cut of SQL queries against the
original data, NumPy/SciPy for further refinement, and Matplotlib
to actually do the visualization.  For example, one graph was "How does
the slope of a straight-line fit between measured and applied voltage vary
as a function of temperature on each channel?"
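
A rough sketch of that kind of workflow, under several assumptions: the
single flat `measurements` table, its column names, and the synthetic gain
model are invented for illustration, the standard-library sqlite3 module
stands in for peewee, and a small pure-Python least-squares helper stands in
for NumPy/SciPy.

```python
import sqlite3
from collections import defaultdict

# Hypothetical flat schema: one row per recorded sample.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    "temperature REAL, channel INTEGER, applied REAL, measured REAL)"
)

# Synthetic stand-in for record_data(): the gain drifts slightly with temperature.
rows = [
    (t, c, v, (1.0 + 0.001 * t) * v)
    for t in (0.0, 25.0, 50.0)
    for c in (0, 1)
    for v in (-1.0, 0.0, 1.0, 2.0)
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?)", rows)


def fit_slope(points):
    """Least-squares slope of measured vs. applied for (applied, measured) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Rough cut in SQL, then group the rows by (channel, temperature) in Python.
groups = defaultdict(list)
query = "SELECT channel, temperature, applied, measured FROM measurements"
for channel, temperature, applied, measured in conn.execute(query):
    groups[(channel, temperature)].append((applied, measured))

# One slope per (channel, temperature) pair, ready to be plotted.
slopes = {key: fit_slope(points) for key, points in groups.items()}
```

The manual grouping step in the middle is the sort of ad-hoc stitching that
tends to get repeated for every new plot.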

The whole process feels a bit grindy; like I keep having to do a lot of
ad-hoc stitching things together.  And I keep hearing about pandas,
PyTables, and HDF5.  Would that be making my life notably easier?  If
so, does anyone have any references on it that they've found
particularly useful?  The tutorials I've seen so far seem to not give
much detail on what the point of what they're doing is; it's all "how
you write the code" rather than "why you write the code".  Paying money
for books is acceptable; this is all on the company's time/dime.

Thanks,
Rob

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com

Email address domain is currently out of order.  See above to fix.

#101122

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2016-01-01 21:24 +0000
Message-ID	<mailman.151.1451683499.11925.python-list@python.org>
In reply to	#101066

I'd start with pandas (http://pandas.pydata.org/) and see how you get on.

If and only if pandas isn't adequate, and I think that highly unlikely, 
try PyTables.  Quoting from http://www.pytables.org/ "PyTables is a 
package for managing hierarchical datasets and designed to efficiently 
and easily cope with extremely large amounts of data." and "PyTables is 
built on top of the HDF5 library".  I've no idea what the definition of 
"extremely large" is in this case.  How much data are you dealing with?

I don't understand your comment about tutorials.  Once they've given you 
an introduction to the tool, isn't it your responsibility to manipulate 
your data in the way that suits you?  If you can't do that, either 
you're doing something wrong, or the tool is inadequate for the task. 
For the latter I believe you've two options, find another tool or write 
your own.

I would not buy books, on the simple grounds that they go out of date 
far faster than the online docs :)

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

#101123

From	Ravi Narasimhan <backscatter@rettacs.org>
Date	2016-01-01 14:16 -0800
Message-ID	<n66tnm$6lf$1@dont-email.me>
In reply to	#101122

On 1/1/16 1:24 PM, Mark Lawrence wrote:
 > On 31/12/2015 17:15, Rob Gaddi wrote:
 >> I'm looking for some advice on handling data collection/analysis in
 >> Python.  ...
 >> The whole process feels a bit grindy; like I keep having to do a lot of
 >> ad-hoc stitching things together.  And I keep hearing about pandas,
 >> PyTables, and HDF5.  Would that be making my life notably easier?  If
 >> so, does anyone have any references on it that they've found
 >> particularly useful?  The tutorials I've seen so far seem to not give
 >> much detail on what the point of what they're doing is; it's all "how
 >> you write the code" rather than "why you write the code".  Paying money
 >> for books is acceptable; this is all on the company's time/dime.
 >>
 >> Thanks,
 >> Rob

Cyrille Rossant's books may meet your needs. The IPython Interactive 
Computing and Visualization Cookbook offers more than just recipes. As the topics 
get advanced, he explains the whys in addition to the hows.  It may not 
have specific answers to parameter sweep experiments but I understood 
more about Python's internals and packages as they related to my work. 
It helped me to refine when to use Python and when to use other languages.

Currently US $5 via the publisher:
https://www.packtpub.com/books/info/authors/cyrille-rossant

(I have no affiliation with the author or publisher)

Mark Lawrence writes:
 > I don't understand your comment about tutorials.  Once they've given you
 > an introduction to the tool, isn't it your responsibility to manipulate
 > your data in the way that suits you?  If you can't do that, either
 > you're doing something wrong, or the tool is inadequate for the task.
 > For the latter I believe you've two options, find another tool or write
 > your own.

Without second-guessing the OP, I've found Python tutorials and 
documents to be helpful but not always complete in a way that beginners 
and casual users would need.  There is usually a package that will do 
some job but one first has to find it.  A lot of power can also be 
located deep within a hierarchy of dots: 
package.something.subsomething.subsubsomething ...

Some documentation sets are very complete, others aren't.  I often have 
the nagging feeling that if I just knew what question to ask and knew 
the right terminology, that I could benefit from code someone has 
already written and/or develop a smarter plan of attack.

Ravi Narasimhan
http://www.rettacs.org

#101147

From	Sameer Grover <sameer.grover.1@gmail.com>
Date	2016-01-01 23:25 +0530
Message-ID	<mailman.164.1451745122.11925.python-list@python.org>
In reply to	#101066

I also collect data by sweeping multiple parameters in a similar fashion. I
find pandas very convenient for analysis.
I don't use all the features of pandas. I mainly use it for selecting
certain rows from the data, sometimes using database-style merge
operations, and plotting using matplotlib. This can also be done using pure
numpy, but with pandas I don't have to keep track of all the indices.
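
A minimal sketch of row selection and a database-style merge, assuming
made-up measurement values and a hypothetical per-channel table
(`channel_info` and `nominal_gain` are illustrative names only):

```python
import pandas as pd

# Hypothetical measurements: one row per sample.
data = pd.DataFrame({
    "temperature": [25.0, 25.0, 50.0, 50.0],
    "channels": [0, 1, 0, 1],
    "voltage_applied": [1.0, 1.0, 1.0, 1.0],
    "voltage_measured": [1.01, 0.99, 1.03, 0.97],
})

# Hypothetical per-channel metadata, e.g. a nominal gain for each channel.
channel_info = pd.DataFrame({
    "channels": [0, 1],
    "nominal_gain": [1.0, 1.0],
})

# Row selection by column value: no manual index bookkeeping.
hot = data[data["temperature"] == 50.0]

# Database-style left join on the shared "channels" column.
merged = hot.merge(channel_info, on="channels", how="left")
merged["gain_error"] = (
    merged["voltage_measured"] / merged["voltage_applied"] - merged["nominal_gain"]
)
```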

This is what my workflow is like (warning - sloppy code):

import pandas as pd

# raw_array stands for some numpy array read from file, one row per sample
data = pd.DataFrame(raw_array)
data.columns = ['temperature', 'voltage_measured', 'voltage_applied',
                'channels']
results = []
for channel in data.channels.unique():
    for temperature in data.temperature.unique():
        # combine boolean masks with & (not `and`), each one in parentheses
        subset = data[(data['temperature'] == temperature) &
                      (data['channels'] == channel)]
        # fit_slope(x) -> fits x.voltage_measured against x.voltage_applied
        # and returns the slope
        slope = fit_slope(subset)
        # collect (channel, temperature, slope) for the final plot
        results.append((channel, temperature, slope))
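
The same per-group fit can also be written with `groupby` instead of nested
loops. This is only a sketch under stated assumptions: the data are
synthetic, the gain and drift numbers are made up, and `fit_slope` is
implemented here with `numpy.polyfit`, which may differ from the real helper.

```python
import numpy as np
import pandas as pd

# Synthetic data in the same column layout; channel c has gain 1 + 0.01*c
# plus a drift of 0.001 per degree (made-up numbers).
records = [
    (t, (1.0 + 0.01 * c + 0.001 * t) * v, v, c)
    for t in (0.0, 25.0, 50.0)
    for c in (0, 1)
    for v in np.linspace(-1.0, 1.0, 5)
]
data = pd.DataFrame(
    records,
    columns=["temperature", "voltage_measured", "voltage_applied", "channels"],
)


def fit_slope(group):
    """Slope of a degree-1 least-squares fit of measured vs. applied voltage."""
    slope, _intercept = np.polyfit(
        group["voltage_applied"], group["voltage_measured"], 1
    )
    return slope


# One fit per (channel, temperature) combination, without nested loops.
slopes = {
    key: fit_slope(group)
    for key, group in data.groupby(["channels", "temperature"])
}

# Reshape into a table: rows are temperatures, columns are channels.
table = pd.Series(slopes).unstack(level=0)
```

With matplotlib installed, `table.plot(marker="o")` should then draw one
slope-versus-temperature line per channel.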


I imagine your database driven approach would do something similar but you
might find pandas more convenient given that it can all be done in python
and that you won't have to resort to SQL queries.

My data is small enough to get away with storing as plain text. But HDF5 is
definitely a better solution.

In addition to PyTables, there is also h5py (http://www.h5py.org/). I
prefer the latter. You might like PyTables because it is more database-like.

Sameer
