
unit-profiling, similar to unit-testing

Started by	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
First post	2011-11-16 10:08 +0100
Last post	2011-11-17 21:00 -0500
Articles	7 — 4 participants


  unit-profiling, similar to unit-testing Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2011-11-16 10:08 +0100
    Re: unit-profiling, similar to unit-testing Roy Smith <roy@panix.com> - 2011-11-16 09:36 -0500
      Re: unit-profiling, similar to unit-testing Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2011-11-17 09:53 +0100
        Re: unit-profiling, similar to unit-testing Roy Smith <roy@panix.com> - 2011-11-17 09:03 -0500
          Re: unit-profiling, similar to unit-testing "spartan.the" <spartan.the@gmail.com> - 2011-11-17 13:28 -0800
      Re: unit-profiling, similar to unit-testing Tycho Andersen <tycho@tycho.ws> - 2011-11-17 14:45 -0600
        Re: unit-profiling, similar to unit-testing Roy Smith <roy@panix.com> - 2011-11-17 21:00 -0500

#15766 — unit-profiling, similar to unit-testing

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2011-11-16 10:08 +0100
Subject	unit-profiling, similar to unit-testing
Message-ID	<95bcp8-bft.ln1@satorlaser.homedns.org>

Hi!

I'm currently trying to establish a few tests here that evaluate certain 
performance characteristics of our systems. As part of this, I found 
that these tests are rather similar to unit-tests, only that they are 
much more fuzzy and obviously dependent on the systems involved, CPU 
load, network load, day of the week (Tuesday is virus scan day) etc.

What I'd just like to ask is how you do such things. Are there tools 
available that help? I was considering using the unit testing framework, 
but the problem with that is that the results are too hard to interpret 
programmatically and too easy to misinterpret manually. Any suggestions?

Cheers!

Uli


#15773

From	Roy Smith <roy@panix.com>
Date	2011-11-16 09:36 -0500
Message-ID	<roy-DBE11D.09364016112011@news.panix.com>
In reply to	#15766

In article <95bcp8-bft.ln1@satorlaser.homedns.org>,
 Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> wrote:

> Hi!
> 
> I'm currently trying to establish a few tests here that evaluate certain 
> performance characteristics of our systems. As part of this, I found 
> that these tests are rather similar to unit-tests, only that they are 
> much more fuzzy and obviously dependent on the systems involved, CPU 
> load, network load, day of the week (Tuesday is virus scan day) etc.
> 
> What I'd just like to ask is how you do such things. Are there tools 
> available that help? I was considering using the unit testing framework, 
> but the problem with that is that the results are too hard to interpret 
> programmatically and too easy to misinterpret manually. Any suggestions?

It's really, really, really hard to either control for, or accurately 
measure, things like CPU or network load.  There's so much stuff you 
can't even begin to see.  The state of your main memory cache.  Disk 
fragmentation.  What I/O is happening directly out of kernel buffers vs 
having to do a physical disk read.  How slow your DNS server is today.

What I suggest is instrumenting your unit test suite to record not just 
the pass/fail status of every test, but also the test duration.  Stick 
these into a database as the tests run.  Over time, you will accumulate 
a whole lot of performance data, which you can then start to mine.

While you're running the tests, gather as much system performance data 
as you can (output of top, vmstat, etc) and stick that into your 
database too.  You never know when you'll want to refer to the data, so 
just collect it all and save it forever.
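
A minimal sketch of the "record every test's duration in a database" idea, assuming Python 3's standard unittest and sqlite3 modules. The table layout, the TimedResult class and the two example tests are hypothetical names chosen for illustration; collecting top/vmstat output is not shown.

```python
import sqlite3
import time
import unittest


class TimedResult(unittest.TestResult):
    """TestResult that stores outcome and wall-clock duration of each test."""

    def __init__(self, db, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS runs "
            "(test TEXT, started REAL, seconds REAL, outcome TEXT)"
        )

    def startTest(self, test):
        super().startTest(test)
        self._started = time.time()      # timestamp for later trend analysis
        self._t0 = time.perf_counter()   # monotonic clock for the duration
        self._status = "pass"

    def addFailure(self, test, err):
        super().addFailure(test, err)
        self._status = "fail"

    def addError(self, test, err):
        super().addError(test, err)
        self._status = "error"

    def stopTest(self, test):
        seconds = time.perf_counter() - self._t0
        self.db.execute(
            "INSERT INTO runs VALUES (?, ?, ?, ?)",
            (test.id(), self._started, seconds, self._status),
        )
        self.db.commit()
        super().stopTest(test)


def example_suite():
    """Build a tiny hypothetical suite: one passing and one failing test."""

    class Example(unittest.TestCase):
        def test_fast(self):
            self.assertEqual(sum(range(10)), 45)

        def test_broken(self):
            self.assertEqual(1, 2)

    return unittest.defaultTestLoader.loadTestsFromTestCase(Example)


def record_run(suite, db):
    """Run the suite and return all (test, seconds, outcome) rows so far."""
    suite.run(TimedResult(db))
    query = "SELECT test, seconds, outcome FROM runs ORDER BY test"
    return db.execute(query).fetchall()
```

Calling record_run(example_suite(), sqlite3.connect("perf.db")) on every test run would accumulate one row per test per run, which is the raw material for the later analysis.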


#15812

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2011-11-17 09:53 +0100
Message-ID	<kkuep8-nqd.ln1@satorlaser.homedns.org>
In reply to	#15773

On 16.11.2011 15:36, Roy Smith wrote:
> It's really, really, really hard to either control for, or accurately
> measure, things like CPU or network load.  There's so much stuff you
> can't even begin to see.  The state of your main memory cache.  Disk
> fragmentation.  What I/O is happening directly out of kernel buffers vs
> having to do a physical disk read.  How slow your DNS server is today.

Fortunately, I am in a position where I'm running tests on one system 
(generic desktop PC) while the system to test is another one, and there 
both hardware and software are under my control. Since this is rather 
smallish and embedded, the power and load of the desktop don't play a 
significant role, the other side is usually the bottleneck. ;)


> What I suggest is instrumenting your unit test suite to record not just
> the pass/fail status of every test, but also the test duration.  Stick
> these into a database as the tests run.  Over time, you will accumulate
> a whole lot of performance data, which you can then start to mine.

I'm not sure. I see unit tests as something that makes sure things run 
correctly. For performance testing, I have functions to set up and tear 
down the environment. Then, I found it useful to have separate code to 
prime a cache, which is something done before each test run, but which 
is not part of the test run itself. I'm repeating each test run N times, 
recording the times and calculating maximum, minimum, average and 
standard deviation. Some of this is similar to unit testing (code to set 
up/tear down), but other things are too different. Also, sometimes I can 
vary tests with a factor F, then I would also want to capture the 
influence of this factor. I would even wonder if you can't verify the 
behaviour against an expected Big-O complexity somehow.

All of this is rather general, not specific to my use case, hence my 
question if there are existing frameworks to facilitate this task. Maybe 
it's time to create one...
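
As a rough sketch of what such a harness might look like, assuming Python 3 and only the standard library: setup and teardown run once, the cache-priming step runs before each repetition but is not timed, and the run is repeated N times. The names benchmark and sweep are made up for this example and do not refer to an existing framework.

```python
import statistics
import time


def benchmark(run, repeat=5, setup=None, prime=None, teardown=None):
    """Time run() `repeat` times; return min, max, mean and stdev in seconds."""
    if setup:
        setup()
    try:
        samples = []
        for _ in range(repeat):
            if prime:
                prime()  # e.g. warm a cache; deliberately outside the timing
            start = time.perf_counter()
            run()
            samples.append(time.perf_counter() - start)
    finally:
        if teardown:
            teardown()  # always restore the environment, even on errors
    return {
        "n": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


def sweep(make_run, factors, **kwargs):
    """Repeat the benchmark for each value of a factor F."""
    return {f: benchmark(make_run(f), **kwargs) for f in factors}
```

For example, sweep(lambda n: (lambda: sorted(range(n, 0, -1))), [1000, 10000]) would give one set of statistics per input size, which captures the influence of the factor.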


> While you're running the tests, gather as much system performance data
> as you can (output of top, vmstat, etc) and stick that into your
> database too.  You never know when you'll want to refer to the data, so
> just collect it all and save it forever.

Yes, this is surely something that is necessary, in particular since 
there are no clear success/failure outputs like for unit tests and they 
require a human to interpret them.


Cheers!

Uli


#15818

From	Roy Smith <roy@panix.com>
Date	2011-11-17 09:03 -0500
Message-ID	<roy-56C820.09031517112011@news.panix.com>
In reply to	#15812

In article <kkuep8-nqd.ln1@satorlaser.homedns.org>,
 Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> wrote:

> Yes, this is surely something that is necessary, in particular since 
> there are no clear success/failure outputs like for unit tests and they 
> require a human to interpret them.

As much as possible, you want to automate things so no human 
intervention is required.

For example, let's say you have a test which calls foo() and times how 
long it takes.  You've already mentioned that you run it N times and 
compute some basic (min, max, avg, sd) stats.  So far, so good.

The next step is to do some kind of regression against past results.  
Once you've got a bunch of historical data, it should be possible to 
look at today's numbers and detect any significant change in performance.

Much as I loathe the bureaucracy and religious fervor which has grown up 
around Six Sigma, it does have some good tools.  You might want to look 
into control charts (http://en.wikipedia.org/wiki/Control_chart).  You 
think you've got the test environment under control, do you?  Try 
plotting a month's worth of run times for a particular test on a control 
chart and see what it shows.
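
A small sketch of the arithmetic behind one common kind of control chart, the individuals (XmR) chart, assuming a list of historical run times as the baseline. The process spread is estimated from the average moving range divided by the standard constant 1.128, and the limits sit three such sigmas either side of the mean. The function names are invented for this example, and it only computes the limits; it does not draw the chart.

```python
import statistics

D2 = 1.128  # standard bias-correction constant for moving ranges of size 2


def control_limits(baseline):
    """Return (lower, center, upper) limits from historical run times."""
    center = statistics.mean(baseline)
    # Differences between consecutive measurements estimate short-term noise.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    sigma = statistics.mean(moving_ranges) / D2
    return center - 3 * sigma, center, center + 3 * sigma


def out_of_control(baseline, new_samples):
    """Return (index, value) for each new sample outside the control limits."""
    lower, _, upper = control_limits(baseline)
    return [(i, x) for i, x in enumerate(new_samples) if not lower <= x <= upper]
```

Points flagged by out_of_control are the ones worth investigating first; a baseline that itself contains many such points suggests the environment is less controlled than assumed.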

Assuming your process really is under control, I would write scripts 
that did the following kinds of analysis:

1) For a given test, do a linear regression of run time vs date.  If the 
line has any significant positive slope, you want to investigate why.

2) You already mentioned, "I would even wonder if you can't verify the 
behaviour against an expected Big-O complexity somehow".  Of course you 
can.  Run your test a bunch of times with different input sizes.  I 
would try something like a 1-2-5 progression over several decades (i.e. 
input sizes of 10, 20, 50, 100, 200, 500, 1000, etc.).  You will have to 
figure out what an appropriate range is, and how to generate useful 
input sets.  Now, curve fit your performance numbers to various shape 
curves and see what correlation coefficient you get.
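
Both kinds of analysis can be sketched with one plain least-squares helper, assuming Python 3 and no third-party libraries. For (1), fit run time against a day index and look at the slope. For (2), transform the input size by each candidate growth function and see which one correlates best with the measured times. The candidate list and the function names are assumptions made for this example; with noisy data, neighbouring shapes such as O(n) and O(n log n) can be hard to tell apart, and judging whether a slope is "significant" needs a proper statistical test that is not shown here.

```python
import math


def linear_fit(xs, ys):
    """Least-squares line through the points: (slope, intercept, Pearson r)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy) if syy else 0.0
    return slope, my - slope * mx, r


# Candidate growth functions to compare against (an illustrative selection).
MODELS = {
    "O(log n)": math.log,
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log(n),
    "O(n^2)": lambda n: n * n,
}


def best_complexity(sizes, times):
    """Return the best-correlating model name and all correlation scores."""
    scores = {}
    for name, f in MODELS.items():
        _, _, r = linear_fit([f(n) for n in sizes], times)
        scores[name] = r
    return max(scores, key=scores.get), scores
```

Usage would look like best_complexity([10, 20, 50, 100, 200, 500, 1000], measured_times), where measured_times holds one timing per input size.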

All that being said, in my experience, nothing beats plotting your data 
and looking at it.


#15836

From	"spartan.the" <spartan.the@gmail.com>
Date	2011-11-17 13:28 -0800
Message-ID	<fa20036c-aeda-4306-90cb-d30283f10fb9@k10g2000yqn.googlegroups.com>
In reply to	#15818

On Nov 17, 4:03 pm, Roy Smith <r...@panix.com> wrote:
> In article <kkuep8-nqd....@satorlaser.homedns.org>,
>  Ulrich Eckhardt <ulrich.eckha...@dominolaser.com> wrote:
>
> > Yes, this is surely something that is necessary, in particular since
> > there are no clear success/failure outputs like for unit tests and they
> > require a human to interpret them.
>
> As much as possible, you want to automate things so no human
> intervention is required.
>
> For example, let's say you have a test which calls foo() and times how
> long it takes.  You've already mentioned that you run it N times and
> compute some basic (min, max, avg, sd) stats.  So far, so good.
>
> The next step is to do some kind of regression against past results.
> Once you've got a bunch of historical data, it should be possible to
> look at today's numbers and detect any significant change in performance.
>
> Much as I loathe the bureaucracy and religious fervor which has grown up
> around Six Sigma, it does have some good tools.  You might want to look
> into control charts (http://en.wikipedia.org/wiki/Control_chart).  You
> think you've got the test environment under control, do you?  Try
> plotting a month's worth of run times for a particular test on a control
> chart and see what it shows.
>
> Assuming your process really is under control, I would write scripts
> that did the following kinds of analysis:
>
> 1) For a given test, do a linear regression of run time vs date.  If the
> line has any significant positive slope, you want to investigate why.
>
> 2) You already mentioned, "I would even wonder if you can't verify the
> behaviour against an expected Big-O complexity somehow".  Of course you
> can.  Run your test a bunch of times with different input sizes.  I
> would try something like a 1-2-5 progression over several decades (i.e.
> input sizes of 10, 20, 50, 100, 200, 500, 1000, etc.).  You will have to
> figure out what an appropriate range is, and how to generate useful
> input sets.  Now, curve fit your performance numbers to various shape
> curves and see what correlation coefficient you get.
>
> All that being said, in my experience, nothing beats plotting your data
> and looking at it.

I strongly agree with Roy, here.

Ulrich, I recommend you explore how Google measures App Engine's
health here: http://code.google.com/status/appengine.

Unit tests are inappropriate here; any single unit test can only answer
PASS or FAIL, YES or NO. It can't answer the question "how much".
If you are set on using unit tests anyway, then none of these arguments
apply.

I suggest:

1. Decide what you want to measure. Each measurement must be a number
in a known range (e.g. 0..100 or -5..5), so you can plot it.
2. Write no-UI programs that obtain each number (measurement) and write
it to CSV. Run each of them several times, discard the single worst and
single best result, and average the rest.
3. Collect the data for some period of time.
4. Plot those averages over a time axis (it's easy with CSV
format).
5. Make sure you automate this process (batch files or so) so the plot
is generated automatically each hour or each day.
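
Steps 2 and 3 of this list could be sketched as follows, assuming Python 3 and its standard csv module. The file layout (timestamp, name, value) and the function names are hypothetical; plotting and scheduling are left to whatever tools are at hand.

```python
import csv
import time
from datetime import datetime, timezone


def trimmed_mean(samples):
    """Mean after discarding the single best and single worst sample."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to drop best and worst")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


def measure(func, runs=5):
    """Time func() `runs` times and return the trimmed mean in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return trimmed_mean(samples)


def append_measurement(path, name, value):
    """Append one timestamped row to a CSV file for later plotting."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", newline="") as fh:
        csv.writer(fh).writerow([stamp, name, f"{value:.6f}"])
```

A scheduled job could then call append_measurement("perf.csv", "foo", measure(foo)) every hour or day, where foo stands for the operation being measured.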

And then after a month you can decide if you want to divide your
number ranges into green-yellow-red zones. More often than not you may
find that your measures are so inaccurate and random that you can't
trust them. Then you'll either forget about it or dive into math
(statistics). You have about a 5% chance of succeeding ;)


#15833

From	Tycho Andersen <tycho@tycho.ws>
Date	2011-11-17 14:45 -0600
Message-ID	<mailman.2810.1321562763.27778.python-list@python.org>
In reply to	#15773

On Wed, Nov 16, 2011 at 09:36:40AM -0500, Roy Smith wrote:
> In article <95bcp8-bft.ln1@satorlaser.homedns.org>,
>  Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> wrote:
> 
> > Hi!
> > 
> > I'm currently trying to establish a few tests here that evaluate certain 
> > performance characteristics of our systems. As part of this, I found 
> > that these tests are rather similar to unit-tests, only that they are 
> > much more fuzzy and obviously dependent on the systems involved, CPU 
> > load, network load, day of the week (Tuesday is virus scan day) etc.
> > 
> > What I'd just like to ask is how you do such things. Are there tools 
> > available that help? I was considering using the unit testing framework, 
> > but the problem with that is that the results are too hard to interpret 
> > programmatically and too easy to misinterpret manually. Any suggestions?
> 
> It's really, really, really hard to either control for, or accurately 
> measure, things like CPU or network load.  There's so much stuff you 
> can't even begin to see.  The state of your main memory cache.  Disk 
> fragmentation.  What I/O is happening directly out of kernel buffers vs 
> having to do a physical disk read.  How slow your DNS server is today.

While I agree there's a lot of things you can't control for, you can
get a more accurate picture by using CPU time instead of wall time
(e.g. the clock() library function). If what you care about is mostly CPU
time, you can control for the "your disk is fragmented", "your DNS
server died", or "my cow-orker was banging on the test machine" this
way.
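
To illustrate the difference, here is a short sketch assuming Python 3.3 or later, where time.process_time() reports CPU time of the current process and time.perf_counter() reports elapsed wall time. (The older time.clock() was removed in Python 3.8.) The helper names timed, busy and waiting are made up for this example.

```python
import time


def timed(func):
    """Return (wall_seconds, cpu_seconds) spent in one call to func()."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    func()
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    return wall, cpu


def busy():
    """CPU-bound work: wall time and CPU time should be close."""
    sum(i * i for i in range(200_000))


def waiting():
    """Blocked work: wall time passes while almost no CPU time is used."""
    time.sleep(0.2)
```

Comparing timed(busy) with timed(waiting) shows the effect: sleeping or waiting on I/O adds to the wall time but barely registers as CPU time.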



#15846

From	Roy Smith <roy@panix.com>
Date	2011-11-17 21:00 -0500
Message-ID	<roy-5F44F7.21000017112011@news.panix.com>
In reply to	#15833

In article <mailman.2810.1321562763.27778.python-list@python.org>,
 Tycho Andersen <tycho@tycho.ws> wrote:

> While I agree there's a lot of things you can't control for, you can
> get a more accurate picture by using CPU time instead of wall time
> (e.g. the clock() library function). If what you care about is mostly CPU
> time [...]

That's a big if.  In some cases, CPU time is important, but more often, 
wall-clock time is more critical.  Let's say I've got two versions of a 
program.  Here's some results for my test run:

Version     CPU Time     Wall-Clock Time
   1         2 hours       2.5 hours
   2         1.5 hours     5.0 hours

Between versions, I reduced the CPU time to complete the given task, but 
increased the wall clock time.  Perhaps I doubled the size of some hash 
table.  Now I get a lot fewer hash collisions (so I spend less CPU time 
re-hashing), but my memory usage went up so I'm paging a lot and my 
locality of reference went down so my main memory cache hit rate is 
worse.

Which is better?  I think most people would say version 1 is better.

CPU time is only important in a situation where the system is CPU bound.  
In many real-life cases, that's not at all true.  Things can be memory 
bound.  Or I/O bound (which, when you consider paging, is often the same 
thing as memory bound).  Or lock-contention bound.

Before you start measuring things, it's usually a good idea to know 
what you want to measure, and why :-)
