| Started by | John Ladasky <ladasky@my-deja.com> |
|---|---|
| First post | 2011-05-21 20:58 -0700 |
| Last post | 2011-05-23 16:46 -0400 |
| Articles | 6 — 3 participants |
Multiprocessing: don't push the pedal to the metal? John Ladasky <ladasky@my-deja.com> - 2011-05-21 20:58 -0700
Re: Multiprocessing: don't push the pedal to the metal? John Ladasky <ladasky@my-deja.com> - 2011-05-22 14:06 -0700
Re: Multiprocessing: don't push the pedal to the metal? Chris Angelico <rosuav@gmail.com> - 2011-05-23 10:32 +1000
Re: Multiprocessing: don't push the pedal to the metal? Adam Tauno Williams <awilliam@whitemice.org> - 2011-05-23 05:50 -0400
Re: Multiprocessing: don't push the pedal to the metal? John Ladasky <ladasky@my-deja.com> - 2011-05-23 12:51 -0700
Re: Multiprocessing: don't push the pedal to the metal? Adam Tauno Williams <awilliam@whitemice.org> - 2011-05-23 16:46 -0400
| From | John Ladasky <ladasky@my-deja.com> |
|---|---|
| Date | 2011-05-21 20:58 -0700 |
| Subject | Multiprocessing: don't push the pedal to the metal? |
| Message-ID | <644e4768-0fee-4c40-ba59-4b777b883884@z13g2000prk.googlegroups.com> |
Hello again, everyone.

I'm developing some custom neural network code. I'm using Python 2.6, Numpy 1.5, and Ubuntu Linux 10.10. I have an AMD 1090T six-core CPU. About six weeks ago, I asked some questions about multiprocessing in Python, and I got some very helpful responses from you all.

http://groups.google.com/group/comp.lang.python/browse_frm/thread/374e1890efbcc87b

Now I'm back with a new question. I have gotten comfortable with cProfile, and with multiprocessing's various Queues (I've graduated from Pool). I just ran some extensive tests of my newest code, and I've learned some surprising things. I have a pretty picture here (be sure to view the full-size image):

http://www.flickr.com/photos/15579975@N00/5744093219

I'll quickly ask my question first, to avoid a TL;DR problem: when you have a multi-core CPU with N cores, is it common to see the performance peak at N-1, or even N-2 processes? And so, should you avoid using quite as many processes as there are cores? I was expecting diminishing returns for each additional core -- but not outright declines. That's what I think my data shows for many of my trial runs.

I've tried running this test twice. Once, I was reading a few PDFs and web pages while my speed test was running. But even when I wasn't using the computer for these other (light) tasks, I saw the same performance drops. Perhaps this is due to the OS overhead? The load average on my system monitor looks pretty quiet to me when I'm not running my program.

OK, if you care to read further, here's some more detail...
My graphs show the execution times of my neural network evaluation routine as a function of:

- the size of my neural network (six sizes were tried -- with varying numbers of inputs, outputs and hidden nodes),
- the subprocess configuration (either not using a subprocess, or using 1-6 subprocesses), and
- the size of the input data vector (from 7 to 896 sets of inputs -- I'll explain the rationale for the exact numbers I chose if anyone cares to know).

Each graph is normalized to the execution time that running the evaluation routine takes on a single CPU, without invoking a subprocess. Obviously, I'm looking for the conditions which yield performance gains above that baseline. (I'll be running this particular piece of code millions of times!)

I tried 200 repetitions for each combination of network size, input data size, and number of CPU cores. Even so, there was substantial irregularity in the timing graphs. So, rather than connecting the dots directly, which would lead to some messy crossing lines which are a bit hard to read, I fit B-spline curves to the data.

As I anticipated, there is a performance penalty that is incurred just for parceling out the data to the multiple processes and collating the results at the end. When the data set is small, it's faster to send it to a single CPU, without invoking a subprocess. In fact, dividing a small task among 3 processes can underperform a two-process approach, and so on! See the leftmost two panels in the top row, and the rightmost two panels in the bottom row.

When the networks increase in complexity, the size of the data set for which break-even performance is achieved drops accordingly. I'm more concerned about optimizing these bigger problems, obviously, because they take the longest to run.

What I did not anticipate was finding that performance reversal with added computing power for large data sets. Comments are appreciated!
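For readers who want to reproduce this kind of measurement, here is a minimal sketch of the procedure described above: parcel the input out to N workers, collate the results, and express each timing relative to the single-process baseline. It is not the poster's code. It assumes Python 3 and uses `Pool` rather than the Queues mentioned in the post; `evaluate`, `split_evenly`, `time_serial`, `time_parallel`, and `normalize` are hypothetical names, and `evaluate` is only a stand-in for the real network evaluation.

```python
import time
from multiprocessing import Pool
from statistics import median


def evaluate(chunk):
    # Hypothetical stand-in for the neural network evaluation routine.
    return [sum(x * x for x in row) for row in chunk]


def split_evenly(rows, n_parts):
    # Parcel the rows into n_parts contiguous chunks of near-equal size.
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:stop])
        start = stop
    return chunks


def time_serial(rows, repeats=5):
    # Baseline: evaluate in the calling process, with no subprocess.
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        evaluate(rows)
        samples.append(time.perf_counter() - t0)
    return median(samples)


def time_parallel(rows, n_workers, repeats=5):
    # Time parcel-out, evaluation, and collation with n_workers processes.
    samples = []
    with Pool(n_workers) as pool:
        for _ in range(repeats):
            t0 = time.perf_counter()
            parts = pool.map(evaluate, split_evenly(rows, n_workers))
            collated = [y for part in parts for y in part]
            samples.append(time.perf_counter() - t0)
    return median(samples)


def normalize(baseline, timings):
    # Express each timing as a fraction of the single-process baseline;
    # values below 1.0 mean the parallel run beat the baseline.
    return {n: t / baseline for n, t in timings.items()}
```

Calling `time_parallel(rows, n)` for n from 1 to 6 under an `if __name__ == "__main__":` guard and passing the results to `normalize` gives one curve of the kind plotted in the graphs. The median is used here only because it is less sensitive to stray slow runs; the post does not say how the 200 repetitions were summarized.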
| From | John Ladasky <ladasky@my-deja.com> |
|---|---|
| Date | 2011-05-22 14:06 -0700 |
| Message-ID | <0211797f-e130-4bfa-bcb0-f701ec33c7b9@17g2000prr.googlegroups.com> |
| In reply to | #5952 |
Following up to my own post... Flickr informs me that quite a few of you have been looking at my graphs of performance vs. the number of sub-processes employed in a parallelizable task:

On May 21, 8:58 pm, John Ladasky <lada...@my-deja.com> wrote:
> http://www.flickr.com/photos/15579975@N00/5744093219
[...]
> I'll quickly ask my question first, to avoid a TL;DR problem: when you
> have a multi-core CPU with N cores, is it common to see the
> performance peak at N-1, or even N-2 processes? And so, should you
> avoid using quite as many processes as there are cores? I was
> expecting diminishing returns for each additional core -- but not
> outright declines.

But no one has offered any insight yet? Well, I slept on it, and I had a thought. Please feel free to shoot it down.

If I spawn N worker sub-processes, my application in fact has N+1 processes in all, because there's also the master process itself. If the master process has anything significant to do (and mine does, and I would surmise that many multi-core applications would be that way), then the master process may sometimes find itself competing for time on a CPU core with a worker sub-process. This could impact performance even when the demands from the operating system and/or other applications are modest.

I'd still appreciate hearing from anyone else who has more experience with multiprocessing. If there are general rules about how to do this best, I haven't seen them posted anywhere. This may not be a Python-specific issue, of course.

Tag, you're it!
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-05-23 10:32 +1000 |
| Message-ID | <mailman.1944.1306110755.9059.python-list@python.org> |
| In reply to | #6006 |
On Mon, May 23, 2011 at 7:06 AM, John Ladasky <ladasky@my-deja.com> wrote:
> If I spawn N worker sub-processes, my application in fact has N+1
> processes in all, because there's also the master process itself.

This would definitely be correct. How much impact the master process has depends on how much it's doing.

> I'd still appreciate hearing from anyone else who has more experience
> with multiprocessing. If there are general rules about how to do this
> best, I haven't seen them posted anywhere. This may not be a Python-
> specific issue, of course.

I don't have much experience with Python's multiprocessing model, but I've done concurrent programming on a variety of platforms, and there are some common issues.

Each CPU (or core) has its own execution cache. If you can keep one thread running on the same core all the time, it will benefit more from that cache than if it has to keep flitting from one to another.

You undoubtedly will have other processes in the system, too. As well as your master, there'll be processes over which you have no control (unless you're on a bare-bones system). Some of them may preempt your processes.

Leaving one CPU/core available for "everything else" may allow the OS to keep each thread on its own core. Having as many workers as cores means that every time there's something else to do, one of your workers has to be kicked off its nice warm CPU and sent out into the cold for a while. If all your workers are at the same priority, it will then grab a timeslice off one of the other cores, kicking its incumbent off... rinse and repeat.

This is a tradeoff, though. If the rest of your system is going to use 0.01 of a core, then 1% thrashing is worth having one more core available 99% of the time. If the numbers are reversed, it's equally obvious that you should leave one core available. In your case, it's probably turning out that the contention causes more overhead than the extra worker is worth.

That's just some general concepts, without an in-depth analysis of your code and your entire system. It's probably easier to analyse by results rather than inspection.

Chris Angelico
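The "leave a core free" idea can be written down as a small helper. This is a minimal sketch, assuming Python 3; `worker_count` and its `reserved` parameter are hypothetical names, not part of `multiprocessing`. How many cores to reserve is exactly the tradeoff described above, so the default of 1 is a starting point to measure against, not a rule.

```python
import os


def worker_count(total_cores=None, reserved=1):
    # reserved: cores left free for the master process and for
    # everything else running on the system.
    if total_cores is None:
        total_cores = os.cpu_count() or 1  # cpu_count() may return None
    # Always keep at least one worker, even on a single-core machine.
    return max(1, total_cores - reserved)
```

On the six-core machine in this thread, `worker_count(6)` gives 5 and `worker_count(6, reserved=2)` gives 4, matching the N-1 and N-2 peaks the original post asks about.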
| From | Adam Tauno Williams <awilliam@whitemice.org> |
|---|---|
| Date | 2011-05-23 05:50 -0400 |
| Message-ID | <mailman.1967.1306145336.9059.python-list@python.org> |
| In reply to | #6006 |
On Mon, 2011-05-23 at 10:32 +1000, Chris Angelico wrote:
> On Mon, May 23, 2011 at 7:06 AM, John Ladasky <ladasky@my-deja.com> wrote:
> > If I spawn N worker sub-processes, my application in fact has N+1
> > processes in all, because there's also the master process itself.
> > I'd still appreciate hearing from anyone else who has more experience
> > with multiprocessing. If there are general rules about how to do this
> > best, I haven't seen them posted anywhere. This may not be a Python-
> > specific issue, of course.
> I don't have much experience with Python's multiprocessing model, but
> I've done concurrent programming on a variety of platforms, and there
> are some common issues.

I develop an app that uses multiprocessing heavily. Remember that all these processes are processes - so you can use all the OS facilities regarding processes on them. This includes setting nice values, scheduler options, CPU pinning, etc...

> Each CPU (or core) has its own execution cache. If you can keep one
> thread running on the same core all the time, it will benefit more
> from that cache than if it has to keep flitting from one to another.

+1

> You undoubtedly will have other processes in the system, too. As well
> as your master, there'll be processes over which you have no control
> (unless you're on a bare-bones system). Some of them may preempt your
> processes.

This is very true. You get a benefit from dividing work up to the correct number of processes - but too many processes will quickly take back all the benefit. One good trick is to have the parent monitor the load average and only spawn additional workers when that value is below a certain value.
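The load-average trick might look roughly like the following. This is a sketch under assumptions, not the poster's workflow engine: `can_start_worker` and `workers_to_start` are hypothetical names, the threshold is something you would tune for your own machine, and `os.getloadavg()` is available only on Unix-like systems.

```python
import os


def can_start_worker(threshold, load=None):
    # load: 1-minute load average; read from the OS when not supplied.
    if load is None:
        load = os.getloadavg()[0]  # Unix only
    return load < threshold


def workers_to_start(pending, running, max_workers, threshold, load=None):
    # Start at most one extra worker per check, and only when there is
    # queued work, spare capacity, and a quiet system.
    if pending == 0 or running >= max_workers:
        return 0
    return 1 if can_start_worker(threshold, load) else 0
```

The parent would call `workers_to_start` periodically and leave the remaining work queued whenever it returns 0. Starting one worker per check is a deliberate choice in this sketch: the load average lags behind reality, so spawning several workers at once could overshoot before the number catches up.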
| From | John Ladasky <ladasky@my-deja.com> |
|---|---|
| Date | 2011-05-23 12:51 -0700 |
| Message-ID | <84e77cd9-9003-4053-a231-fe7d4bdbfef2@k15g2000pri.googlegroups.com> |
| In reply to | #6057 |
On May 23, 2:50 am, Adam Tauno Williams <awill...@whitemice.org> wrote:
> I develop an app that uses multiprocessing heavily. Remember that all
> these processes are processes - so you can use all the OS facilities
> regarding processes on them. This includes setting nice values,
> scheduler options, CPU pinning, etc...

That's interesting. Does code exist in the Python library which allows the adjustment of CPU pinning and nice levels? I just had another look at the multiprocessing docs, and also at subprocess. I didn't see anything that pertains to these issues.

> > Each CPU (or core) has its own execution cache. If you can keep one
> > thread running on the same core all the time, it will benefit more
> > from that cache than if it has to keep flitting from one to another.
>
> +1
| From | Adam Tauno Williams <awilliam@whitemice.org> |
|---|---|
| Date | 2011-05-23 16:46 -0400 |
| Message-ID | <mailman.1987.1306183679.9059.python-list@python.org> |
| In reply to | #6091 |
On Mon, 2011-05-23 at 12:51 -0700, John Ladasky wrote:
> On May 23, 2:50 am, Adam Tauno Williams <awill...@whitemice.org>
> wrote:
> > I develop an app that uses multiprocessing heavily. Remember that all
> > these processes are processes - so you can use all the OS facilities
> > regarding processes on them. This includes setting nice values,
> > scheduler options, CPU pinning, etc...
> That's interesting. Does code exist in the Python library which
> allows the adjustment of CPU pinning and nice levels? I just had
> another look at the multiprocessing docs, and also at subprocess.
> I didn't see anything that pertains to these issues.
"in the Python library" - no. All these types of behaviors are platform
specific.
For example you can set the "nice" (priority) of a UNIX/LINUX process
using the nice method from the os module. Our workflow engine does this
on all worker processes it starts - it sends the workers to the lowest
priority.
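from os import nice as os_priority
...
try:
    # nice() adds its argument to the current niceness; Linux caps
    # the result at 19, the lowest priority.
    os_priority(20)
except Exception as e:
    ...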
I'm not aware of a tidy way to call sched_setaffinity from Python; but
my own testing indicates that the LINUX kernel is very good at figuring
this out on its own so long as it isn't swamped. Queuing, rather than
starting, additional workflows if load average exceeds X.Y and setting
the process priority of workers to very-low seems to work very well.
There is <http://pypi.python.org/pypi/affinity> for setting affinity,
but I haven't used it.
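For readers on newer interpreters: Python 3.3 and later expose the affinity calls directly as `os.sched_setaffinity` and `os.sched_getaffinity`, on platforms that support them (Linux does). The sketch below assumes such an interpreter; it did not exist for the Python 2.6 discussed in this thread. `pin_to_core` is a hypothetical helper name, and whether pinning beats the kernel's own placement is something to measure, as noted above.

```python
import os


def pin_to_core(core, pid=0):
    # pid 0 means "the calling process".
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity calls are not available on this platform
    # Restrict the process to a single core, then report the new mask.
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)
```

A worker could call `pin_to_core(n)` once at startup, with `n` chosen from the set returned by `os.sched_getaffinity(0)`. Asking for a core outside that set raises `OSError`.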