
Traceback when using multiprocessing, less than helpful?

Started by	John Ladasky <john_ladasky@sbcglobal.net>
First post	2013-11-21 09:01 -0800
Last post	2013-11-22 09:09 +0000
Articles	13 — 5 participants


  Traceback when using multiprocessing, less than helpful? John Ladasky <john_ladasky@sbcglobal.net> - 2013-11-21 09:01 -0800
    Re: Traceback when using multiprocessing, less than helpful? Chris Angelico <rosuav@gmail.com> - 2013-11-22 04:24 +1100
      Re: Traceback when using multiprocessing, less than helpful? John Ladasky <john_ladasky@sbcglobal.net> - 2013-11-21 10:25 -0800
        Re: Traceback when using multiprocessing, less than helpful? Chris Angelico <rosuav@gmail.com> - 2013-11-22 07:53 +1100
          Re: Traceback when using multiprocessing, less than helpful? John Ladasky <john_ladasky@sbcglobal.net> - 2013-11-21 13:19 -0800
            Re: Traceback when using multiprocessing, less than helpful? John Ladasky <john_ladasky@sbcglobal.net> - 2013-11-21 13:49 -0800
              Re: Traceback when using multiprocessing, less than helpful? Ethan Furman <ethan@stoneleaf.us> - 2013-11-21 14:32 -0800
    Re: Traceback when using multiprocessing, less than helpful? Terry Reedy <tjreedy@udel.edu> - 2013-11-21 17:37 -0500
    Re: Traceback when using multiprocessing, less than helpful? John Ladasky <john_ladasky@sbcglobal.net> - 2013-11-21 19:57 -0800
      Re: Traceback when using multiprocessing, less than helpful? Chris Angelico <rosuav@gmail.com> - 2013-11-22 15:24 +1100
        Why pickling (was: Traceback when using multiprocessing) John Ladasky <john_ladasky@sbcglobal.net> - 2013-11-22 08:38 -0800
          Re: Why pickling (was: Traceback when using multiprocessing) Chris Angelico <rosuav@gmail.com> - 2013-11-23 10:50 +1100
      Re: Traceback when using multiprocessing, less than helpful? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-22 09:09 +0000

#60169 — Traceback when using multiprocessing, less than helpful?

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2013-11-21 09:01 -0800
Subject	Traceback when using multiprocessing, less than helpful?
Message-ID	<e92311cb-6cc5-415a-bbf8-544c0c9c6a54@googlegroups.com>

Hi folks,

Somewhat over a year ago, I struggled with implementing a routine using multiprocessing.Pool and numpy. I eventually succeeded, but I remember finding it very hard to debug. Now I have managed to provoke an error from that routine again, and once again, I'm struggling.

Here is the end of the traceback, starting with the last line of my code: "result = pool.map(evaluate, bundles)". After that, I'm into Python itself.

  File ".../evaluate.py", line 81, in evaluate
    result = pool.map(evaluate, bundles)
  File "/usr/lib/python3.3/multiprocessing/pool.py", line 228, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
    raise self._value
ValueError: operands could not be broadcast together with shapes (1,3) (4)

Notice that no line of numpy appears in the traceback? Still, there are three things that make me think that this error is coming from numpy.

1. "raise self._value" means that an exception is stored in a variable, to be re-raised.

2. The words "operands" and "broadcast" do not appear anywhere in the source code of multiprocessing.pool.

3. The words "operands" and "broadcast" are common to numpy errors I have seen before. Numpy does many very tricky things when dealing with arrays of different dimensions and shapes.

Of course, I am sure that the bug must be in my own code. I even have old programs which are using my evaluate.evaluate() without generating errors. I am comparing the data structures that my working and my non-working programs send to pool.map(). I am comparing the code between my two programs. There is some subtle difference that I haven't spotted.

If I could only see the line of numpy code which is generating the ValueError, I would have a better chance of spotting the bug in my code. So, WHY isn't there any reference to numpy in my traceback?

Here's my theory. The numpy error was generated in a subprocess. The line "raise self._value" is intercepting the exception generated by my subprocess, and passing it back to the master Python interpreter.

Does re-raising an exception, and/or passing an exception from a subprocess, truncate a traceback? That's what I think I'm seeing.

Thanks for any advice!
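
The theory above can be probed without starting any worker processes, because results travel between processes as pickles. What follows is only a minimal sketch under that assumption: `failing_task` and `capture` are hypothetical names, and a plain `pickle` round trip stands in for the trip through the pool's result queue.

```python
import pickle


def failing_task():
    # Hypothetical stand-in for a worker function; the real one uses numpy.
    raise ValueError("operands could not be broadcast together")


def capture():
    """Run the task and return the exception object it raised."""
    try:
        failing_task()
    except ValueError as exc:
        return exc


original = capture()
# Roughly what crossing a process boundary does to the exception object.
received = pickle.loads(pickle.dumps(original))

print(original.__traceback__ is not None)  # does the worker side still have frames?
print(received.__traceback__)              # what the receiving side is left with
print(received.args == original.args)      # does the message survive the trip?
```

The exception type and message come back intact, but the copy carries no traceback, which would be consistent with seeing only the `raise self._value` frame in the parent process.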


#60171

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-22 04:24 +1100
Message-ID	<mailman.3013.1385054683.18130.python-list@python.org>
In reply to	#60169

On Fri, Nov 22, 2013 at 4:01 AM, John Ladasky
<john_ladasky@sbcglobal.net> wrote:
> Here is the end of the traceback, starting with the last line of my code: "result = pool.map(evaluate, bundles)".  After that, I'm into Python itself.
>
>   File ".../evaluate.py", line 81, in evaluate
>     result = pool.map(evaluate, bundles)
>   File "/usr/lib/python3.3/multiprocessing/pool.py", line 228, in map
>     return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
>     raise self._value
> ValueError: operands could not be broadcast together with shapes (1,3) (4)
>
> Notice that no line of numpy appears in the traceback?  Still, there are three things that make me think that this error is coming from numpy.

Hmm. This looks like a possible need for the 'raise from' syntax. I
just checked multiprocessing/pool.py from 3.4 alpha, and it has much
what you're seeing there, in the definition of AsyncResult (of which
MapResult is a subclass). The question is, though, how well does the
information traverse the process boundary?

ChrisA


#60172

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2013-11-21 10:25 -0800
Message-ID	<9eb71131-7ca0-4a21-a8f3-98371ee8787e@googlegroups.com>
In reply to	#60171

On Thursday, November 21, 2013 9:24:33 AM UTC-8, Chris Angelico wrote:

> Hmm. This looks like a possible need for the 'raise from' syntax. 

Thank you, Chris, that made me feel like a REAL Python programmer -- I just did some reading, and the "raise from" feature was not implemented until Python 3!  And I might actually need it!  :^)

I think that the article http://www.python.org/dev/peps/pep-3134/ is relevant.  Reading it now.  To be clear: the complete exception chain is stored in every class, it's just not being displayed?  I hope that's the case.  I shouldn't have to install a "raise from" hook in multiprocessing.map_async itself.


#60173

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-22 07:53 +1100
Message-ID	<mailman.3014.1385067196.18130.python-list@python.org>
In reply to	#60172

On Fri, Nov 22, 2013 at 5:25 AM, John Ladasky
<john_ladasky@sbcglobal.net> wrote:
> On Thursday, November 21, 2013 9:24:33 AM UTC-8, Chris Angelico wrote:
>
>> Hmm. This looks like a possible need for the 'raise from' syntax.
>
> Thank you, Chris, that made me feel like a REAL Python programmer -- I just did some reading, and the "raise from" feature was not implemented until Python 3!  And I might actually need it!  :^)
>
> I think that the article http://www.python.org/dev/peps/pep-3134/ is relevant.  Reading it now.  To be clear: the complete exception chain is stored in every class, it's just not being displayed?  I hope that's the case.  I shouldn't have to install a "raise from" hook in multiprocessing.map_async itself.
>

That PEP is all about the 'raise from' notation, yes; but the
exception chaining is presumably not being stored, or else you would
be able to see it in the default printout. So the best solution to
this is, most likely, a patch to multiprocessing to have it chain
exceptions properly. I think that would be considered a bugfix, and
thus back-ported to all appropriate versions (rather than a feature
enhancement that goes in 3.4 or 3.5 only).

What you could try is printing out the __cause__ and __context__ of
the exception, to see if there's anything useful in them; if there's
nothing, the next thing to try would be some kind of wrapper in your
inner handler (the evaluate function) that retains additional
information.
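
For reference, here is a small sketch of what those two attributes hold when chaining does happen. It is an illustration only; `load_shape` and `describe` are made-up names that have nothing to do with the code under discussion.

```python
def load_shape(text):
    """Parse a shape such as "1,3" into a tuple of ints."""
    try:
        return tuple(int(part) for part in text.split(","))
    except ValueError as exc:
        # Explicit chaining: the original error is stored in __cause__.
        raise RuntimeError("bad shape: %r" % text) from exc


def describe(exc):
    """Return the type names found in exc.__cause__ and exc.__context__."""
    return (type(exc.__cause__).__name__, type(exc.__context__).__name__)


try:
    load_shape("1,x")
except RuntimeError as err:
    chained = describe(err)

# An exception that was never raised inside a handler has nothing chained.
unchained = describe(ValueError("plain"))

print(chained)
print(unchained)
```

If both attributes come back as None on the exception caught around pool.map(), nothing was chained on the way out of the pool.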

Oh, something else to try: It might be that the proper exception
chaining would happen, except that the info isn't traversing processes
properly due to pickling or something. Can you patch your code to use
threading instead of multiprocessing? That might reveal something.
(Don't worry about abysmal performance at this stage.)
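
One way to try that with few code changes is `multiprocessing.pool.ThreadPool`, which offers the same `map()` interface but runs tasks in threads, so results are not pickled. The sketch below assumes a hypothetical task named `shape_check`; it is not the evaluate() function from this thread.

```python
import traceback
from multiprocessing.pool import ThreadPool


def shape_check(shape):
    """Hypothetical stand-in for the real task; fails on some inputs."""
    if len(shape) != 2:
        raise ValueError("unexpected shape %r" % (shape,))
    return shape[0] * shape[1]


def failing_frames(items):
    """Map shape_check over items in threads; return frame names of a failure."""
    with ThreadPool(2) as pool:
        try:
            pool.map(shape_check, items)
        except ValueError as exc:
            # With threads the exception object is the one raised in the task,
            # so its traceback still reaches into the task function.
            return [frame.name for frame in traceback.extract_tb(exc.__traceback__)]
    return []


frames = failing_frames([(1, 3), (4,)])
print(frames)
```

If the task function shows up among the frame names here but not with a process pool, that points at the process boundary rather than at the re-raise.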

Hopefully someone with more knowledge of Python's internals can help
out, here. One way or another, I suspect this will result in a tracker
issue.

ChrisA


#60175

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2013-11-21 13:19 -0800
Message-ID	<46e03756-a242-4cdd-a5c0-30fcff34c98c@googlegroups.com>
In reply to	#60173

On Thursday, November 21, 2013 12:53:07 PM UTC-8, Chris Angelico wrote:

> What you could try is 

Suggestion 1:

> printing out the __cause__ and __context__ of 
> the exception, to see if there's anything useful in them; 

Suggestion 2:

> if there's
> nothing, the next thing to try would be some kind of wrapper in your
> inner handler (the evaluate function) that retains additional
> information.

Suggestion 3:

> Oh, something else to try: It might be that the proper exception
> chaining would happen, except that the info isn't traversing processes
> properly due to pickling or something. Can you patch your code to use
> threading instead of multiprocessing? That might reveal something.
> (Don't worry about abysmal performance at this stage.)

I have tried the first suggestion, at the top level of my code.  Here are the modified lines, and the output:

==============================================

try:
    out = evaluate(net, domain)
except ValueError as e:
    print(type(e))
    print(e) # this just produces the exception string itself
    print(e.__context__)
    print(e.__cause__)
    raise  # bare raise re-raises the same exception, so my program actually stops

==============================================

<class 'ValueError'>
operands could not be broadcast together with shapes (1,3) (4) 
None
None

==============================================

So, once I catch the exception, both __context__ and __cause__ are None.

I will proceed as you have suggested -- but if anything comes to mind based on what I have already done, please feel free to chime in!


#60176

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2013-11-21 13:49 -0800
Message-ID	<e90dbd18-97ae-4e92-b521-5818d015244a@googlegroups.com>
In reply to	#60175

Followup:

I didn't need to go as far as Chris Angelico's second suggestion.  I haven't looked at certain parts of my own code for a while, but it turns out that I wrote it REASONABLY logically...

My evaluate() calls another function through pool.map_async() -- _evaluate(), which actually processes the data, on a single CPU.  So I didn't need to hassle with threading, as Chris suggested.  All I did was to import _evaluate in my top-level code, then change my function calls from evaluate() to _evaluate().  Out popped my numpy error, with a proper traceback.  I can now debug it!
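
The same trick can be kept around as a switch. This is a sketch under the assumption that the worker takes one argument per item; `run_tasks` and `halve` are hypothetical names, not functions from my module.

```python
def run_tasks(func, items, pool=None):
    """Apply func to every item, in-process unless a pool is supplied."""
    if pool is None:
        # Serial path: any exception keeps its full traceback.
        return [func(item) for item in items]
    # Parallel path: pool is expected to offer a map() method.
    return pool.map(func, items)


def halve(x):
    """Trivial placeholder task."""
    return x / 2


print(run_tasks(halve, [2, 4]))
```

Calling run_tasks() without a pool while debugging, and with one in production, leaves the rest of the code unchanged.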

I can probably refactor my code to make it even cleaner.  I'll have to deal with the fact that pool.map() requires that all arguments to each subprocess be submitted as a single, iterable object.  I didn't want to have to do this when I only had a single process to run, but perhaps the tradeoff will be acceptable.
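
A rough sketch of the bundling, with a made-up two-argument worker (this `_evaluate` is a placeholder, not my real one). Python 3.3 also added Pool.starmap(), which unpacks each bundle itself; itertools.starmap is used below as its serial counterpart so the example needs no pool.

```python
from itertools import starmap


def _evaluate(weights, inputs):
    """Placeholder worker taking two arguments."""
    return sum(w * x for w, x in zip(weights, inputs))


def _evaluate_bundle(bundle):
    """Adapter with the one-argument signature that pool.map() expects."""
    return _evaluate(*bundle)


bundles = [([1.0, 2.0], [3.0, 4.0]), ([0.5], [8.0])]

via_adapter = list(map(_evaluate_bundle, bundles))  # shape of a pool.map() call
via_starmap = list(starmap(_evaluate, bundles))     # shape of a pool.starmap() call

print(via_adapter)
print(via_starmap)
```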

So now, for anyone who is still reading this: is it your opinion that the traceback that I obtained through multiprocessing.pool._map_async().get() SHOULD have allowed me to see what the ultimate cause of the exception was?  I think so.  Is it a bug?  Should I request a bugfix?  How do I go about doing that?


#60177

From	Ethan Furman <ethan@stoneleaf.us>
Date	2013-11-21 14:32 -0800
Message-ID	<mailman.3016.1385073129.18130.python-list@python.org>
In reply to	#60176

On 11/21/2013 01:49 PM, John Ladasky wrote:
>
> So now, for anyone who is still reading this: is it your
> opinion that the traceback that I obtained through
>  multiprocessing.pool._map_async().get() SHOULD have allowed
>  me to see what the ultimate cause of the exception was?

It would certainly be nice.

> I think so.  Is it a bug?  Should I request a bugfix?  How
> do I go about doing that?

Check out bugs.python.org.  Search for multiprocessing and tracebacks to see if anything is already there; if not, create a new issue.

--
~Ethan~


#60178

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-11-21 17:37 -0500
Message-ID	<mailman.3017.1385073452.18130.python-list@python.org>
In reply to	#60169

On 11/21/2013 12:01 PM, John Ladasky wrote:

This is a case where you need to dig into the code (or maybe docs) a bit.

 > File ".../evaluate.py", line 81, in evaluate
 >   result = pool.map(evaluate, bundles)
 > File "/usr/lib/python3.3/multiprocessing/pool.py", line 228, in map
 >   return self._map_async(func, iterable, mapstar, chunksize).get()

The call to _map_async creates a blank MapResult (a subclass of 
ApplyResult), queues tasks to fill it in, and returns that result 
object, which is filled in as the tasks complete. This call is designed 
to always return, because task exceptions are caught and assigned to 
MapResult._value in both ApplyResult._set and MapResult._set.

result = MapResult(self._cache, chunksize, len(iterable), callback,
                            error_callback=error_callback)
self._taskqueue.put((((result._job, i, mapper, (x,), {})
                            for i, x in enumerate(task_batches)), None))
return result

It is the subsequent call to get() that 'fails', because it raises
the caught exception.

 > File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
 >   raise self._value

ValueError: operands could not be broadcast together with shapes (1,3) (4)

> Notice that no line of numpy appears in the traceback?  Still, there
> are three things that make me think that this error is coming from
> numpy.

It comes from one of your tasks as the 'result', and your tasks use numpy.

> If I could only see the line of numpy code which is generating the
> ValueError, I would have a better chance of spotting the bug in my
> code.

Definitely.

 > So, WHY isn't there any reference to numpy in my traceback?

I suspect that raising the exception may replace its __traceback__ 
attribute.  Anyway, there are three things I might try.

1. Use 3.3.3 or latest 3.4 to see if there is any improvement in output. 
I vaguely remember a tracker issue that might be related.

2. _map_async takes an error_callback arg that defaults to None and 
which is passed on to MapResult. When _value is set to an exception, 
"error_callback(_value)" is called in ._set() before the later .get() 
re-raises it. pool.map does not allow you to set either the (success) 
callback or the error_callback, but pool.map_async does (this is the 
difference between the two methods). So switch to the latter so you can 
pass a function that uses the traceback module to print (or log) the 
traceback attached to _value, assuming that there is one.

3. If that does not work, wrap the current body of your task function in
try: <current suite>
except Exception as e:
   <use traceback module to add traceback to message>
   raise e <or a new exception>
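
Suggestions 2 and 3 can be combined roughly as follows. This is a sketch, not a tested recipe for the original program: `report_traceback` and `shape_check` are hypothetical names, and a ThreadPool is used only so the example runs without pickling anything. With multiprocessing.Pool the task would need to be importable at module level.

```python
import functools
import traceback
from multiprocessing.pool import ThreadPool


def report_traceback(func):
    """Re-raise failures from func with the formatted traceback in the message."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # A string survives pickling even where a traceback object would not.
            raise RuntimeError("task failed:\n" + traceback.format_exc()) from exc
    return wrapper


@report_traceback
def shape_check(shape):
    """Hypothetical task that fails on some inputs."""
    if len(shape) != 2:
        raise ValueError("unexpected shape %r" % (shape,))
    return shape[0] * shape[1]


failures = []
with ThreadPool(2) as pool:
    # error_callback receives the stored exception once the whole map has finished.
    pending = pool.map_async(shape_check, [(1, 3), (4,)],
                             error_callback=failures.append)
    pending.wait()

for failure in failures:
    print(failure)
```

Because the traceback text is part of the exception message, it should remain visible whichever way the exception is later re-raised or logged.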

-- 
Terry Jan Reedy


#60196

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2013-11-21 19:57 -0800
Message-ID	<081af7df-2330-4b8b-abbf-4707edfcc17a@googlegroups.com>
In reply to	#60169

On Thursday, November 21, 2013 2:32:08 PM UTC-8, Ethan Furman wrote:
> Check out bugs.python.org.  Search for multiprocessing and tracebacks to see 
> if anything is already there; if not, create a new issue.

And on Thursday, November 21, 2013 2:37:13 PM UTC-8, Terry Reedy wrote:

> 1. Use 3.3.3 or latest 3.4 to see if there is any improvement in output.  
> I vaguely remember a tracker issue that might be related.

All right, there appear to be two recent bug reports which are relevant.

http://bugs.python.org/issue13831
http://bugs.python.org/issue17836

The comments in the first link, from Richard Oudkerk, appear to indicate that pickling an Exception (so that it can be sent between processes) is difficult, perhaps impossible.  I have never completely understood what can be pickled, and what cannot -- or, for that matter, why data needs to be pickled to pass it between processes. 

In any case, a string representation of the traceback can be pickled.  For debugging purposes, that can still help.  So, if I understand everything correctly, in this link...

http://hg.python.org/cpython/rev/c4f92b597074/

...Richard submits his "hack" (his description) to Python 3.4 which pickles and passes the string.  When time permits, I'll try it out.  Or maybe I'll wait, since Python 3.4.0 is still in alpha.


#60200

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-22 15:24 +1100
Message-ID	<mailman.3025.1385094254.18130.python-list@python.org>
In reply to	#60196

On Fri, Nov 22, 2013 at 2:57 PM, John Ladasky
<john_ladasky@sbcglobal.net> wrote:
> or, for that matter, why data needs to be pickled to pass it between processes.

Oh, that part's easy. Let's leave the multiprocessing module out of it
for the moment; imagine you spin up two completely separate instances
of Python. Create some object in one of them; now, transfer it to the
other. How are you going to do it?

Ultimately, the operating system isn't going to give you facilities
for moving complex objects around - what you almost exclusively get is
streams of bytes (or occasionally messaged chunks with lengths, but
still of bytes). Pickling is one method of turning an object into a
stream of bytes, in such a way that it can be turned back into an
equivalent object on the other side. And therein is the problem with
exceptions; since the traceback includes references to stack frames
and such, it's not as simple as saying "Two to beam up" and hearing
the classic sound effect - somehow you need to transfer all the
appropriate information across processes.
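
A quick way to see the difference, as a sketch with a hypothetical helper named `grab_traceback`: the traceback object itself refuses to be pickled, while its formatted text is an ordinary string and goes through without complaint.

```python
import pickle
import traceback


def grab_traceback():
    """Provoke an error and hand back its traceback object."""
    try:
        [][0]
    except IndexError as exc:
        return exc.__traceback__


tb = grab_traceback()

try:
    pickle.dumps(tb)
    picklable = True
except (TypeError, pickle.PicklingError):
    # The traceback refers to live stack frames, which pickle cannot serialise.
    picklable = False

# The formatted text is just a str, so it can be sent anywhere.
text = "".join(traceback.format_tb(tb))
restored = pickle.loads(pickle.dumps(text))

print(picklable)
print(restored)
```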

ChrisA


#60240 — Why pickling (was: Traceback when using multiprocessing)

From	John Ladasky <john_ladasky@sbcglobal.net>
Date	2013-11-22 08:38 -0800
Subject	Why pickling (was: Traceback when using multiprocessing)
Message-ID	<a8bf86c1-bf44-4bcf-813e-5ad4fdedde63@googlegroups.com>
In reply to	#60200

On Thursday, November 21, 2013 8:24:05 PM UTC-8, Chris Angelico wrote:

> Oh, that part's easy. Let's leave the multiprocessing module out of it
> for the moment; imagine you spin up two completely separate instances
> of Python. Create some object in one of them; now, transfer it to the
> other. How are you going to do it?

For what definition of "completely separate"?

If I have two instances of the same version of the Python interpreter running on the same hardware, and the same operating system, I expect I would just copy a block of memory from one interpreter to the other, and then write some new pointers.  That kind of data sharing has to be the most common kind.  It's also the simplest.

I understand that pickling allows sharing of Python objects between Python interpreters even if those interpreters run on different CPUs with different memory architecture, different operating systems, etc.  It just seems like overkill to me to use pickling in the simple case.


#60253 — Re: Why pickling (was: Traceback when using multiprocessing)

From	Chris Angelico <rosuav@gmail.com>
Date	2013-11-23 10:50 +1100
Subject	Re: Why pickling (was: Traceback when using multiprocessing)
Message-ID	<mailman.3058.1385164257.18130.python-list@python.org>
In reply to	#60240

On Sat, Nov 23, 2013 at 3:38 AM, John Ladasky
<john_ladasky@sbcglobal.net> wrote:
> On Thursday, November 21, 2013 8:24:05 PM UTC-8, Chris Angelico wrote:
>
>> Oh, that part's easy. Let's leave the multiprocessing module out of it
>> for the moment; imagine you spin up two completely separate instances
>> of Python. Create some object in one of them; now, transfer it to the
>> other. How are you going to do it?
>
> For what definition of "completely separate"?
>
> If I have two instances of the same version of the Python interpreter running on the same hardware, and the same operating system, I expect I would just copy a block of memory from one interpreter to the other, and then write some new pointers.  That kind of data sharing has to be the most common kind.  It's also the simplest.

Okay, so you copy a block of memory. Now how are you going to
guarantee that you picked up everything that object references? Python
objects frequently reference other objects:

send_me = [1.0, 2.0, 3.0]

The block of memory might have the addresses of those three floats,
but that'll be invalid in the target. Somehow you need to package up
this object and everything else you need.

Ultimately, you need some system for turning a single object reference
(a pointer, if you like) into the entire package of information needed
to recreate that object on the other side. That's what pickling is.
It's a compact format (with people fighting to keep it compact; there's
a current discussion elsewhere about that) that can be easily
transferred around, which refcounted blocks of memory can't.
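
To make that concrete, here is a small sketch using only the standard pickle module (the variable names are arbitrary). Pickling walks the references for you, and the rebuilt structure is equal to the original without sharing any memory with it.

```python
import pickle

shared = [1.0, 2.0, 3.0]
send_me = {"a": shared, "b": shared}   # two references to one list

wire = pickle.dumps(send_me)           # a plain bytes object, safe to ship
clone = pickle.loads(wire)

print(clone == send_me)                # same contents?
print(clone["a"] is clone["b"])        # is the sharing inside the object kept?
print(clone["a"] is shared)            # is it the sender's list?
```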

ChrisA


#60210

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-11-22 09:09 +0000
Message-ID	<mailman.3031.1385111409.18130.python-list@python.org>
In reply to	#60196

On 22/11/2013 03:57, John Ladasky wrote:
>
> ...Richard submits his "hack" (his description) to Python 3.4 which pickles and passes the string.  When time permits, I'll try it out.  Or maybe I'll wait, since Python 3.4.0 is still in alpha.
>

FTR beta 1 is due this Sunday 24/11/2013.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

