From: Demian Brecht
To: Chris Angelico
Cc: Python
Date: Wed, 14 Aug 2013 07:23:26 -0700
Subject: Re: Digging into multiprocessing
Newsgroups: comp.lang.python

Awesome, thanks for the detailed response, Chris.

On Tue, Aug 13, 2013 at 8:03 AM, Chris Angelico wrote:
> On Tue, Aug 13, 2013 at 12:17 AM, Demian Brecht wrote:
>> Hi all,
>>
>> Some work that I'm doing atm is in some serious need of
>> parallelization. As such, I've been digging into the multiprocessing
>> module more than I've had to before and I had a few questions come up
>> as a result:
>>
>> (Running 2.7.5+ on OSX)
>>
>> 1. From what I've read, a new Python interpreter instance is kicked
>> off for every worker. My immediate assumption was that the file that
>> the code was in would be reloaded for every instance. After some
>> digging, this is obviously not the case (print __name__ at the top of
>> the file only yields a single output line). So, I'm assuming that
>> there's some optimization that passes off the bytecode within the
>> interpreter? How, exactly, does this work? (I couldn't really find much
>> in the docs about it, or am I just not looking in the right place?)
>
> I don't know about OSX specifically, but I believe it forks, same as
> on Linux. That means all your initialization code is done once. Be
> aware that this is NOT the case on Windows.
>
> http://en.wikipedia.org/wiki/Fork_(operating_system)
>
> Effectively, code execution proceeds down a single thread until the
> point of forking, and then the fork call returns twice. Can be messy
> to explain but it makes great sense once you grok it!
>
>> 2. For cases using methods such as map_async/wait, once the bytecode
>> has been passed into the child process, `target` is called `n` times
>> until the current queue is empty. Is this correct?
>
> That would be about right, yes.
> The intention is that it's equivalent
> to map(), only it splits the work across multiple processes; so the
> expectation is that it will call target for each yielded item in the
> iterable.
>
>> 3. Because __main__ is only run when the root process imports, if
>> using global, READ-ONLY objects, such as, say, a database connection,
>> then it might be better from a performance standpoint to initialize
>> that at main, relying on the interpreter references to be passed
>> around correctly. I've read some blogs and such that suggest that you
>> should create a new database connection within your child process
>> targets (or code called into by the targets). This seems to be less
>> than optimal to me if my assumption is correct.
>
> This depends hugely on the objects you're working with. If your
> database connection uses a TCP socket, for instance, all forked
> processes will share the same socket, which will most likely result in
> interleaved writes and messed-up reads. But with a log file, that
> might be okay (especially if you have some kind of atomicity guarantee
> that ensures that individual log entries don't interleave). The
> problem isn't really the Python objects (which will have been happily
> cloned by the fork() procedure), but the OS-level resources used.
>
> With a good database like PostgreSQL, and reasonable numbers of
> workers (say, 10-50, rather than 1000-5000), you should be able to
> simply establish separate connections for each subprocess without
> worrying about performance. If you really need billions of worker
> processes, it might be best to use one of the multiprocessing module's
> queueing/semaphoring facilities and either have one process that does
> all databasing, or let them all use it but serially. But if you can
> manage with separate connections, that would be the easiest, safest,
> and simplest to debug.
>
>> 4. Related to 3, read-only objects that are initialized prior to being
>> passed into a sub-process are safe to reuse as long as they are
>> treated as being immutable. Any other objects should use one of the
>> shared memory features.
>>
>> Is this more or less correct, or am I just off my rocker?
>
> When you fork, each process will get its own clone of the objects in
> the parent. For read-only objects (module-level constants and such),
> this is fine, as you say. The issue is if you want another process to
> "see" the change you made. That's when you need some form of shared
> data.
>
> So, yes, more or less correct; at least, what you've said is mostly
> right for Unix - there may be some additional caveats for OSX
> specifically that I'm not aware of. But I expect they'll be minor;
> it's mainly Windows, which doesn't *have* fork(2), where there are
> major differences.
>
> ChrisA
> --
> http://mail.python.org/mailman/listinfo/python-list

--
Demian Brecht
http://demianbrecht.github.com
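
A minimal sketch of the "separate connection per subprocess" advice from
point 3, together with the fork-time cloning described in point 4. It is
an illustration under assumptions, not code from the thread: it targets
Python 3 on a Unix system (the "fork" start method is not available on
Windows), it uses an in-memory sqlite3 database purely as a stand-in for
a real PostgreSQL connection, and the names COUNTER, worker and run are
hypothetical.

```python
import multiprocessing as mp
import os
import sqlite3

COUNTER = 0  # module-level state; each forked child gets its own clone


def worker(n, results):
    """Runs in a child process: open a private connection and report back."""
    global COUNTER
    COUNTER += n  # changes only this child's copy, never the parent's
    # Assumption: an in-memory sqlite3 DB stands in for a real per-process
    # database connection (e.g. to PostgreSQL).
    conn = sqlite3.connect(":memory:")
    try:
        (square,) = conn.execute("SELECT ? * ?", (n, n)).fetchone()
    finally:
        conn.close()
    results.put((n, square, os.getpid()))


def run(numbers):
    """Fork one child per number and collect (n, n*n, child_pid) tuples."""
    ctx = mp.get_context("fork")  # Unix only; Windows has no fork
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(n, results)) for n in numbers]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    collected = [results.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(collected)


if __name__ == "__main__":
    for n, square, pid in run([1, 2, 3]):
        print(n, square, pid)
    print("parent COUNTER is still", COUNTER)
```

Each child opens and closes its own connection after the fork, so no
OS-level socket or file handle is shared between processes. The parent's
COUNTER is unchanged afterwards because every child only modified its own
clone; making such a change visible elsewhere would need one of the
shared-data facilities mentioned above.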