From: Michael Selik
Newsgroups: comp.lang.python
Subject: Re: Creating Dict of Dict of Lists with joblib and Multiprocessing
Date: Wed, 20 Apr 2016 15:17:31 +0000
References: <65DB2690-8988-41A7-B5BC-EA0390CDA1DF@nih.gov>

On Wed, Apr 20, 2016 at 10:50 AM Sims, David (NIH/NCI) [C] <david.sims2@nih.gov> wrote:

> Hi,
>
> Cross posted at
> http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing,
> but thought I'd try here too as no responses there so far.
>
> A bit new to python and very new to parallel processing in python. I have
> a script that will process a datafile and generate a dict of dicts.
> However, as I need to run this task on hundreds to thousands of these files
> and ultimately collate the data, I thought parallel processing made a lot
> of sense. However, I can't seem to figure out how to create a data
> structure. Minimal script without all the helper functions:
>
> #!/usr/bin/python
> import sys
> import os
> import re
> import subprocess
> import multiprocessing
> from joblib import Parallel, delayed
> from collections import defaultdict
> from pprint import pprint
>
> def proc_vcf(vcf,results):
>     sample_name = vcf.rstrip('.vcf')
>     results.setdefault(sample_name, {})
>
>     # Run Helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a list of entries. Expect a dict of dict of lists
>     all_vars = run_cmd('vcfExtractor',vcf)
>     results[sample_name]['all_vars'] = parse_variant_data(all_vars,'all')
>
>     # Run Helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a different list of data based on a different set of criteria.
>     mois = run_cmd('moi_report', vcf)
>     results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
>     return results
>
> def main():
>     input_files = sys.argv[1:]
>
>     # collected_data = defaultdict(lambda: defaultdict(dict))
>     collected_data = {}
>
>     # Parallel Processing version
>     # num_cores = multiprocessing.cpu_count()
>     # Parallel(n_jobs=num_cores)(delayed(proc_vcf)(vcf,collected_data) for vcf in input_files)
>
>     # for vcf in input_files:
>     #     proc_vcf(vcf, collected_data)
>
>     pprint(dict(collected_data))
>     return
>
> if __name__=="__main__":
>     main()
>
>
> Hard to provide source data as it's very large, but basically, the dataset
> will generate a dict of dicts of lists that contain two sets of data for
> each input keyed by sample and data type:
>
> { 'sample1' : {
>       'all_vars' : [
>           'data_val1',
>           'data_val2',
>           'etc'],
>       'mois' : [
>           'data_val_x',
>           'data_val_y',
>           'data_val_z']
>       }
>   'sample2' : {
>       'all_vars' : [
>           .
>           .
>           .
>           ]
>       }
> }
>
> If I run it without trying to multiprocess, not a problem. I can't figure
> out how to parallelize this and create the same data structure. I've tried
> to use defaultdict to create a defaultdict in main() to pass along, as well
> as a few other iterations, but I can't seem to get it right (getting key
> errors, pickle errors, etc.). Can anyone help me with the proper way to do
> this? I think I'm not making / initializing / working with the data
> structure correctly, but maybe my whole approach is ill conceived?
>

Processes cannot share memory, so each subprocess gets its own copy of
collected_data, made once at the time you pass it in. Whatever a worker adds
to its copy never reaches the dict in the parent process.

There's an undocumented ThreadPool that works the same as the process Pool (
https://docs.python.org/3.5/library/multiprocessing.html#using-a-pool-of-workers
). ThreadPool will share memory across your subthreads. In the example I
linked to, just replace ``from multiprocessing import Pool`` with ``from
multiprocessing.pool import ThreadPool`` and use ThreadPool wherever the
example uses Pool.

How compute-intensive is your task? If it's mostly disk-read-intensive rather
than compute-intensive, then threads are all you need.
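
Here is a minimal sketch of that swap, not a drop-in replacement for your script. Your run_cmd() and parse_variant_data() helpers were not posted, so the body of proc_vcf() below fills in made-up placeholder lists, and collect() and n_threads are names I invented for illustration. It relies on every file mapping to a different sample name, so each thread writes its own key in the shared dict.

```python
import os
from multiprocessing.pool import ThreadPool
from pprint import pprint


def proc_vcf(vcf, results):
    # splitext() removes the extension; rstrip('.vcf') would instead strip
    # any trailing '.', 'v', 'c' or 'f' characters from the name.
    sample_name = os.path.splitext(os.path.basename(vcf))[0]
    entry = results.setdefault(sample_name, {})

    # Placeholder data (assumption): the real script would call
    # run_cmd() and parse_variant_data() here.
    entry['all_vars'] = ['%s_var%d' % (sample_name, i) for i in range(3)]
    entry['mois'] = ['%s_moi%d' % (sample_name, i) for i in range(2)]


def collect(input_files, n_threads=4):
    # Threads live in one process, so they all mutate the same dict.
    collected_data = {}
    with ThreadPool(n_threads) as pool:
        # map() blocks until every file has been processed.
        pool.map(lambda vcf: proc_vcf(vcf, collected_data), input_files)
    return collected_data


if __name__ == "__main__":
    pprint(collect(['sample1.vcf', 'sample2.vcf']))
```

If the work turns out to be compute-bound and you go back to processes, the usual pattern is to have each worker return its own small dict and merge the returned dicts in the parent, rather than passing collected_data in.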