Re: Creating Dict of Dict of Lists with joblib and Multiprocessing

From	Michael Selik <michael.selik@gmail.com>
Newsgroups	comp.lang.python
Subject	Re: Creating Dict of Dict of Lists with joblib and Multiprocessing
Date	Wed, 20 Apr 2016 15:17:31 +0000
Lines	107
Message-ID	<mailman.32.1461165461.12923.python-list@python.org> (permalink)


On Wed, Apr 20, 2016 at 10:50 AM Sims, David (NIH/NCI) [C] <david.sims2@nih.gov> wrote:

> Hi,
>
> Cross posted at
> http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing,
> but thought I'd try here too as no responses there so far.
>
> A bit new to python and very new to parallel processing in python.  I have
> a script that will process a datafile and generate a dict of dicts.
> However, as I need to run this task on hundreds to thousands of these files
> and ultimately collate the data, I thought parallel processing made a lot
> of sense.  However, I can't seem to figure out how to create a data
> structure.  Minimal script without all the helper functions:
>
> #!/usr/bin/python
> import sys
> import os
> import re
> import subprocess
> import multiprocessing
> from joblib import Parallel, delayed
> from collections import defaultdict
> from pprint import pprint
>
> def proc_vcf(vcf,results):
>     sample_name = vcf.rstrip('.vcf')
>     results.setdefault(sample_name, {})
>
>     # Run Helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a list of entries. Expect a dict of dict of lists
>     all_vars = run_cmd('vcfExtractor',vcf)
>     results[sample_name]['all_vars'] = parse_variant_data(all_vars,'all')
>
>     # Run Helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a different list of data based on a different set of criteria.
>     mois = run_cmd('moi_report', vcf)
>     results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
>     return results
>
> def main():
>     input_files = sys.argv[1:]
>
>     # collected_data = defaultdict(lambda: defaultdict(dict))
>     collected_data = {}
>
>     # Parallel Processing version
>     # num_cores = multiprocessing.cpu_count()
>     # Parallel(n_jobs=num_cores)(delayed(proc_vcf)(vcf,collected_data)
>     #     for vcf in input_files)
>
>     # for vcf in input_files:
>         # proc_vcf(vcf, collected_data)
>
>     pprint(dict(collected_data))
>     return
>
> if __name__=="__main__":
>     main()
>
>
> Hard to provide source data as it's very large, but basically, the dataset
> will generate a dict of dicts of lists that contain two sets of data for
> each input keyed by sample and data type:
>
> { 'sample1' : {
>     'all_vars' : [
>         'data_val1',
>         'data_val2',
>         'etc'],
>     'mois' : [
>         'data_val_x',
>         'data_val_y',
>         'data_val_z']
>     }
>     'sample2' : {
>        'all_vars' : [
>        .
>        .
>        .
>        ]
>     }
> }
>
> If I run it without trying to multiprocess, not a problem.  I can't figure
> out how to parallelize this and create the same data structure.  I've tried
> to use defaultdict to create a defaultdict in main() to pass along, as well
> as a few other iterations, but I can't seem to get it right (getting key
> errors, pickle errors, etc.).  Can anyone help me with the proper way to do
> this?  I think I'm not making / initializing / working with the data
> structure correctly, but maybe my whole approach is ill conceived?
>


Processes do not share memory. Each subprocess gets its own copy of
collected_data, made once at the time you pass it in, so whatever a worker
adds to its copy never shows up in the dict in your main process. There's an
undocumented ThreadPool that works the same as the process Pool
(https://docs.python.org/3.5/library/multiprocessing.html#using-a-pool-of-workers).
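
To make the copying concrete, here is a minimal sketch. It does not start a
real subprocess; it assumes the argument reaches the worker through a pickle
round trip (which is how Pool-style task arguments travel) and imitates that
step by hand. The function name ``worker`` and the sample values are made up
for illustration.

```python
import pickle


def worker(results):
    # Mutates whatever dict it is handed and returns it.
    results["sample1"] = {"all_vars": ["data_val1"]}
    return results


collected_data = {}

# Imitate handing the argument to another process: it is pickled on the way
# out and unpickled on the other side, so the worker only ever sees a copy.
worker_copy = pickle.loads(pickle.dumps(collected_data))
returned = worker(worker_copy)

print(collected_data)  # the dict in the "parent" is untouched
print(returned)        # only the copy was filled in
```

The data only gets back to the parent if the worker returns it and the parent
collects the return values.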

ThreadPool will share memory across your threads. In the example I linked
to, just replace ``from multiprocessing import Pool`` with ``from
multiprocessing.pool import ThreadPool``.
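
A rough sketch of what that could look like for your dict of dicts of lists.
The helper ``proc_file`` and the values it stores are placeholders I made up,
not your run_cmd() / parse_variant_data() helpers. It relies on every thread
writing to its own key; if several threads updated the same entry you would
want a lock.

```python
from multiprocessing.pool import ThreadPool


def proc_file(name, results):
    # Placeholder for the real parsing; the stored values are invented.
    sample_name = name[:-len(".vcf")] if name.endswith(".vcf") else name
    results[sample_name] = {
        "all_vars": [name + ":var1", name + ":var2"],
        "mois": [name + ":moi1"],
    }


collected_data = {}
input_files = ["sample1.vcf", "sample2.vcf", "sample3.vcf"]

# Threads live in the same process, so they all see the same dict object.
with ThreadPool(3) as pool:
    pool.starmap(proc_file, [(f, collected_data) for f in input_files])

print(sorted(collected_data))
```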

How compute-intensive is your task? If it's mostly disk-read-intensive
rather than compute-intensive, then threads are all you need, since a thread
that is waiting on I/O lets the other threads run in the meantime.
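
As a toy illustration of the waiting case, the sketch below uses time.sleep()
as a stand-in for a slow disk read (an assumption; your real task may behave
differently). ``fake_read`` and the file names are hypothetical. The four
waits overlap instead of running back to back, and the results are gathered
from the return values rather than from a shared dict.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_read(name):
    time.sleep(0.2)  # stands in for waiting on the disk
    return name, len(name)


input_files = ["a.vcf", "bb.vcf", "ccc.vcf", "dddd.vcf"]

start = time.perf_counter()
with ThreadPool(4) as pool:
    # map() returns the workers' results in input order.
    results = dict(pool.map(fake_read, input_files))
elapsed = time.perf_counter() - start

print(results)
print("elapsed: %.2f seconds" % elapsed)
```

If the work were pure Python number crunching instead, the threads would
mostly take turns and you would be back to looking at processes.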
