| Path | csiph.com!news.mixmin.net!newsreader4.netcologne.de!news.netcologne.de!fu-berlin.de!uni-berlin.de!not-for-mail |
|---|---|
| From | Michael Selik <michael.selik@gmail.com> |
| Newsgroups | comp.lang.python |
| Subject | Re: Creating Dict of Dict of Lists with joblib and Multiprocessing |
| Date | Wed, 20 Apr 2016 15:17:31 +0000 |
| Lines | 107 |
| Message-ID | <mailman.32.1461165461.12923.python-list@python.org> (permalink) |
| References | <65DB2690-8988-41A7-B5BC-EA0390CDA1DF@nih.gov> <CAGgTfkNSJeV7dbm3L-wkeg0+AEDbX167m2Hfvi5JzePckZogPA@mail.gmail.com> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset=UTF-8 |
On Wed, Apr 20, 2016 at 10:50 AM Sims, David (NIH/NCI) [C] <david.sims2@nih.gov> wrote:
> Hi,
>
> Cross posted at
> http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing,
> but thought I'd try here too as no responses there so far.
>
> A bit new to python and very new to parallel processing in python. I have
> a script that will process a datafile and generate a dict of dicts.
> However, as I need to run this task on hundreds to thousands of these files
> and ultimately collate the data, I thought parallel processing made a lot
> of sense. However, I can't seem to figure out how to create a data
> structure. Minimal script without all the helper functions:
>
> #!/usr/bin/python
> import sys
> import os
> import re
> import subprocess
> import multiprocessing
> from joblib import Parallel, delayed
> from collections import defaultdict
> from pprint import pprint
>
> def proc_vcf(vcf,results):
>     sample_name = vcf.rstrip('.vcf')
>     results.setdefault(sample_name, {})
>
>     # Run Helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a list of entries. Expect a dict of dict of lists
>     all_vars = run_cmd('vcfExtractor',vcf)
>     results[sample_name]['all_vars'] = parse_variant_data(all_vars,'all')
>
>     # Run Helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a different list of data based on a different set of criteria.
>     mois = run_cmd('moi_report', vcf)
>     results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
>     return results
>
> def main():
>     input_files = sys.argv[1:]
>
>     # collected_data = defaultdict(lambda: defaultdict(dict))
>     collected_data = {}
>
>     # Parallel Processing version
>     # num_cores = multiprocessing.cpu_count()
>     # Parallel(n_jobs=num_cores)(delayed(proc_vcf)(vcf,collected_data) for vcf in input_files)
>
>     # for vcf in input_files:
>     #     proc_vcf(vcf, collected_data)
>
>     pprint(dict(collected_data))
>     return
>
> if __name__=="__main__":
>     main()
>
>
> Hard to provide source data as it's very large, but basically, the dataset
> will generate a dict of dicts of lists that contain two sets of data for
> each input keyed by sample and data type:
>
> { 'sample1' : {
>       'all_vars' : [
>           'data_val1',
>           'data_val2',
>           'etc'],
>       'mois' : [
>           'data_val_x',
>           'data_val_y',
>           'data_val_z']
>   },
>   'sample2' : {
>       'all_vars' : [
>           .
>           .
>           .
>       ]
>   }
> }
>
> If I run it without trying to multiprocess, not a problem. I can't figure
> out how to parallelize this and create the same data structure. I've tried
> to use defaultdict to create a defaultdict in main() to pass along, as well
> as a few other iterations, but I can't seem to get it right (getting key
> errors, pickle errors, etc.). Can anyone help me with the proper way to do
> this? I think I'm not making / initializing / working with the data
> structure correctly, but maybe my whole approach is ill conceived?
>

Processes do not share memory, so your collected_data is copied into each
subprocess once, at the time you pass it. Changes a worker makes to its copy
never reach the dict in the parent process.

There's an undocumented ThreadPool that works the same as the process Pool (
https://docs.python.org/3.5/library/multiprocessing.html#using-a-pool-of-workers
). ThreadPool will share memory across your threads. In the example I linked
to, just replace ``from multiprocessing import Pool`` with ``from
multiprocessing.pool import ThreadPool``.
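
A minimal sketch of that swap, assuming a stand-in proc_vcf() that fills in placeholder values where the real script would call run_cmd() and parse_variant_data(). The names collect() and n_threads are made up for illustration. Because the workers are threads, they all write into the same collected_data dict, each under its own sample key:

```python
from multiprocessing.pool import ThreadPool


def proc_vcf(vcf, results):
    # Strip the suffix explicitly: str.rstrip('.vcf') removes any trailing
    # run of the characters '.', 'v', 'c', 'f', not the literal suffix.
    sample_name = vcf[:-len('.vcf')] if vcf.endswith('.vcf') else vcf
    # Placeholder data (an assumption); the real script would call its
    # own helper functions here.
    results[sample_name] = {
        'all_vars': ['{}:all'.format(sample_name)],
        'mois': ['{}:moi'.format(sample_name)],
    }


def collect(input_files, n_threads=4):
    collected_data = {}
    pool = ThreadPool(n_threads)
    try:
        # Threads share the parent's memory, so every worker mutates
        # the same dict object; nothing is pickled or copied.
        pool.map(lambda vcf: proc_vcf(vcf, collected_data), input_files)
    finally:
        pool.close()
        pool.join()
    return collected_data
```

Calling collect(sys.argv[1:]) from main() would then give the dict of dicts of lists described in the question, one top-level key per input file.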
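
How compute-intensive is your task? If it's mostly disk-read-intensive
rather than compute-intensive, then threads are all you need: they can wait
on I/O in parallel, even though CPython runs only one thread's Python code
at a time.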
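
If the task does turn out to be compute-intensive and you need processes, one common pattern (my assumption, not something tested against your data) is to have each worker return its piece and build the dict in the parent, rather than passing a dict in. The sketch below uses placeholder values in proc_vcf(), and merge_results() and collect_parallel() are hypothetical names:

```python
import multiprocessing


def proc_vcf(vcf):
    """Return (sample_name, data) instead of mutating a shared dict."""
    sample_name = vcf[:-len('.vcf')] if vcf.endswith('.vcf') else vcf
    # Placeholder data (an assumption); the real script would call its
    # own helper functions here.
    data = {
        'all_vars': ['{}:all'.format(sample_name)],
        'mois': ['{}:moi'.format(sample_name)],
    }
    return sample_name, data


def merge_results(pairs):
    # Runs in the parent process, so the resulting dict is the parent's own.
    return dict(pairs)


def collect_parallel(input_files):
    pool = multiprocessing.Pool()
    try:
        # Each return value is pickled and sent back to the parent.
        pairs = pool.map(proc_vcf, input_files)
    finally:
        pool.close()
        pool.join()
    return merge_results(pairs)
```

Call collect_parallel() from under the ``if __name__=="__main__":`` guard, since worker processes may re-import the main module on some platforms.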