Generator using item[n-1] + item[n] memory

From	Nick Timkovich <prometheus235@gmail.com>
Date	Fri, 14 Feb 2014 16:27:58 -0600
Subject	Generator using item[n-1] + item[n] memory
To	python-list@python.org
Newsgroups	comp.lang.python
Message-ID	<mailman.6941.1392416906.18130.python-list@python.org> (permalink)


I have a Python 3.x program that processes several large text files that
contain sizeable arrays of data that can occasionally brush up against the
memory limit of my puny workstation.  From some basic memory profiling, it
seems like when using the generator, the memory usage of my script balloons
to hold consecutive elements, using up to twice the memory I expect.

I made a simple, standalone example to test the generator, and I get
similar results in Python 2.7, 3.3, and 3.4.  My test code follows.
`memory_usage()` is a modified version of [this function from an SO
question](http://stackoverflow.com/a/898406/194586), which reads
`/proc/self/status` and agrees with `top` as I watch it.  `resource` is
probably a more cross-platform method:

###############

import sys, resource, gc, time

def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = [1] * int(size * 1e6)
        #time.sleep(1)
        yield data

def consumer():
    for data in biggen():
        rusage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is in kilobytes on Linux (bytes on macOS)
        peak_mb = rusage.ru_maxrss/1024.0
        print('Peak: {0:6.1f} MB, Data Len: {1:6.1f} M'.format(
                peak_mb, len(data)/1e6))
        #print(memory_usage())

        data = None  # go
        del data     # away
        gc.collect() # please.

# def memory_usage():
#     """Memory usage of the current process, requires /proc/self/status"""
#     # http://stackoverflow.com/a/898406/194586
#     result = {'peak': 0, 'rss': 0}
#     for line in open('/proc/self/status'):
#         parts = line.split()
#         key = parts[0][2:-1].lower()
#         if key in result:
#             result[key] = int(parts[1])/1024.0
#     return 'Peak: {peak:6.1f} MB, Current: {rss:6.1f} MB'.format(**result)

print(sys.version)
consumer()

###############
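The script divides `ru_maxrss` by 1024, which gives megabytes on Linux, where the value is reported in kilobytes. On macOS the same field is reported in bytes. A minimal sketch of a helper that normalizes the unit follows; the name `peak_rss_mb` is hypothetical and not part of the original script, and only the Linux and macOS conventions are handled:

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process, in MB (hypothetical helper)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        # macOS reports ru_maxrss in bytes
        return peak / 1024.0 ** 2
    # Linux reports ru_maxrss in kilobytes
    return peak / 1024.0


if __name__ == "__main__":
    print("Peak: {0:6.1f} MB".format(peak_rss_mb()))
```

Note that `ru_maxrss` is a high-water mark: it never decreases, so it shows when a new peak is reached but not when memory is released.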

In practice I'll process data coming from such a generator loop, saving
just what I need, then discard it.

When I run the above script, and two large elements come in series (the
data size can be highly variable), it seems like Python computes the next
before freeing the previous, leading to up to double the memory usage.

    $ python genmem.py
    2.7.3 (default, Sep 26 2013, 20:08:41)
    [GCC 4.6.3]
    Peak:    7.9 MB, Data Len:    1.0 M
    Peak:   11.5 MB, Data Len:    1.0 M
    Peak:   45.8 MB, Data Len:   10.0 M
    Peak:   45.9 MB, Data Len:    1.0 M
    Peak:   45.9 MB, Data Len:    1.0 M
    Peak:   45.9 MB, Data Len:   10.0 M
    #        ^^  not much different versus previous 10M-list
    Peak:   80.2 MB, Data Len:   10.0 M
    #        ^^  same list size, but new memory peak at roughly twice the usage
    Peak:   80.2 MB, Data Len:    1.0 M
    Peak:   80.2 MB, Data Len:    1.0 M
    Peak:   80.2 MB, Data Len:   10.0 M
    Peak:   80.2 MB, Data Len:   10.0 M
    Peak:  118.3 MB, Data Len:   20.0 M
    #        ^^  and again...  (20+10)*c
    Peak:  118.3 MB, Data Len:    1.0 M
    Peak:  118.3 MB, Data Len:    1.0 M
    Peak:  118.3 MB, Data Len:   20.0 M
    Peak:  156.5 MB, Data Len:   20.0 M
    #        ^^  and again. (20+20)*c
    Peak:  156.5 MB, Data Len:    1.0 M
    Peak:  156.5 MB, Data Len:    1.0 M

The crazy belt-and-suspenders-and-duct-tape approach of `data = None`, `del
data`, and `gc.collect()` has no effect on the peak.

I'm pretty sure the generator itself is not doubling up on memory: if it
were, a single large value would raise the peak in the *same iteration* in
which it appeared.  Instead, the peak only rises when large objects arrive
consecutively.
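One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: when the generator resumes, its local name `data` still refers to the previous list while the next one is being built, so both exist until the assignment rebinds the name. The sketch below illustrates this with a small stand-in class instead of large lists. `Chunk`, `naive_gen`, `releasing_gen`, and `peak_live` are all hypothetical names, and the counts rely on CPython's reference counting, which runs `__del__` as soon as the last reference disappears.

```python
class Chunk:
    """Stand-in for a large list; counts how many instances are alive."""
    live = 0
    peak = 0

    def __init__(self):
        Chunk.live += 1
        Chunk.peak = max(Chunk.peak, Chunk.live)

    def __del__(self):
        Chunk.live -= 1


def naive_gen(n):
    for _ in range(n):
        data = Chunk()  # built while `data` still names the previous Chunk
        yield data


def releasing_gen(n):
    for _ in range(n):
        data = Chunk()
        yield data
        del data        # drop the generator's reference before building the next


def peak_live(gen_func, n=5):
    """Largest number of Chunks alive at once while consuming gen_func(n)."""
    Chunk.live = Chunk.peak = 0
    for item in gen_func(n):
        del item        # the consumer drops its own reference right away
    return Chunk.peak


if __name__ == "__main__":
    print("naive:    ", peak_live(naive_gen))
    print("releasing:", peak_live(releasing_gen))
```

If this is what happens in `biggen()`, adding `del data` after the `yield` there may be enough to keep only one large list alive at a time. The cleanup in `consumer()` would still be needed, since the consumer holds its own reference.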

How can I save my memory?

Cheers,
Nick

cc: StackOverflow http://stackoverflow.com/q/21787099/194586
