From: Nick Timkovich
Date: Fri, 14 Feb 2014 16:27:58 -0600
Subject: Generator using item[n-1] + item[n] memory
To: python-list@python.org
Newsgroups: comp.lang.python

I have a Python 3.x program that processes several large text files containing sizeable arrays of data, and it occasionally brushes up against the memory limit of my puny workstation. From some basic memory profiling, it seems that when I use a generator, the memory usage of my script balloons to hold consecutive elements, using up to twice the memory I expect.

I made a simple, standalone example to test the generator, and I get similar results in Python 2.7, 3.3, and 3.4.
My test code follows. `memory_usage()` is a modified version of [this function from an SO question](http://stackoverflow.com/a/898406/194586), which uses `/proc/self/status` and agrees with `top` as I watch it. `resource` is probably a more cross-platform method:

###############

import sys, resource, gc, time

def biggen():
    sizes = 1, 1, 10, 1, 1, 10, 10, 1, 1, 10, 10, 20, 1, 1, 20, 20, 1, 1
    for size in sizes:
        data = [1] * int(size * 1e6)
        #time.sleep(1)
        yield data

def consumer():
    for data in biggen():
        rusage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is reported in kilobytes on Linux
        peak_mb = rusage.ru_maxrss/1024.0
        print('Peak: {0:6.1f} MB, Data Len: {1:6.1f} M'.format(
                peak_mb, len(data)/1e6))
        #print(memory_usage())

        data = None  # go
        del data     # away
        gc.collect() # please.

# def memory_usage():
#     """Memory usage of the current process, requires /proc/self/status"""
#     # http://stackoverflow.com/a/898406/194586
#     result = {'peak': 0, 'rss': 0}
#     for line in open('/proc/self/status'):
#         parts = line.split()
#         key = parts[0][2:-1].lower()
#         if key in result:
#             result[key] = int(parts[1])/1024.0
#     return 'Peak: {peak:6.1f} MB, Current: {rss:6.1f} MB'.format(**result)

print(sys.version)
consumer()

###############

In practice I'll process data coming from such a generator loop, saving just what I need and then discarding it.

When I run the above script and two large elements come in series (the data size can be highly variable), it seems like Python computes the next one before freeing the previous one, leading to up to double the memory usage.
    $ python genmem.py
    2.7.3 (default, Sep 26 2013, 20:08:41)
    [GCC 4.6.3]
    Peak:    7.9 MB, Data Len:    1.0 M
    Peak:   11.5 MB, Data Len:    1.0 M
    Peak:   45.8 MB, Data Len:   10.0 M
    Peak:   45.9 MB, Data Len:    1.0 M
    Peak:   45.9 MB, Data Len:    1.0 M
    Peak:   45.9 MB, Data Len:   10.0 M
    #        ^^  not much different versus previous 10M-list
    Peak:   80.2 MB, Data Len:   10.0 M
    #        ^^  same list size, but new memory peak at roughly twice the usage
    Peak:   80.2 MB, Data Len:    1.0 M
    Peak:   80.2 MB, Data Len:    1.0 M
    Peak:   80.2 MB, Data Len:   10.0 M
    Peak:   80.2 MB, Data Len:   10.0 M
    Peak:  118.3 MB, Data Len:   20.0 M
    #        ^^  and again...  (20+10)*c
    Peak:  118.3 MB, Data Len:    1.0 M
    Peak:  118.3 MB, Data Len:    1.0 M
    Peak:  118.3 MB, Data Len:   20.0 M
    Peak:  156.5 MB, Data Len:   20.0 M
    #        ^^  and again. (20+20)*c
    Peak:  156.5 MB, Data Len:    1.0 M
    Peak:  156.5 MB, Data Len:    1.0 M

The crazy belt-and-suspenders-and-duct-tape approach of `data = None`, `del data`, and `gc.collect()` does nothing.

I'm pretty sure the generator itself is not doubling up on memory: otherwise a single large value it yields would raise the peak in the *same iteration* that the large object appeared. The peak only rises for large consecutive objects.

How can I save my memory?

Cheers,
Nick

cc: StackOverflow http://stackoverflow.com/q/21787099/194586
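
P.S. One guess I have not verified against the full script: the generator's own local name `data` may keep the previous list alive until the next list has been built and rebound, which would explain why only consecutive large items raise the peak. Below is a minimal sketch to test that guess. It assumes CPython's reference counting, and `Chunk`, `live`, `holding_gen`, `releasing_gen`, and `run` are hypothetical names made up for illustration; the sizes are kept small on purpose.

```python
import weakref

# Hypothetical bookkeeping: every Chunk that is still alive is in this set.
live = weakref.WeakSet()


class Chunk(object):
    """Stand-in for one large array; a plain class so it can be weakly referenced."""

    def __init__(self, size):
        self.data = [1] * size
        live.add(self)


def holding_gen(sizes, log):
    """Same shape as biggen(): `data` stays bound across the yield."""
    for size in sizes:
        log.append(len(live))   # chunks alive just before building the next one
        data = Chunk(size)      # the previous chunk is only released after this rebinding
        yield data


def releasing_gen(sizes, log):
    """Variant that drops the generator's own reference after each yield."""
    for size in sizes:
        log.append(len(live))
        data = Chunk(size)
        yield data
        del data                # assumption under test: this frees the previous chunk early


def run(gen_func, sizes=(1000, 1000, 1000)):
    """Consume the generator, dropping the consumer's reference immediately."""
    log = []
    for chunk in gen_func(sizes, log):
        del chunk
    return log


if __name__ == "__main__":
    print("holding:  ", run(holding_gen))    # on CPython: [0, 1, 1]
    print("releasing:", run(releasing_gen))  # on CPython: [0, 0, 0]
```

If the first log shows a chunk still alive while the next one is being built and the second log does not, then adding `del data` after the `yield` in `biggen()` would be the first thing to try in the real script.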