From: Dave Angel
Date: Sun, 16 Sep 2012 22:12:46 -0400
To: "Jadhav, Alok"
Cc: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: Python garbage collector/memory manager behaving strangely

On 09/16/2012 09:07 PM, Jadhav, Alok wrote:
> Hi Everyone,
>
> I have a simple program which reads a large file containing few million
> rows, parses each row (`numpy array`) and converts into an array of
> doubles (`python array`) and later writes into an `hdf5 file`. I repeat
> this loop for multiple days. After reading each file, i delete all the
> objects and call garbage collector. When I run the program, First day
> is parsed without any error but on the second day i get `MemoryError`.
> I monitored the memory usage of my program, during first day of parsing,
> memory usage is around **1.5 GB**. When the first day parsing is
> finished, memory usage goes down to **50 MB**. Now when 2nd day starts
> and i try to read the lines from the file I get `MemoryError`. Following
> is the output of the program.
>
> source file extracted at C:\rfadump\au\2012.08.07.txt
> parsing started
> current time: 2012-09-16 22:40:16.829000
> 500000 lines parsed
> 1000000 lines parsed
> 1500000 lines parsed
> 2000000 lines parsed
> 2500000 lines parsed
> 3000000 lines parsed
> 3500000 lines parsed
> 4000000 lines parsed
> 4500000 lines parsed
> 5000000 lines parsed
> parsing done.
> end time is 2012-09-16 23:34:19.931000
> total time elapsed 0:54:03.102000
> repacking file
> done
> > s:\users\aaj\projects\pythonhf\rfadumptohdf.py(132)generateFiles()
> -> while single_date <= self.end_date:
> (Pdb) c
> *** 2012-08-08 ***
> source file extracted at C:\rfadump\au\2012.08.08.txt
> cought an exception while generating file for day 2012-08-08.
> Traceback (most recent call last):
>   File "rfaDumpToHDF.py", line 175, in generateFile
>     lines = self.rawfile.read().split('|\n')
> MemoryError
>
> I am very sure that windows system task manager shows the memory usage
> as **50 MB** for this process. It looks like the garbage collector or
> memory manager for Python is not calculating the free memory correctly.
> There should be lot of free memory but it thinks there is not enough.
>
> Any idea?
>
> Thanks.
>
> Alok Jadhav

Don't blame CPython. You're trying to do a read() of a large file,
which will result in a single large string. Then you split it into lines.
Why not just read it in as lines, in which case the large string isn't
necessary? Take a look at the readlines() function. Chances are that
even that is unnecessary, but I can't tell without seeing more of the
code. That is, replace

    lines = self.rawfile.read().split('|\n')

with

    lines = self.rawfile.readlines()

When a single large item is being allocated, it's not enough to have
sufficient free space; the space also has to be contiguous. After a
program runs for a while, its space naturally gets fragmented more and
more. It's the nature of the C runtime, and CPython is stuck with it.

-- 
DaveA
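
P.S. Two caveats on the readlines() suggestion: it leaves the trailing
'|\n' on every line, and it still builds the complete list of lines in
memory. If the parser can consume one record at a time, something like
the sketch below avoids the single huge allocation altogether. It is
only an illustration under assumptions: iter_records, delimiter and
chunk_size are hypothetical names, not taken from the original program,
and I'm assuming the file is opened in text mode with every record
terminated by '|\n'.

```python
import io


def iter_records(fileobj, delimiter="|\n", chunk_size=1 << 20):
    """Yield delimiter-separated records without reading the whole file.

    Hypothetical helper: only about one chunk plus one partial record
    is held in memory at any time.
    """
    pending = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        parts = pending.split(delimiter)
        # The last piece may be an incomplete record (or half of a
        # delimiter), so hold it back until the next chunk arrives.
        pending = parts.pop()
        for record in parts:
            yield record
    if pending:
        # Final record that was not followed by a delimiter.
        yield pending


# Small in-memory stand-in for the real file, with a tiny chunk size
# so that records and delimiters straddle chunk boundaries.
sample = io.StringIO("a,1|\nb,2|\nc,3|\n")
records = list(iter_records(sample, chunk_size=4))
```

Unlike read().split('|\n'), this does not produce an empty string after
a trailing delimiter. In the real program the loop would be
"for record in iter_records(self.rawfile):" so that each row is parsed
as it is read rather than after the full list has been built.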