Path: csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: RE: Python garbage collector/memory manager behaving strangely
Date: Mon, 17 Sep 2012 10:28:34 +0800
In-Reply-To: <5056871E.7050206@davea.name>
Thread-Topic: Python garbage collector/memory manager behaving strangely
Thread-Index: Ac2UegOwkmMWnCNMTty7kundzNhRDgAAPOhg
References: <CEE8C35195DB944D9C75ABB15A04193B14E77085@EHKG17P32001A.csfb.cs-group.com> <5056871E.7050206@davea.name>
From: "Jadhav, Alok" <alok.jadhav@credit-suisse.com>
To: <d@davea.name>
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.818.1347849124.27098.python-list@python.org>
Lines: 175
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:29357

Thanks, Dave, for the clear explanation. I understand what is going on
now, but I still need some suggestions from you on this.

There are 2 reasons why I was using self.rawfile.read().split('|\n')
instead of self.rawfile.readlines():

- As you have seen, the line separator is not '\n' but '|\n'.
Sometimes the data itself has '\n' characters in the middle of a line,
and the only way to find the true end of a line is to check that the
preceding character is a bar '|'. I was not able to specify the end of
line with the readlines() function, but I could do it with split().
(One hack would be to call readlines() and combine lines until I find
one ending in '|\n'. Is there a cleaner way to do this?)
- Reading the whole file at once and processing it line by line was
much faster. Speed is not a very important issue here, but I think the
time it took to parse the complete file was reduced to one third of the
original time.
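
A minimal sketch of the "read lines and combine them until one ends in
'|\n'" idea from the first point, written as a generator so the whole
file never has to sit in a single string. The name joined_records and
the sample data are hypothetical, and it assumes the file is opened in
text mode so that line endings arrive as '\n'.

```python
import io

def joined_records(lines, terminator="|\n"):
    """Yield logical records from physical lines; a record ends at terminator."""
    buf = []
    for line in lines:
        buf.append(line)
        if line.endswith(terminator):
            # Drop the terminator but keep any embedded '\n' inside the record.
            yield "".join(buf)[:-len(terminator)]
            buf = []
    if buf:
        # Trailing data that has no final terminator.
        yield "".join(buf)

# Illustrative data only: the first record contains an embedded newline.
sample = io.StringIO("a|b\nc|\nd|\n")
records = list(joined_records(sample))
```

With a real file, a loop such as "for record in
joined_records(self.rawfile):" could stand in for the
read().split('|\n') call. Unlike split(), this does not produce a
trailing empty string when the file ends with '|\n'.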

Regards,
Alok


-----Original Message-----
From: Dave Angel [mailto:d@davea.name] 
Sent: Monday, September 17, 2012 10:13 AM
To: Jadhav, Alok
Cc: python-list@python.org
Subject: Re: Python garbage collector/memory manager behaving strangely

On 09/16/2012 09:07 PM, Jadhav, Alok wrote:
> Hi Everyone,
>
>  
>
> I have a simple program which reads a large file containing few million
> rows, parses each row (`numpy array`) and converts into an array of
> doubles (`python array`) and later writes into an `hdf5 file`. I repeat
> this loop for multiple days. After reading each file, i delete all the
> objects and call garbage collector.  When I run the program, First day
> is parsed without any error but on the second day i get `MemoryError`. I
> monitored the memory usage of my program, during first day of parsing,
> memory usage is around **1.5 GB**. When the first day parsing is
> finished, memory usage goes down to **50 MB**. Now when 2nd day starts
> and i try to read the lines from the file I get `MemoryError`. Following
> is the output of the program.
>
>  
>
>  
>
>     source file extracted at C:\rfadump\au\2012.08.07.txt
>
>     parsing started
>
>     current time: 2012-09-16 22:40:16.829000
>
>     500000 lines parsed
>
>     1000000 lines parsed
>
>     1500000 lines parsed
>
>     2000000 lines parsed
>
>     2500000 lines parsed
>
>     3000000 lines parsed
>
>     3500000 lines parsed
>
>     4000000 lines parsed
>
>     4500000 lines parsed
>
>     5000000 lines parsed
>
>     parsing done.
>
>     end time is 2012-09-16 23:34:19.931000
>
>     total time elapsed 0:54:03.102000
>
>     repacking file
>
>     done
>
>     > s:\users\aaj\projects\pythonhf\rfadumptohdf.py(132)generateFiles()
>
>     -> while single_date <= self.end_date:
>
>     (Pdb) c
>
>     *** 2012-08-08 ***
>
>     source file extracted at C:\rfadump\au\2012.08.08.txt
>
>     cought an exception while generating file for day 2012-08-08.
>
>     Traceback (most recent call last):
>
>       File "rfaDumpToHDF.py", line 175, in generateFile
>
>         lines = self.rawfile.read().split('|\n')
>
>     MemoryError
>
>  
>
> I am very sure that windows system task manager shows the memory usage
> as **50 MB** for this process. It looks like the garbage collector or
> memory manager for Python is not calculating the free memory correctly.
> There should be lot of free memory but it thinks there is not enough.
>
>  
>
> Any idea?
>
>  
>
> Thanks.
>
>  
>
>  
>
> Alok Jadhav
>
> CREDIT SUISSE AG
>
> GAT IT Hong Kong, KVAG 67
>
> International Commerce Centre | Hong Kong | Hong Kong
>
> Phone +852 2101 6274 | Mobile +852 9169 7172
>
> alok.jadhav@credit-suisse.com | www.credit-suisse.com
> <http://www.credit-suisse.com/> 
>
>  
>

Don't blame CPython.  You're trying to do a read() of a large file,
which will result in a single large string.  Then you split it into
lines.  Why not just read it in as lines, in which case the large string
isn't necessary?  Take a look at the readlines() function.  Chances are
that even that is unnecessary, but I can't tell without seeing more of
the code.

  lines = self.rawfile.read().split('|\n')

  lines = self.rawfile.readlines()

When a single large item is being allocated, it's not enough to have
sufficient free space; the space also has to be contiguous.  After a
program runs for a while, its space naturally gets more and more
fragmented.  It's the nature of the C runtime, and CPython is stuck
with it.
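
One way to avoid the single large allocation described above, as a
rough sketch rather than a drop-in fix: read the file in fixed-size
chunks and split each chunk on the separator, carrying the incomplete
tail over to the next chunk. The function name, the 1 MiB default chunk
size, and the sample data are all assumptions made for illustration.

```python
import io

def iter_records_chunked(fileobj, sep="|\n", chunk_size=1 << 20):
    """Yield sep-delimited records while holding only one chunk plus a tail."""
    tail = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split(sep)
        # The last piece may be an incomplete record (or half a separator),
        # so keep it and prepend it to the next chunk.
        tail = parts.pop()
        for record in parts:
            yield record
    if tail:
        # Data after the final separator, if any.
        yield tail

# Illustrative data only; a tiny chunk size forces separators to straddle chunks.
sample = io.StringIO("a|b\nc|\nd|e|\nf")
records_chunked = list(iter_records_chunked(sample, chunk_size=3))
```

Because each iteration allocates strings of roughly chunk_size rather
than the size of the whole file, it does not depend on finding one very
large contiguous block of free memory.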



-- 

DaveA

