Date: Mon, 01 Aug 2011 17:29:08 +0200
From: Thomas Jollans <t@jollybox.de>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:5.0) Gecko/20110628 Thunderbird/5.0
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: python reading file memory cost
References: <000f01cc505c$74e01e80$5ea05b80$@com>
In-Reply-To: <000f01cc505c$74e01e80$5ea05b80$@com>
OpenPGP: id=5C8691ED
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1729.1312212539.1164.python-list@python.org>

On 01/08/11 17:05, Tong Zhang wrote:
> Hello, everyone!
>
> I am trying to read a fairly large text file (~1 GB) with Python 2.7. I
> want to read the data into an array. While monitoring memory use, I
> found that it takes more than 6 GB of RAM! So I have two questions:
> 
> 1: How can I estimate the memory cost before running a Python script?
> 
> 2: How can I save RAM without increasing execution time?

How are you reading the file? If you are using file_object.read(),
.readlines(), or similar to read the whole file at once: don't. That
holds the entire file in memory on top of whatever you build from it,
and it probably slows things down too. Usually, the best approach is to
iterate over the file object itself (for line in file_object: # process
line), which reads one line at a time.
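
To make the difference concrete, here is a minimal sketch of iterating over
the file object. The function names and the tiny sample file are made up for
illustration; the point is only that the loop never holds more than one line
at a time.

```python
import os
import tempfile


def count_lines_and_chars(path):
    """Stream a text file line by line; only one line is held at a time."""
    n_lines = 0
    n_chars = 0
    with open(path) as file_object:
        for line in file_object:  # the file object yields one line per step
            n_lines += 1
            n_chars += len(line)
    return n_lines, n_chars


def write_sample_file():
    """Create a small stand-in file (hypothetical data, not the poster's)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("1 2 3\n4 5 6\n7 8 9\n")
    return path
```

Calling count_lines_and_chars(write_sample_file()) walks the file once. Memory
use depends on the longest line, not on the size of the file.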

Without knowing what you're doing with the data (or, what "data" is
here), we can't really do much to help you. My best guess would be that
you're unnecessarily storing the data multiple times.

Perhaps you can use the csv module? Do you really need to hold all the
data in memory all the time, or can you process the data in the order it
is in the file, never actually holding more than one (or a few) records
in memory? With generators, Python has excellent support for working
with streams of data like this, and streaming would save you a lot of RAM.
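
A rough sketch of what that could look like, assuming the file holds
comma-separated numeric columns (the original post does not say what the data
is, so the format, the sample values, and the names read_records and
column_sum are all assumptions):

```python
import csv
import io


def read_records(file_object):
    """Yield one parsed row at a time instead of building a list of all rows."""
    for row in csv.reader(file_object):
        if row:  # skip blank lines
            yield [float(field) for field in row]


def column_sum(records, index):
    """Consume the stream, keeping only a running total in memory."""
    total = 0.0
    for record in records:
        total += record[index]
    return total


# An in-memory stand-in for the real file (hypothetical data).
sample = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n")
result = column_sum(read_records(sample), 1)
```

Because read_records is a generator, each row is parsed, used, and discarded
before the next one is read. With a real file, pass the open file object in
place of sample.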

 - Thomas