Date: Mon, 24 Sep 2012 16:58:51 -0700
From: Junkshops
To: Tim Chase
Cc: Python
Subject: Re: Memory usage per top 10x usage per heapy
Newsgroups: comp.lang.python
References: <983c532f-3ff6-4bd2-bb48-07cf4d065a4b@googlegroups.com> <5060EB2C.6080508@tim.thechases.com>
In-Reply-To: <5060EB2C.6080508@tim.thechases.com>

Hi Tim, thanks for the response.

> - check how you're reading the data: are you iterating over
>   the lines a row at a time, or are you using
>   .read()/.readlines() to pull in the whole file and then
>   operate on that?

I'm using enumerate() on an iterable input (which in this case is the
filehandle).

> - check how you're storing them: are you holding onto more
>   than you think you are?

I've used ipython to look through my data structures (without going
into ungainly detail, 2 dicts with X number of key/value pairs, where
X = number of lines in the file), and everything seems to be working
correctly. Like I say, heapy output looks reasonable - I don't see
anything surprising there. In one dict I'm storing an id string (the
first token in each line of the file) with values as (again, without
going into massive detail) the md5 of the contents of the line. The
second dict has the md5 as the key and an object with __slots__ set
that stores the line number of the file and the type of object that
line represents.

> Would it hurt to switch from a
> dict to store your data (I'm assuming here) to using the
> anydbm module to temporarily persist the large quantity of
> data out to disk in order to keep memory usage lower?

That's the thing though - according to heapy, the memory usage *is*
low and is more or less what I expect. What I don't understand is why
top is reporting such vastly different memory usage. If a memory
profiler is saying everything's ok, it makes it very difficult to
figure out what's causing the problem. Based on heapy, a db based
solution would be serious overkill.

-MrsE

On 9/24/2012 4:22 PM, Tim Chase wrote:
> On 09/24/12 16:59, MrsEntity wrote:
>> I'm working on some code that parses a 500kb, 2M line file line
>> by line and saves, per line, some derived strings into various
>> data structures. I thus expect that memory use should
>> monotonically increase. Currently, the program is taking up so
>> much memory - even on 1/2 sized files - that on 2GB machine I'm
>> thrashing swap.
> It might help to know what comprises the "into various data
> structures". I do a lot of ETL work on far larger files,
> with similar machine specs, and rarely touch swap.
>
>> 2) How can I diagnose (and hopefully fix) what's causing the
>> massive memory usage when it appears, from heapy, that the code
>> is performing reasonably?
> I seem to recall that Python holds on to memory that the VM
> releases, but that it *should* reuse it later. So you'd get
> the symptom of the memory-usage always increasing, never
> decreasing.
>
> Things that occur to me:
>
> - check how you're reading the data: are you iterating over
>   the lines a row at a time, or are you using
>   .read()/.readlines() to pull in the whole file and then
>   operate on that?
>
> - check how you're storing them: are you holding onto more
>   than you think you are? Would it hurt to switch from a
>   dict to store your data (I'm assuming here) to using the
>   anydbm module to temporarily persist the large quantity of
>   data out to disk in order to keep memory usage lower?
>
> Without actual code, it's hard to do a more detailed
> analysis.
>
> -tkc
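
Since no actual code was posted, here is a rough sketch of the two
dicts described in the reply above, reconstructed from the prose only.
The names LineInfo, index_lines, lineno and kind are hypothetical, as
are the choices to hash the whole line, to count lines from 1, and to
skip blank lines; the post does not say how the "type of object" is
derived, so a placeholder string stands in for it.

```python
import hashlib


class LineInfo(object):
    """Hypothetical record: the line number and the kind of object on it."""
    __slots__ = ("lineno", "kind")

    def __init__(self, lineno, kind):
        self.lineno = lineno
        self.kind = kind


def index_lines(lines):
    """Build both dicts from any iterable of text lines, e.g. an open file."""
    id_to_md5 = {}    # first token of the line -> md5 hex digest of the line
    md5_to_info = {}  # md5 hex digest -> LineInfo
    # Assumption: line numbers start at 1.
    for lineno, line in enumerate(lines, 1):
        tokens = line.split()
        if not tokens:
            continue  # assumption: blank lines carry no id and are skipped
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        id_to_md5[tokens[0]] = digest
        # Placeholder kind; the real classification is not described.
        md5_to_info[digest] = LineInfo(lineno, "unknown")
    return id_to_md5, md5_to_info
```

Passing an open file object to index_lines() consumes it one line at a
time, so the whole file is never read into memory up front.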