
Memory usage per top 10x usage per heapy

Started by	MrsEntity <junkshops@gmail.com>
First post	2012-09-24 14:59 -0700
Last post	2012-09-25 18:35 -0500
Articles	20 on this page of 22 — 10 participants


  Memory usage per top 10x usage per heapy MrsEntity <junkshops@gmail.com> - 2012-09-24 14:59 -0700
    Re: Memory usage per top 10x usage per heapy Tim Chase <python.list@tim.thechases.com> - 2012-09-24 18:22 -0500
    Re: Memory usage per top 10x usage per heapy Junkshops <junkshops@gmail.com> - 2012-09-24 16:58 -0700
      Re: Memory usage per top 10x usage per heapy bryanjugglercryptographer@yahoo.com - 2012-09-27 01:00 -0700
      Re: Memory usage per top 10x usage per heapy bryanjugglercryptographer@yahoo.com - 2012-09-27 01:00 -0700
    Re: Memory usage per top 10x usage per heapy Dave Angel <d@davea.name> - 2012-09-24 21:14 -0400
    Re: Memory usage per top 10x usage per heapy Junkshops <junkshops@gmail.com> - 2012-09-24 21:21 -0700
    Re: Memory usage per top 10x usage per heapy Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-09-25 00:41 -0400
    Re: Memory usage per top 10x usage per heapy Tim Chase <python.list@tim.thechases.com> - 2012-09-25 05:51 -0500
    Re: Memory usage per top 10x usage per heapy Dave Angel <d@davea.name> - 2012-09-25 07:06 -0400
    Re: Memory usage per top 10x usage per heapy Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-25 12:10 +0100
    Re: gracious responses (was: Memory usage per top 10x usage per heapy) Tim Chase <python.list@tim.thechases.com> - 2012-09-25 06:40 -0500
      Re: gracious responses (was: Memory usage per top 10x usage per heapy) alex23 <wuwei23@gmail.com> - 2012-09-25 05:44 -0700
        Re: gracious responses Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-25 13:53 +0100
    Re: gracious responses Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-25 12:54 +0100
      Re: gracious responses Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-25 15:17 +0000
    Re: Memory usage per top 10x usage per heapy Dave Angel <d@davea.name> - 2012-09-25 14:50 -0400
    Re: Memory usage per top 10x usage per heapy Junkshops <junkshops@gmail.com> - 2012-09-25 14:02 -0700
    Re: Memory usage per top 10x usage per heapy Junkshops <junkshops@gmail.com> - 2012-09-25 14:35 -0700
    Re: Memory usage per top 10x usage per heapy Tim Chase <python.list@tim.thechases.com> - 2012-09-25 17:10 -0500
    Re: Memory usage per top 10x usage per heapy Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-25 16:09 -0600
    Re: Memory usage per top 10x usage per heapy Tim Chase <python.list@tim.thechases.com> - 2012-09-25 18:35 -0500


#29938 — Memory usage per top 10x usage per heapy

From	MrsEntity <junkshops@gmail.com>
Date	2012-09-24 14:59 -0700
Subject	Memory usage per top 10x usage per heapy
Message-ID	<983c532f-3ff6-4bd2-bb48-07cf4d065a4b@googlegroups.com>

Hi all,

I'm working on some code that parses a 500kb, 2M line file line by line and saves, per line, some derived strings into various data structures. I thus expect that memory use should monotonically increase. Currently, the program is taking up so much memory - even on 1/2 sized files - that on a 2GB machine I'm thrashing swap. What's strange is that heapy (http://guppy-pe.sourceforge.net/) is showing that the code uses about 10x less memory than reported by top, and the heapy data seems consistent with what I was expecting based on the objects the code stores. I tried using memory_profiler (http://pypi.python.org/pypi/memory_profiler) but it didn't really provide any illuminating information. The code does create and discard a number of objects per line of the file, but they should not be stored anywhere, and heapy seems to confirm that. So, my questions are:

1) For those of you kind enough to help me figure out what's going on, what additional data would you like? I didn't want to swamp everyone with the code and heapy/memory_profiler output but I can do so if it's valuable.
2) How can I diagnose (and hopefully fix) what's causing the massive memory usage when it appears, from heapy, that the code is performing reasonably?

Specs: Ubuntu 12.04 in Virtualbox on Win7/64, Python 2.7/64

Thanks very much.


#29958

From	Tim Chase <python.list@tim.thechases.com>
Date	2012-09-24 18:22 -0500
Message-ID	<mailman.1241.1348528879.27098.python-list@python.org>
In reply to	#29938

On 09/24/12 16:59, MrsEntity wrote:
> I'm working on some code that parses a 500kb, 2M line file line
> by line and saves, per line, some derived strings into various
> data structures. I thus expect that memory use should
> monotonically increase. Currently, the program is taking up so
> much memory - even on 1/2 sized files - that on 2GB machine I'm
> thrashing swap.

It might help to know what comprises the "into various data
structures".  I do a lot of ETL work on far larger files,
with similar machine specs, and rarely touch swap.

> 2) How can I diagnose (and hopefully fix) what's causing the
> massive memory usage when it appears, from heapy, that the code
> is performing reasonably?

I seem to recall that Python holds on to memory that the VM
releases, but that it *should* reuse it later.  So you'd get
the symptom of the memory-usage always increasing, never
decreasing.

Things that occur to me:

- check how you're reading the data:  are you iterating over
  the lines a row at a time, or are you using
  .read()/.readlines() to pull in the whole file and then
  operate on that?

- check how you're storing them:  are you holding onto more
  than you think you are?  Would it hurt to switch from a
  dict to store your data (I'm assuming here) to using the
  anydbm module to temporarily persist the large quantity of
  data out to disk in order to keep memory usage lower?
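
A minimal sketch of both suggestions, assuming Python 3, where Python 2's anydbm is available as the standard dbm module. The function name index_lines and the "first token is the ID" layout are hypothetical and not taken from the poster's code. Iterating over the file handle yields one line at a time, and the id -> md5 mapping is kept on disk rather than in a dict.

```python
import dbm
import hashlib


def index_lines(lines, db_path):
    """Store id -> md5(line) in an on-disk dbm file instead of a dict.

    `lines` is any iterable of text lines. Passing an open file handle
    reads lazily, unlike .read() or .readlines().
    """
    count = 0
    with dbm.open(db_path, "c") as db:  # "c": create the file if missing
        for line in lines:
            if not line.strip():
                continue  # skip blank lines, which have no ID token
            record_id = line.split(None, 1)[0]  # first whitespace-separated token
            digest = hashlib.md5(line.encode("utf-8")).hexdigest()
            # dbm keys and values are stored as bytes
            db[record_id.encode("utf-8")] = digest.encode("ascii")
            count += 1
    return count
```

With a real file this would be called as index_lines(open(path), db_path); lookups then go through dbm.open(db_path, "r").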

Without actual code, it's hard to do a more detailed
analysis.

-tkc


#29971

From	Junkshops <junkshops@gmail.com>
Date	2012-09-24 16:58 -0700
Message-ID	<mailman.1247.1348531142.27098.python-list@python.org>
In reply to	#29938

Hi Tim, thanks for the response.

> - check how you're reading the data:  are you iterating over
>    the lines a row at a time, or are you using
>    .read()/.readlines() to pull in the whole file and then
>    operate on that?
I'm using enumerate() on an iterable input (which in this case is the 
filehandle).

> - check how you're storing them:  are you holding onto more
>    than you think you are?
I've used ipython to look through my data structures (without going into 
ungainly detail, 2 dicts with X numbers of key/value pairs, where X = 
number of lines in the file), and everything seems to be working 
correctly. Like I say, heapy output looks reasonable - I don't see 
anything surprising there. In one dict I'm storing an ID string (the 
first token in each line of the file) with values as (again, without 
going into massive detail) the md5 of the contents of the line. The 
second dict has the md5 as the key and an object with __slots__ set that 
stores the line number of the file and the type of object that line 
represents.
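
For readers following along, the layout described above might look roughly like this. It is only a sketch under assumptions: the class name LineContext, the field names, and the constant "record" type are hypothetical, and the real code was not posted.

```python
import hashlib


class LineContext(object):
    """Hypothetical stand-in for the slotted object described above."""
    __slots__ = ("line_number", "kind")

    def __init__(self, line_number, kind):
        self.line_number = line_number
        self.kind = kind


def load(lines):
    """Build the two dicts: id -> md5 and md5 -> context object."""
    id_to_md5 = {}
    md5_to_context = {}
    for line_number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines, which have no ID token
        record_id = line.split(None, 1)[0]  # first token on the line
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        # The same str object is used as a value here and as a key below,
        # so the digest is held in memory only once.
        id_to_md5[record_id] = digest
        md5_to_context[digest] = LineContext(line_number, "record")
    return id_to_md5, md5_to_context
```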

> Would it hurt to switch from a
>    dict to store your data (I'm assuming here) to using the
>    anydbm module to temporarily persist the large quantity of
>    data out to disk in order to keep memory usage lower?
That's the thing though - according to heapy, the memory usage *is* low 
and is more or less what I expect. What I don't understand is why top is 
reporting such vastly different memory usage. If a memory profiler is 
saying everything's ok, it makes it very difficult to figure out what's 
causing the problem. Based on heapy, a db based solution would be 
serious overkill.

-MrsE



#30281

From	bryanjugglercryptographer@yahoo.com
Date	2012-09-27 01:00 -0700
Message-ID	<a1e5b0de-2a05-47dd-bbc5-723a4305888c@googlegroups.com>
In reply to	#29971

MrsEntity wrote:
> Based on heapy, a db based solution would be serious overkill.

I've embraced overkill and my life is better for it. Don't confuse overkill with cost. Overkill is your friend.

The facts of the case: You need to save some derived strings for each of 2M input lines. Even half the input runs over the 2GB RAM in your (virtual) machine. You're using Ubuntu 12.04 in Virtualbox on Win7/64, Python 2.7/64.

That screams "sqlite3". It's overkill, in a good way. It's already there for the importing.
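
A minimal sketch of what the sqlite3 route could look like, assuming the same id/md5/line-number data described earlier in the thread. The table name line_index, the column names, and the function build_index are hypothetical.

```python
import hashlib
import sqlite3


def build_index(lines, db_path=":memory:"):
    """Load id, md5 and line number for each line into an SQLite table.

    Pass a file name as db_path to keep the index on disk; the default
    in-memory database is only convenient for trying the code out.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS line_index ("
        "record_id TEXT PRIMARY KEY, "
        "md5 TEXT NOT NULL, "
        "line_number INTEGER NOT NULL)"
    )
    # A generator, so rows are produced one at a time as lines are read.
    rows = (
        (line.split(None, 1)[0],
         hashlib.md5(line.encode("utf-8")).hexdigest(),
         number)
        for number, line in enumerate(lines, 1)
        if line.strip()
    )
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(
            "INSERT OR REPLACE INTO line_index VALUES (?, ?, ?)", rows
        )
    return conn
```

A lookup is then an ordinary query, for example conn.execute("SELECT md5 FROM line_index WHERE record_id = ?", (some_id,)).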

Other approaches? You could try to keep everything in RAM, but use less. Tim Chase pointed out the memory-efficiency of named tuples. You could save some more by switching to Win7/32, Python 2.7/32; VirtualBox makes trying such alternatives quick and easy.

Or you could add memory. Compared to good old 32-bit, 64-bit operation consumes significantly more memory and supports vastly more memory. There's a bit of a mis-match in a 64-bit system with just 2GB of RAM. I know, sounds weird, "just" two billion bytes of RAM. I'll rephrase: just ten dollars worth of RAM. Less if you buy it where I do.

I don't know why the memory profiling tools are misleading you. I can think of plausible explanations, but they'd just be guesses. There's nothing all that surprising in running out of RAM, given what you've explained. A couple K per line is easy to burn. 

-Bryan



#29987

From	Dave Angel <d@davea.name>
Date	2012-09-24 21:14 -0400
Message-ID	<mailman.1260.1348535702.27098.python-list@python.org>
In reply to	#29938

On 09/24/2012 05:59 PM, MrsEntity wrote:
> Hi all,
>
> I'm working on some code that parses a 500kb, 2M line file 

Just curious;  which is it, two million lines, or half a million bytes?

> line by line and saves, per line, some derived strings into various data structures. I thus expect that memory use should monotonically increase. Currently, the program is taking up so much memory - even on 1/2 sized files - that on 2GB machine 

which machine is 2gb, the Windows machine, or the VM?  You could get
thrashing at either level.

> I'm thrashing swap. What's strange is that heapy (http://guppy-pe.sourceforge.net/) is showing that the code uses about 10x less memory than reported by top, and the heapy data seems consistent with what I was expecting based on the objects the code stores. I tried using memory_profiler (http://pypi.python.org/pypi/memory_profiler) but it didn't really provide any illuminating information. The code does create and discard a number of objects per line of the file, but they should not be stored anywhere, and heapy seems to confirm that. So, my questions are:
>
> 1) For those of you kind enough to help me figure out what's going on, what additional data would you like? I didn't want swamp everyone with the code and heapy/memory_profiler output but I can do so if it's valuable.
> 2) How can I diagnose (and hopefully fix) what's causing the massive memory usage when it appears, from heapy, that the code is performing reasonably?
>
> Specs: Ubuntu 12.04 in Virtualbox on Win7/64, Python 2.7/64
>
> Thanks very much.

Tim raised most of my concerns, but I would point out that just because
you free up memory within Python doesn't mean it gets released
back to the system.  The C runtime manages its own heap, and is pretty
persistent about hanging onto memory once obtained.  It's not normally a
problem, since most small blocks are reused.  But it can get
fragmented.  And I have no idea how well VirtualBox maps the Linux
memory map into the Windows one.
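
One way to see both views side by side from inside the process, sketched under assumptions: a Unix system (the resource module is not available on Windows) and Python 3. The helper names are hypothetical. Note that ru_maxrss is the peak resident size, which is related to but not the same number as the current RES column in top.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of this process, as the OS accounts for it.

    Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024


def shallow_size(objects):
    """Sum of sys.getsizeof() over the given objects.

    This is the interpreter's per-object view and ignores allocator
    overhead, fragmentation, and memory freed but not returned to the OS.
    """
    return sum(sys.getsizeof(obj) for obj in objects)
```

Comparing shallow_size(...) over the stored objects with peak_rss_bytes() at the same point in the run gives a rough measure of how much of the process size is not accounted for by live objects.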

-- 

DaveA


#30005

From	Junkshops <junkshops@gmail.com>
Date	2012-09-24 21:21 -0700
Message-ID	<mailman.1267.1348546870.27098.python-list@python.org>
In reply to	#29938

> Just curious;  which is it, two million lines, or half a million bytes?
I have, in fact, this very afternoon, invented a means of writing a 
carriage return character using only 2 bits of information. I am 
prepared to sell licenses to this revolutionary technology for the low 
price of $29.95 plus tax.

Sorry, that should've been a 500Mb, 2M line file.

> which machine is 2gb, the Windows machine, or the VM?
VM. Winders is 4gb.

> ...but I would point out that just because
> you free up the memory from the Python doesn't mean it gets released
> back to the system.  The C runtime manages its own heap, and is pretty
> persistent about hanging onto memory once obtained.  It's not normally a
> problem, since most small blocks are reused.  But it can get
> fragmented.  And i have no idea how well Virtual Box maps the Linux
> memory map into the Windows one.
Right, I understand that - but what's confusing me is that, given the 
memory use is (I assume) monotonically increasing, the code should never 
use more than what's reported by heapy once all the data is loaded into 
memory, given that memory released by the code to the Python runtime is 
reused. To the best of my ability to tell I'm not storing anything I 
shouldn't, so the only thing I can think of is that all the object 
creation and destruction, for some reason, is preventing reuse of 
memory. I'm at a bit of a loss regarding what to try next.

Cheers, MrsE



#30009

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2012-09-25 00:41 -0400
Message-ID	<mailman.1270.1348548084.27098.python-list@python.org>
In reply to	#29938

On Mon, 24 Sep 2012 14:59:47 -0700 (PDT), MrsEntity
<junkshops@gmail.com> declaimed the following in
gmane.comp.python.general:

> Hi all,
> 
> I'm working on some code that parses a 500kb, 2M line file line by line and saves, per line, some derived strings

	Pardon? A 2million line file will contain, at the minimum 2million
line-end characters. That's four times 500kB just in the line-ends,
ignoring any data.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/


#30059

From	Tim Chase <python.list@tim.thechases.com>
Date	2012-09-25 05:51 -0500
Message-ID	<mailman.1311.1348570196.27098.python-list@python.org>
In reply to	#29938

On 09/24/12 23:41, Dennis Lee Bieber wrote:
> On Mon, 24 Sep 2012 14:59:47 -0700 (PDT), MrsEntity
> <junkshops@gmail.com> declaimed the following in
> gmane.comp.python.general:
> 
>> Hi all,
>>
>> I'm working on some code that parses a 500kb, 2M line file line by line and saves, per line, some derived strings
> 
> 	Pardon? A 2million line file will contain, at the minimum 2million
> line-end characters. That four times 500kB just in the line-ends,
> ignoring any data.

As corrected later in the thread, MrsEntity writes

"""
I have, in fact, this very afternoon, invented a means of writing a
carriage return character using only 2 bits of information. I am
prepared to sell licenses to this revolutionary technology for the
low price of $29.95 plus tax.

Sorry, that should've been a 500Mb, 2M line file.
"""

If only other unnamed persons on the list were so gracious rather
than turning the flame-dial to 11.

I hope that when people come to the list, *this* is what they see,
laugh, and want to participate.

Although, MrsEntity could be zombie David A. Huffman, whose encoding
scheme actually *can* store 2M lines in 500kb :-)

-tkc


#30065

From	Dave Angel <d@davea.name>
Date	2012-09-25 07:06 -0400
Message-ID	<mailman.1318.1348571218.27098.python-list@python.org>
In reply to	#29938

On 09/25/2012 12:21 AM, Junkshops wrote:
>> Just curious;  which is it, two million lines, or half a million bytes?
<snip>
> 
> Sorry, that should've been a 500Mb, 2M line file.
> 
>> which machine is 2gb, the Windows machine, or the VM?
> VM. Winders is 4gb.
> 
>> ...but I would point out that just because
>> you free up the memory from the Python doesn't mean it gets released
>> back to the system.  The C runtime manages its own heap, and is pretty
>> persistent about hanging onto memory once obtained.  It's not normally a
>> problem, since most small blocks are reused.  But it can get
>> fragmented.  And i have no idea how well Virtual Box maps the Linux
>> memory map into the Windows one.
> Right, I understand that - but what's confusing me is that, given the
> memory use is (I assume) monotonically increasing, the code should never
> use more than what's reported by heapy once all the data is loaded into
> memory, given that memory released by the code to the Python runtime is
> reused. To the best of my ability to tell I'm not storing anything I
> shouldn't, so the only thing I can think of is that all the object
> creation and destruction, for some reason, it preventing reuse of
> memory. I'm at a bit of a loss regarding what to try next.

I'm not familiar with heapy, but perhaps it's missing something there.
I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file.  You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way.  I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.

Perhaps one way to save space would be to use a long to store those md5
values.  You'd have to measure it, but I suspect it'd help (at the cost
of lots of extra hexlify-type calls).  Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.
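
The size difference is easy to measure. A minimal sketch, assuming CPython 3, where the Python 2 long mentioned above is simply int; the sample line is made up. It compares the per-object size of the same MD5 held as a hex string, as raw bytes, and as an integer.

```python
import hashlib
import sys

line = b"example line\n"  # made-up sample input

hex_digest = hashlib.md5(line).hexdigest()  # 32-character str
raw_digest = hashlib.md5(line).digest()     # 16 raw bytes
int_digest = int(hex_digest, 16)            # arbitrary-precision int

# Per-object sizes as reported by the interpreter (CPython-specific).
sizes = {
    "hex str": sys.getsizeof(hex_digest),
    "raw bytes": sys.getsizeof(raw_digest),
    "int": sys.getsizeof(int_digest),
}

# Converting back to the familiar hex form when it is needed for display.
round_trip = "%032x" % int_digest
```

Whichever form is chosen, the saving only holds if the same object is used in both dicts; building the value twice stores it twice.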

-- 

DaveA


#30066

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-25 12:10 +0100
Message-ID	<mailman.1320.1348571392.27098.python-list@python.org>
In reply to	#29938

On 25/09/2012 11:51, Tim Chase wrote:
[snip]
>
> If only other unnamed persons on the list were so gracious rather
> than turning the flame-dial to 11.
>

Oh heck what have I said this time?

>
> -tkc

-- 
Cheers.

Mark Lawrence.


#30069 — Re: gracious responses (was: Memory usage per top 10x usage per heapy)

From	Tim Chase <python.list@tim.thechases.com>
Date	2012-09-25 06:40 -0500
Subject	Re: gracious responses (was: Memory usage per top 10x usage per heapy)
Message-ID	<mailman.1325.1348573147.27098.python-list@python.org>
In reply to	#29938

On 09/25/12 06:10, Mark Lawrence wrote:
> On 25/09/2012 11:51, Tim Chase wrote:
>> If only other unnamed persons on the list were so gracious rather
>> than turning the flame-dial to 11.
>>
> 
> Oh heck what have I said this time?

You'd *like* to take credit?  ;-)

Nah, not you or any of the regulars here.  The comment was regarding
the flame-fest that's been running in some parallel threads over the
last ~12hr or so.  Mostly instigated by one person with a
particularly quick trigger, vitriolic tongue, and a disregard for
pythonic code.

-tkc


#30073 — Re: gracious responses (was: Memory usage per top 10x usage per heapy)

From	alex23 <wuwei23@gmail.com>
Date	2012-09-25 05:44 -0700
Subject	Re: gracious responses (was: Memory usage per top 10x usage per heapy)
Message-ID	<d307cdde-3db7-4dd8-9b27-92324aab5449@im7g2000pbc.googlegroups.com>
In reply to	#30069

On Sep 25, 9:39 pm, Tim Chase <python.l...@tim.thechases.com> wrote:
> Mostly instigated by one person with a
> particularly quick trigger, vitriolic tongue, and a disregard for
> pythonic code.

I'm sorry. I'll get me coat.


#30075 — Re: gracious responses

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-25 13:53 +0100
Subject	Re: gracious responses
Message-ID	<mailman.1331.1348577591.27098.python-list@python.org>
In reply to	#30073

On 25/09/2012 13:44, alex23 wrote:
> On Sep 25, 9:39 pm, Tim Chase <python.l...@tim.thechases.com> wrote:
>> Mostly instigated by one person with a
>> particularly quick trigger, vitriolic tongue, and a disregard for
>> pythonic code.
>
> I'm sorry. I'll get me coat.
>

Oi, back of the queue if you don't mind :)

-- 
Cheers.

Mark Lawrence.


#30070 — Re: gracious responses

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-09-25 12:54 +0100
Subject	Re: gracious responses
Message-ID	<mailman.1326.1348573932.27098.python-list@python.org>
In reply to	#29938

On 25/09/2012 12:40, Tim Chase wrote:
> On 09/25/12 06:10, Mark Lawrence wrote:
>> On 25/09/2012 11:51, Tim Chase wrote:
>>> If only other unnamed persons on the list were so gracious rather
>>> than turning the flame-dial to 11.
>>>
>>
>> Oh heck what have I said this time?
>
> You'd *like* to take credit?  ;-)
>
> Nah, not you or any of the regulars here.  The comment was regarding
> the flame-fest that's been running in some parallel threads over the
> last ~12hr or so.  Mostly instigated by one person with a
> particularly quick trigger, vitriolic tongue, and a disregard for
> pythonic code.
>
> -tkc
>
>

Well thank goodness for that.  Of course the person to whom you've 
alluded has been defended over on the tutor mailing list, seriously, and 
as I've said elsewhere after referring to my family as pigs!!!

-- 
Cheers.

Mark Lawrence.


#30100 — Re: gracious responses

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-25 15:17 +0000
Subject	Re: gracious responses
Message-ID	<5061cb1f$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#30070

On Tue, 25 Sep 2012 12:54:05 +0100, Mark Lawrence wrote:

> Well thank goodness for that.  Of course the person to whom you've
> alluded has been defended over on the tutor mailing list, seriously, and
> as I've said elsewhere after referring to my family as pigs!!!

Since pigs are at least as intelligent as dogs, and in their natural 
state nowhere near as filthy as the stereotype of the pig in a sty, that 
isn't as big an insult as it was intended.

-- 
Steven


#30119

From	Dave Angel <d@davea.name>
Date	2012-09-25 14:50 -0400
Message-ID	<mailman.1367.1348599064.27098.python-list@python.org>
In reply to	#29938

On 09/25/2012 01:39 PM, Junkshops wrote:

Procedural point:  I know you're trying to conform to the standard that
this mailing list uses, but you're off a little, and it's distracting.
It's also probably more work for you, and certainly for us.

You need an attribution in front of the quoted portions.  This next
section is by me, but you don't say so.  That's because you copy/pasted
it from elsewhere in the reply, and didn't copy the "... Dave Angel
wrote" part.

Much easier is to take the reply, and remove the parts you're not going
to respond to, putting your own comments in between the parts that are
left (as you're doing).  And generally, there's no need for anything
after your last remark, so you just delete up to your signature, if any.

>> I'm a bit surprised you aren't beyond the 2gb limit, just with the
>> structures you describe for the file.  You do realize that each object
>> has quite a few bytes of overhead, so it's not surprising to use several
>> times the size of a file, to store the file in an organized way.
> I did some back of the envelope calcs which more or less agreed with
> heapy. The code stores 1 string, which is, on average, about 50 chars or
> so, and one MD5 hex string per line of code. There's about 40 bytes or
> so of overhead per string per sys.getsizeof(). I'm also storing an int
> (24b) and a <10 char string in an object with __slots__ set. Each
> object, per heapy (this is one area where I might be underestimating
> things) takes 64 bytes plus instance variable storage, so per line:
> 
> 50 + 32 + 10 + 3 * 40 + 24 + 64 = 300 bytes per line * 2M lines = ~600MB
> plus some memory for the dicts, which is about what heapy is reporting
> (note I'm currently not actually running all 2M lines, I'm just running
> subsets for my tests).
> 
> Is there something I'm missing? Here's the heapy output after loading
> ~300k lines:
> 
> Partition of a set of 1199849 objects. Total size = 89965376 bytes.
>  Index   Count   %      Size   %  Cumulative    %  Kind
>      0  599999  50  38399920  43    38399920   43  str
>      1       5   0  25167224  28    63567144   71  dict
>      2  299998  25  19199872  21    82767016   92  0xa13330
>      3  299836  25   7196064   8    89963080  100  int
>      4       4   0      1152   0    89964232  100  collections.defaultdict
> 
> Note that 3 of the dicts are empty. I assume 0xa13330 is the
> address of the object. I'd actually expect to see 900k strings, but the
> <10 char string is always the same in this case so perhaps the runtime
> is using the same object...? 

CPython currently interns short strings that conform to variable name
rules.  You can't count on that behavior (and I probably don't have it
quite right anyway), but it's probably what you're seeing.
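
If sharing equal strings matters, it can be requested explicitly rather than left to the implementation. A minimal sketch, assuming Python 3, where the function is sys.intern (in Python 2 it is the builtin intern); the sample strings are made up.

```python
import sys

part = "rec"
a = "".join([part, "ord"])  # built at run time
b = "".join([part, "ord"])  # equal value, built separately

equal = (a == b)  # same characters

# sys.intern() returns the one canonical object for a given string value,
# so both names end up referring to the same object.
interned_a = sys.intern(a)
interned_b = sys.intern(b)
shared = interned_a is interned_b
```

Whether a and b were already the same object before the explicit call is an implementation detail and should not be relied on.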

> At this point, top reports python as using
> 1.1g of virt and 1.0g of res.
> 
>> I also
>> wonder if heapy has been written to take into account the larger size of
>> pointers in a 64bit build.
> That I don't know, but that would only explain, at most, a 2x increase
> in memory over the heapy report, wouldn't it? Not the ~10x I'm seeing.
> 
>> Another thing is to make sure
>> that the md5 object used in your two maps is the same object, and not
>> just one with the same value.
> That's certainly the way the code is written, and heapy seems to confirm
> that the strings aren't duplicated in memory.
> 
> Thanks for sticking with me on this,

You're certainly welcome.  I suspect that heapy has some limitation in
its reporting, and that's what explains the discrepancy.  Oscar points out that
you have a bunch of exception objects, which certainly looks suspicious.
 If you're somehow storing one of these per line, and heapy isn't
reporting them, that could be a large discrepancy.

He also points out that you have a couple of lambda functions stored in
one of your dictionaries.  A lambda function can be an expensive
proposition if you are building millions of them.  So can nested
functions with non-local variable references, in case you have any of those.

Oscar also reminds you of what I suggested for the md5 fields.  Stored
as ints instead of hex strings could save a good bit.  Just remember to
use the same one for both dicts, as you've been doing with the strings.

Other than that, I'm stumped.

-- 

DaveA


#30127

From	Junkshops <junkshops@gmail.com>
Date	2012-09-25 14:02 -0700
Message-ID	<mailman.1377.1348606988.27098.python-list@python.org>
In reply to	#29938

On 9/25/2012 11:50 AM, Dave Angel wrote:
> I suspect that heapy has some limitation in its reporting, and that's 
> what the discrepancy.

That would be my first suspicion as well - except that heapy's results 
agree so well with what I expect, and I can't think of any reason I'd be 
using 10x more memory. If heapy is wrong, then I need to try and figure 
out what's using up all that memory some other way... but I don't know 
what that way might be.

> ... can be an expensive proposition if you are building millions of 
> them. So can nested functions with non-local variable references, in 
> case you have any of those. 

Not as far as I know.

Cheers, MrsEntity


#30128

From	Junkshops <junkshops@gmail.com>
Date	2012-09-25 14:35 -0700
Message-ID	<mailman.1379.1348608941.27098.python-list@python.org>
In reply to	#29938

On 9/25/2012 2:17 PM, Oscar Benjamin wrote:
> I don't know whether it would be better or worse but it might be worth 
> seeing what happens if you replace the FileContext objects with tuples.
I originally used a string, and it was slightly better since you don't 
have the object overhead, but I wanted to code to an interface for the 
context information so started a Context abstract class that FileContext 
inherits from (both have __slots__ set). Using an object without 
__slots__ set was a disaster. However, the difference between a string 
and an object with __slots__ isn't severe.
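
The effect of __slots__ can be checked directly. A minimal sketch, assuming CPython; the class and field names are hypothetical stand-ins, since the real Context and FileContext classes were not posted.

```python
import sys


class PlainContext(object):
    """No __slots__: every instance carries its own __dict__."""

    def __init__(self, line_number, kind):
        self.line_number = line_number
        self.kind = kind


class SlottedContext(object):
    """With __slots__: attributes live in fixed slots, no per-instance dict."""
    __slots__ = ("line_number", "kind")

    def __init__(self, line_number, kind):
        self.line_number = line_number
        self.kind = kind


plain = PlainContext(1, "record")
slotted = SlottedContext(1, "record")

# For the plain instance the attribute dict has to be counted as well.
plain_size = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_size = sys.getsizeof(slotted)
```

Note that a subclass only stays dict-free if it declares __slots__ itself, which matches the poster's remark that both classes have it set.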

>
> I can't see anything wrong with that but then I'm not sure if the 
> lambda function always keeps its frame alive. If there's only that one 
> line in the __init__ function then I'd expect it to be fine.

That's it, I'm afraid.

>
> Perhaps you could see what objgraph comes up with:
> http://pypi.python.org/pypi/objgraph
>
> So far as I know objgraph doesn't tell you how big objects are but it 
> does give a nice graphical representation of which objects are alive 
> and which other objects they are referenced by. You might find that 
> some other object is kept alive that you didn't expect.
>
I'll give it a shot and see what happens.

Cheers, MrsEntity


#30129

From	Tim Chase <python.list@tim.thechases.com>
Date	2012-09-25 17:10 -0500
Message-ID	<mailman.1380.1348610978.27098.python-list@python.org>
In reply to	#29938

On 09/25/12 16:17, Oscar Benjamin wrote:
> I don't know whether it would be better or worse but it might be
> worth seeing what happens if you replace the FileContext objects
> with tuples.

If tuples provide a savings but you find them opaque, you might also
consider named-tuples for clarity.
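
A minimal sketch of the named-tuple variant. The type name FileContext comes from the thread, but the field names are hypothetical.

```python
from collections import namedtuple

# Field names are assumptions; the thread only says the object holds a
# line number and the type of object the line represents.
FileContext = namedtuple("FileContext", ["line_number", "kind"])

ctx = FileContext(line_number=42, kind="record")

by_name = ctx.line_number  # readable attribute access
by_index = ctx[1]          # still an ordinary tuple underneath
```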

-tkc
