

Groups > comp.lang.python > #102888 > unrolled thread

What is heating the memory here? hashlib?

Started by: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
First post: 2016-02-13 19:29 +0000
Last post: 2016-02-15 17:29 +0000
Articles 13 — 5 participants



Contents

  What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-13 19:29 +0000
    Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-13 22:26 +0000
      Re: What is heating the memory here? hashlib? Chris Angelico <rosuav@gmail.com> - 2016-02-14 09:45 +1100
        Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-14 01:44 +0000
          Re: What is heating the memory here? hashlib? Chris Angelico <rosuav@gmail.com> - 2016-02-14 13:01 +1100
    Re: What is heating the memory here? hashlib? Steven D'Aprano <steve@pearwood.info> - 2016-02-14 13:21 +1100
      Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-15 08:05 +0000
    Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-14 07:04 +0000
      Re: What is heating the memory here? hashlib? INADA Naoki <songofacandy@gmail.com> - 2016-02-14 18:49 +0900
        Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-15 07:38 +0000
      Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-15 02:21 +0000
        Re: What is heating the memory here? hashlib? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2016-02-15 09:12 +0100
          Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-15 17:29 +0000

#102888 — What is heating the memory here? hashlib?

From: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date: 2016-02-13 19:29 +0000
Subject: What is heating the memory here? hashlib?
Message-ID: <n9o06t$1hjo$1@gioia.aioe.org>

Hello all.

I'm running into a very strange (for me at least) problem.

	def getHash(self):
		bfsz=File.blksz
		h=hashlib.sha256()
		hu=h.update
		with open(self.getPath(),'rb') as f:
			f.seek(File.hdrsz)	# Skip header
			b=f.read(bfsz)
			while len(b)>0:
				hu(b)
				b=f.read(bfsz)
		fhash=h.digest()
		return fhash

hdrsz is always 4K here. All files are greater than 4K.

If I use a 40MB bfsz this takes all my memory very quickly. After a few
hundred files it begins to swap, ending up with the program being
killed (BTW, I'm using Linux, Kubuntu 14.04).

If I reduce bfsz to 1MB it successfully completes my full test (~100000
files), reaching about 6GB of memory.

If I further reduce bfsz to 16KB there is no noticeable memory taken!!

I have tried the following code, but it didn't fix the problem:

	def getHash(self):
		bfsz=File.blksz
		h=hashlib.sha256()
		hu=h.update
		with open(self.getPath(),'rb') as f:
			husz=8192
			f.seek(File.hdrsz)	# Skip header
			b=f.read(bfsz)
			while len(b)>0:
				for i in range(0,len(b),husz):
					hu(b[i:i+husz])
				b=f.read(bfsz)
		fhash=h.digest()
		return fhash

What is wrong here?!

Thanks for any help/comments.
Paulo



#102893

From: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date: 2016-02-13 22:26 +0000
Message-ID: <n9oaja$39n$1@gioia.aioe.org>
In reply to: #102888

I meant eating! :-)



#102894

From: Chris Angelico <rosuav@gmail.com>
Date: 2016-02-14 09:45 +1100
Message-ID: <mailman.99.1455403508.22075.python-list@python.org>
In reply to: #102893

On Sun, Feb 14, 2016 at 9:26 AM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
> I meant eating! :-)

Heh, "heating" works too - the more you use memory, the more it heats up :)

I'm assuming this is inside "class File:" and you have class members
for your constants like header size? There's no context for the name
"File" otherwise.

What happens if, after hashing each file (and returning from this
function), you call gc.collect()? If that reduces your RAM usage, you
have reference cycles somewhere.
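
That check can be sketched as follows. This is only an illustration: the `Node` class is hypothetical and deliberately builds a reference cycle, standing in for whatever the real program creates. `gc.collect()` returns the number of unreachable objects it found, so a clearly nonzero result after a batch of work hints that cycles are being created.

```python
import gc

class Node:
    """Hypothetical object that refers to itself, forming a reference cycle."""
    def __init__(self):
        self.me = self

def make_cycles(n):
    for _ in range(n):
        Node()  # unreachable right away, but only the cycle collector can free it

gc.collect()          # start from a clean slate
gc.disable()          # keep automatic collection out of the demonstration
make_cycles(1000)
found = gc.collect()  # number of unreachable objects found in this pass
gc.enable()
print("unreachable objects found:", found)
```

If the number stays at or near zero after hashing a batch of files, reference cycles are probably not the explanation.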

ChrisA



#102897

From: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date: 2016-02-14 01:44 +0000
Message-ID: <n9om6l$hkt$1@gioia.aioe.org>
In reply to: #102894

At 22:45 on 13-02-2016, Chris Angelico wrote:
> On Sun, Feb 14, 2016 at 9:26 AM, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
>> I meant eating! :-)
> 
> Heh, "heating" works too - the more you use memory, the more it heats up :)
:-) It is heating my head!
...

> 
> What happens if, after hashing each file (and returning from this
> function), you call gc.collect()? If that reduces your RAM usage, you
> have reference cycles somewhere.
> 
I have used gc and del. No luck.

The most probable cause seems to be hashlib not correctly handling big
buffer updates. I am working on one computer and testing on another. As for
the second version, maybe I somehow forgot to transfer the change to the
other computer. Unlikely but possible.

Anyway it is doing its job right now with bfsz=16KB (this takes a few
hours). No memory leaks anymore.
I'll address this problem in the near future, maybe when I move to the
new Kubuntu 16.04 LTS. This will bring new SW releases and the problem
may have been fixed.

Thanks
Paulo



#102898

From: Chris Angelico <rosuav@gmail.com>
Date: 2016-02-14 13:01 +1100
Message-ID: <mailman.102.1455415322.22075.python-list@python.org>
In reply to: #102897

On Sun, Feb 14, 2016 at 12:44 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
>> What happens if, after hashing each file (and returning from this
>> function), you call gc.collect()? If that reduces your RAM usage, you
>> have reference cycles somewhere.
>>
> I have used gc and del. No luck.
>
> The most probable cause seems to be hashlib not correctly handling big
> buffers updates. I am working in a computer and testing in another. For
> the second part may be somehow I forgot to transfer the change to the
> other computer. Unlikely but possible.

I'd like to see the problem boiled down to just the hashlib calls.
Something like this:

import hashlib
data = b"*" * 4*1024*1024
lastdig = None
while "simulating files":
    h = hashlib.sha256()
    hu = h.update
    for chunk in range(100):
        hu(data)
    dig = h.hexdigest()
    if lastdig is None:
        lastdig = dig
        print("Digest:",dig)
    else:
        if lastdig != dig:
            print("Digest fail!")

Running this on my system (Python 3.6 on Debian Linux) produces a
long-running process with stable memory usage, which is exactly what
I'd expect. Even using different data doesn't change that:

import hashlib
import itertools
byte = itertools.count()
data = b"*" * 4*1024*1024
while "simulating files":
    h = hashlib.sha256()
    hu = h.update
    for chunk in range(100):
        hu(data + bytes([next(byte)&255]))
    dig = h.hexdigest()
    print("Digest:",dig)

Somewhere between my code and yours is something that consumes all
that memory. Can you neuter the actual disk reading (replacing it with
constants, like this) and make a complete and shareable program that
leaks all that memory?

ChrisA



#102899

From: Steven D'Aprano <steve@pearwood.info>
Date: 2016-02-14 13:21 +1100
Message-ID: <56bfe49e$0$1587$c3e8da3$5496439d@news.astraweb.com>
In reply to: #102888

On Sun, 14 Feb 2016 06:29 am, Paulo da Silva wrote:

> Hello all.
> 
> I'm running in a very strange (for me at least) problem.
> 
> def getHash(self):
>     bfsz=File.blksz
>     h=hashlib.sha256()
>     hu=h.update
>     with open(self.getPath(),'rb') as f:
>         f.seek(File.hdrsz)    # Skip header
>         b=f.read(bfsz)
>         while len(b)>0:
>             hu(b)
>             b=f.read(bfsz)
>     fhash=h.digest()
>     return fhash

This is a good, and tricky, question! Unfortunately, this sort of
performance issue may depend on the specific details of your system.

You can start by telling us what version of Python you are running. You've
already said you're running on Kubuntu, which makes it Linux. Is that a
32-bit or 64-bit version?


Next, let's see if we can simplify the code and make it runnable by anyone,
in the spirit of http://www.sscce.org/



import hashlib
K = 1024
M = 1024*K


def get_hash(pathname, size):
    h = hashlib.sha256()
    with open(pathname, 'rb') as f:
        f.seek(4*K)
        b = f.read(size)
        while b:
            h.update(b)
            b = f.read(size)
    return h.digest()


Does this simplified version demonstrate the same problem?

What happens if you eliminate the actual hashing?


def get_hash(pathname, size):
    with open(pathname, 'rb') as f:
        f.seek(4*K)
        b = f.read(size)
        while b:
            b = f.read(size)
    return "1234"*16


This may allow you to determine whether the problem lies in *reading* the
files or *hashing* the files.
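
One way to put numbers on either variant is to watch the process's peak resident set size around each call. The sketch below is an assumption-laden illustration rather than a benchmark: it relies on the Unix-only `resource` module, uses a hypothetical 8 MiB temporary file in place of the real data, and skips the header seek for brevity. Since `ru_maxrss` is a high-water mark, only growth beyond the previous peak shows up.

```python
import hashlib
import os
import resource  # Unix only
import tempfile

def peak_rss():
    """Peak resident set size so far (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def hash_file(pathname, size):
    h = hashlib.sha256()
    with open(pathname, 'rb') as f:
        b = f.read(size)
        while b:
            h.update(b)
            b = f.read(size)
    return h.digest()

# Hypothetical test file: 8 MiB of zero bytes in a temporary location.
payload = bytes(8 * 1024 * 1024)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    for size in (16 * 1024, 1024 * 1024):
        before = peak_rss()
        digest = hash_file(path, size)
        print("block size", size, "peak RSS grew by", peak_rss() - before)
finally:
    os.remove(path)
```

Swapping `hash_file` for the read-only variant and comparing the growth gives a rough idea of which half is responsible.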


Be warned: if you read from the same file over and over again, Linux will
cache that file, and your tests will not reflect the behaviour when you
read thousands of different files from disk rather than from memory cache.

What sort of media are you reading from?

- hard drive?
- flash drive or USB stick?
- solid state disk?
- something else?

They will all have different read characteristics.

What happens when you call f.read(size)? By default, Python uses the
following buffering strategy for binary files:


    * Binary files are buffered in fixed-size chunks; the size of the buffer
      is chosen using a heuristic trying to determine the underlying
      device's "block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
      On many systems, the buffer will typically be 4096 or 8192 bytes long.


See help(open).

That's your first clue that, perhaps, you should be reading in relatively
small blocks, more like 4K than 4MB. Sure enough, a quick bit of googling
shows that typically you should read from files in small-ish chunks, and
that trying to read in large chunks is often counter-productive:

https://duckduckgo.com/html/?q=file+read+buffer+size

The first three links all talk about optimal sizes being measured in small
multiples of 4K, not 40MB.

You can try to increase the system buffer, by changing the "open" line to:

    with open(pathname, 'rb', buffering=40*M) as f:

and see whether that helps.


By the way, do you need a cryptographic checksum? sha256 is expensive to
calculate. If all you are doing is trying to match files which could have
the same content, you could use a cheaper hash, like md5 or even crc32.
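
As a rough sketch of what the cheaper options look like when fed in chunks (the helper names here are made up for illustration): `zlib.crc32` takes the running value as its second argument, and `hashlib.md5` has the same `update` interface as `sha256`. Keep in mind that crc32 is only 32 bits wide, so accidental collisions are far more likely than with md5 or sha256.

```python
import hashlib
import zlib

def crc32_of_chunks(chunks):
    """Running CRC-32 over an iterable of bytes objects."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # feed the previous value back in
    return crc & 0xFFFFFFFF

def md5_of_chunks(chunks):
    """MD5 hex digest over an iterable of bytes objects."""
    h = hashlib.md5()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

data = b"hello world"
chunks = [data[:5], data[5:]]
print(hex(crc32_of_chunks(chunks)))
print(md5_of_chunks(chunks))
```

Both helpers give the same result as hashing the whole buffer in one call, so the chunk size does not affect the checksum.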



-- 
Steven



#102948

From: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date: 2016-02-15 08:05 +0000
Message-ID: <n9s0sp$u3i$1@gioia.aioe.org>
In reply to: #102899

At 02:21 on 14-02-2016, Steven D'Aprano wrote:
> On Sun, 14 Feb 2016 06:29 am, Paulo da Silva wrote:
...

Thanks, Steven, for your advice.
This is a small script to solve a specific problem.
It will be used in the future to solve other similar problems, probably
with small changes.
When I found it eating memory, and it still ate memory after I fixed what
I thought was the main cause, I suspected something less obvious. In the
end it seems there is nothing wrong with it (see my other post).

> That's your first clue that, perhaps, you should be reading in relatively
> small blocks, more like 4K than 4MB. Sure enough, a quick bit of googling
> shows that typically you should read from files in small-ish chunks, and
> that trying to read in large chunks is often counter-productive:
> 
> https://duckduckgo.com/html/?q=file+read+buffer+size
> 
> The first three links all talk about optimal sizes being measured in small
> multiples of 4K, not 40MB.
>
I didn't know about this!
Most of my files are about ~>30MB. So I chose 40MB to avoid python
loops. After all, python should be able to optimize those things.

> You can try to increase the system buffer, by changing the "open" line to:
> 
>     with open(pathname, 'rb', buffering=40*M) as f:
> 
That is a different matter. The amount of data I request is one thing;
choosing the "real" buffer size is another. (I didn't know about this
argument - thanks.)
...

> By the way, do you need a cryptographic checksum? sha256 is expensive to
> calculate. If all you are doing is trying to match files which could have
> the same content, you could use a cheaper hash, like md5 or even crc32.
I don't know the probability of collision for each of them. The script
has sha256 and md5 as options. For the run that failed I had chosen
sha256. I didn't check whether it takes much more time. A collision might
cause data loss. So ...

Thank you.
Paulo



#102908

From: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date: 2016-02-14 07:04 +0000
Message-ID: <n9p8u3$130l$1@gioia.aioe.org>
In reply to: #102888

I was unable to reproduce the situation using a simple program just
walking through all files>4K, with or without the seek, and computing
their shasums.
Only some fluctuations of about 500MB in memory consumption.

I'll look at this when I get more time, taking into consideration the
suggestions posted here.

For the time being, my work is done. With a small buffer size (16k) the
results produced were correct and no memory was leaked!

If I can find any explanation (if not embarrassing :-) ), I'll post it here.

Thank you all.
Paulo



#102909

From: INADA Naoki <songofacandy@gmail.com>
Date: 2016-02-14 18:49 +0900
Message-ID: <mailman.104.1455443395.22075.python-list@python.org>
In reply to: #102908

tracemalloc module may help you to investigate leaks.
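
A minimal sketch of how that might look: take a snapshot before and after the suspect work and compare them. The list of byte strings below is just a stand-in for the real workload. Note that tracemalloc only sees allocations made through Python's own allocators, so memory held by the OS page cache will not show up.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Stand-in for the suspect workload: about 1 MB that is kept alive.
leaky = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
top = after.compare_to(before, 'lineno')  # differences grouped by source line
for stat in top[:5]:
    print(stat)
tracemalloc.stop()
```

The lines at the top of the report are the ones whose allocations grew most between the two snapshots.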
2016/02/14 4:05 PM "Paulo da Silva" <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>:

> I was unable to reproduce the situation using a simple program just
> walking through all files>4K, with or without the seek, and computing
> their shasums.
> Only some fluctuations of about 500MB in memory consumption.
>
> I'll look at this when I get more time, taking in consideration the
> suggestions here posted.
>
> For the time being, my work is done. With a small buffer size (16k) the
> results produced were correct and no memory was leaked!
>
> If I can find any explanation (if not embarrassing :-) ), I'll post it
> here.
>
> Thank you all.
> Paulo
>
> --
> https://mail.python.org/mailman/listinfo/python-list
>



#102945

From: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date: 2016-02-15 07:38 +0000
Message-ID: <n9rv8t$rr3$1@gioia.aioe.org>
In reply to: #102909

At 09:49 on 14-02-2016, INADA Naoki wrote:
> tracemalloc module may help you to investigate leaks.
> 2016/02/14 4:05 PM "Paulo da Silva" <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>:
> 
Thanks. I didn't know it!
Paulo



#102933

From: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date: 2016-02-15 02:21 +0000
Message-ID: <n9rcn0$9pi$1@gioia.aioe.org>
In reply to: #102908

At 07:04 on 14-02-2016, Paulo da Silva wrote:
> I was unable to reproduce the situation using a simple program just
> walking through all files>4K, with or without the seek, and computing
> their shasums.
> Only some fluctuations of about 500MB in memory consumption.

Today I gave the program another try using a 40MB bfsz under the same
circumstances, except for a reboot beforehand, and, surprisingly, it worked
just fine. The fluctuations in memory were of the same magnitude as
those of the simple program. No swapping at all!

Some history ...

The first time the problem occurred, I found an issue that I thought
could cause that behavior. The equivalent of the statement
h=hashlib.sha256() was outside the per-file loop.
I had put it in the argument parser because the user can choose the
algorithm to use, and that way I didn't have to test the option for each
file. Apart from the memory leak, hashlib seemed to work fine.
After the "digest" I started feeding it with the contents of another file.
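
For what it's worth, here is a small sketch of what reusing one hash object across files implies for the digests, using two made-up byte strings in place of file contents. It says nothing about memory use; it only shows that `digest()` does not reset the object, so later digests cover everything fed in so far.

```python
import hashlib

file_a = b"contents of file A"   # stand-ins for two files' contents
file_b = b"contents of file B"

shared = hashlib.sha256()        # created once, outside the per-file loop
shared.update(file_a)
digest_a = shared.digest()       # digest() leaves the internal state untouched
shared.update(file_b)
digest_b_shared = shared.digest()  # really the hash of A followed by B

fresh = hashlib.sha256(file_b).digest()
print(digest_b_shared == fresh)
print(digest_b_shared == hashlib.sha256(file_a + file_b).digest())
```

Creating a new hash object per file, as in the posted getHash, avoids this.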

1. Is it possible that the memory exhaustion caused some sort of problem
that left the system in a state that made gc malfunction on the next runs?

2. The filesystem is btrfs.
So, is it possible that some "fight" among btrfs, gc and my program left
gc unable to free memory in time?
This seems unlikely because I was only reading and the filesystem is
mounted with noatime. However, I don't know whether btrfs does some
reorganization work during reads.
Anyway, I ran the failing tests at least 3 times, one of them updating
hashlib with 8KB chunks and another with a 1MB bfsz. This last one ran
to the end but used ~5GB of swap.

3. There is another small change I made since then. Some (few) times
hashlib was fed with empty data (zero length). That was fixed.

So far I tried the program twice and it ran perfectly.

When I need to run it again in the future, once this confusion has settled,
and if the same problem occurs, I'll look at things more carefully.

Once more thank you all.
Paulo



#102950

From: Johannes Bauer <dfnsonfsduifb@gmx.de>
Date: 2016-02-15 09:12 +0100
Message-ID: <n9s1a0$s5p$1@news.albasani.net>
In reply to: #102933

On 15.02.2016 03:21, Paulo da Silva wrote:

> So far I tried the program twice and it ran perfectly.

I think you measured your RAM consumption wrong.

Linux uses all free RAM as HDD cache. That's what is used in "buffers".
That is, it's not "free", but it would be free if any process would
sbrk(). My guess is that you only looked at the "free" number going down
and concluded your program is eating your RAM. Which it wasn't.
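
To make the distinction concrete, here is a rough sketch that parses text in the /proc/meminfo format and adds buffers and cache back onto the "free" figure. The sample numbers are made up; on a real Linux system the text would come from reading /proc/meminfo. Not every cached page is actually reclaimable, so treat the sum as an approximation.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(':')
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

# Made-up numbers for illustration only.
sample = """\
MemTotal:        2048000 kB
MemFree:           40000 kB
Buffers:           60000 kB
Cached:          1200000 kB
SwapTotal:       4096000 kB
SwapFree:        4096000 kB
"""

info = parse_meminfo(sample)
reclaimable = info["MemFree"] + info["Buffers"] + info["Cached"]
print("free:", info["MemFree"], "kB; free + buffers + cache:", reclaimable, "kB")
```

A low MemFree alongside a large Cached value and untouched swap is the pattern described above; a shrinking cache together with growing swap use points the other way.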

Cheers
Johannes

-- 
>> Where EXACTLY did you predict the earthquake again?
> At least not publicly!
Ah, the newest and to this day most brilliant stunt of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>



#102972

From: Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>
Date: 2016-02-15 17:29 +0000
Message-ID: <n9t1t1$plv$1@gioia.aioe.org>
In reply to: #102950

At 08:12 on 15-02-2016, Johannes Bauer wrote:
> On 15.02.2016 03:21, Paulo da Silva wrote:
> 
>> So far I tried the program twice and it ran perfectly.
> 
> I think you measured your RAM consumption wrong.
> 
> Linux uses all free RAM as HDD cache. That's what is used in "buffers".
> That is, it's not "free", but it would be free if any process would
> sbrk(). My guess is that you only looked at the "free" number going down
> and concluded your program is eating your RAM. Which it wasn't.
> 
No, for sure.
I monitored (using atop) free, cache and swap. In general, because I
only have 2GB, freemem is almost always a few tens of MB. Remaining
"free" memory is in Cache. When Cache goes low it begins to swap out.

Paulo


