| Started by | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| First post | 2016-02-13 19:29 +0000 |
| Last post | 2016-02-15 17:29 +0000 |
| Articles | 13 — 5 participants |
What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-13 19:29 +0000
Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-13 22:26 +0000
Re: What is heating the memory here? hashlib? Chris Angelico <rosuav@gmail.com> - 2016-02-14 09:45 +1100
Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-14 01:44 +0000
Re: What is heating the memory here? hashlib? Chris Angelico <rosuav@gmail.com> - 2016-02-14 13:01 +1100
Re: What is heating the memory here? hashlib? Steven D'Aprano <steve@pearwood.info> - 2016-02-14 13:21 +1100
Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-15 08:05 +0000
Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-14 07:04 +0000
Re: What is heating the memory here? hashlib? INADA Naoki <songofacandy@gmail.com> - 2016-02-14 18:49 +0900
Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-15 07:38 +0000
Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-15 02:21 +0000
Re: What is heating the memory here? hashlib? Johannes Bauer <dfnsonfsduifb@gmx.de> - 2016-02-15 09:12 +0100
Re: What is heating the memory here? hashlib? Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> - 2016-02-15 17:29 +0000
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-13 19:29 +0000 |
| Subject | What is heating the memory here? hashlib? |
| Message-ID | <n9o06t$1hjo$1@gioia.aioe.org> |
Hello all.

I'm running into a very strange (for me at least) problem.

    def getHash(self):
        bfsz=File.blksz
        h=hashlib.sha256()
        hu=h.update
        with open(self.getPath(),'rb') as f:
            f.seek(File.hdrsz) # Skip header
            b=f.read(bfsz)
            while len(b)>0:
                hu(b)
                b=f.read(bfsz)
        fhash=h.digest()
        return fhash

hdrsz is always 4K here. All files are greater than 4K.

If I use a 40MB bfsz, this takes all my memory very quickly. After a few hundred files it begins to swap, ending up with the program being killed (BTW, I'm using Linux, Kubuntu 14.04).

If I reduce bfsz to 1MB, it successfully completes my full test (~100000 files), reaching about 6GB of memory.

If I reduce bfsz further to 16KB, there is no noticeable memory taken!!

I have tried the following code, but it didn't fix the problem:

    def getHash(self):
        bfsz=File.blksz
        h=hashlib.sha256()
        hu=h.update
        with open(self.getPath(),'rb') as f:
            husz=8192
            f.seek(File.hdrsz) # Skip header
            b=f.read(bfsz)
            while len(b)>0:
                for i in range(0,len(b),husz):
                    hu(b[i:i+husz])
                b=f.read(bfsz)
        fhash=h.digest()
        return fhash

What is wrong here?!
Thanks for any help/comments.
Paulo
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-13 22:26 +0000 |
| Message-ID | <n9oaja$39n$1@gioia.aioe.org> |
| In reply to | #102888 |
I meant eating! :-)
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-14 09:45 +1100 |
| Message-ID | <mailman.99.1455403508.22075.python-list@python.org> |
| In reply to | #102893 |
On Sun, Feb 14, 2016 at 9:26 AM, Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
> I meant eating! :-)

Heh, "heating" works too - the more you use memory, the more it heats up :)

I'm assuming this is inside "class File:" and you have class members for your constants like header size? There's no context for the name "File" otherwise.

What happens if, after hashing each file (and returning from this function), you call gc.collect()? If that reduces your RAM usage, you have reference cycles somewhere.

ChrisA
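
The check suggested above can be sketched as follows. This is only an illustration under stated assumptions: `Holder` and `collect_after` are hypothetical names that do not appear in the thread, and the deliberate self-reference stands in for whatever cycle the real program might contain. `gc.collect()` returns the number of unreachable objects it found, so a non-zero result after a unit of work hints that reference cycles were keeping memory alive.

```python
import gc


class Holder:
    """Hypothetical object that keeps a buffer alive through a self-reference."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self.myself = self  # deliberate reference cycle


def collect_after(work):
    """Run work() with automatic collection paused, then report what a
    full gc.collect() finds (the number of unreachable objects)."""
    gc.collect()  # start from a clean slate
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        work()
        return gc.collect()
    finally:
        if was_enabled:
            gc.enable()


# A cycle survives reference counting, so the collector has to reclaim it.
found = collect_after(lambda: Holder(1024 * 1024))
print("unreachable objects found:", found)
```

If the count stays at zero after each file and memory still grows, cycles are probably not the cause and the search has to continue elsewhere.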
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-14 01:44 +0000 |
| Message-ID | <n9om6l$hkt$1@gioia.aioe.org> |
| In reply to | #102894 |
At 22:45 on 13-02-2016, Chris Angelico wrote:
> On Sun, Feb 14, 2016 at 9:26 AM, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
>> I meant eating! :-)
>
> Heh, "heating" works too - the more you use memory, the more it heats up :)

:-) It is heating my head!

...
>
> What happens if, after hashing each file (and returning from this
> function), you call gc.collect()? If that reduces your RAM usage, you
> have reference cycles somewhere.
>

I have used gc and del. No luck.

The most probable cause seems to be hashlib not correctly handling big buffer updates. I am working on one computer and testing on another. As for the second attempt, maybe I somehow forgot to transfer the change to the other computer. Unlikely but possible.

Anyway, it is doing its job right now with bfsz=16KB (this takes a few hours). No memory leaks anymore.

I'll address this problem in the near future, maybe when I move to the new Kubuntu 16.04 LTS. This will bring new SW releases, and the problem may have been fixed.

Thanks
Paulo
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-14 13:01 +1100 |
| Message-ID | <mailman.102.1455415322.22075.python-list@python.org> |
| In reply to | #102897 |
On Sun, Feb 14, 2016 at 12:44 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a_ns@netcabo.pt> wrote:
>> What happens if, after hashing each file (and returning from this
>> function), you call gc.collect()? If that reduces your RAM usage, you
>> have reference cycles somewhere.
>>
> I have used gc and del. No luck.
>
> The most probable cause seems to be hashlib not correctly handling big
> buffers updates. I am working in a computer and testing in another. For
> the second part may be somehow I forgot to transfer the change to the
> other computer. Unlikely but possible.
I'd like to see the problem boiled down to just the hashlib calls.
Something like this:

    import hashlib
    data = b"*" * 4*1024*1024
    lastdig = None
    while "simulating files":
        h = hashlib.sha256()
        hu = h.update
        for chunk in range(100):
            hu(data)
        dig = h.hexdigest()
        if lastdig is None:
            lastdig = dig
            print("Digest:",dig)
        else:
            if lastdig != dig:
                print("Digest fail!")

Running this on my system (Python 3.6 on Debian Linux) produces a
long-running process with stable memory usage, which is exactly what
I'd expect. Even using different data doesn't change that:

    import hashlib
    import itertools
    byte = itertools.count()
    data = b"*" * 4*1024*1024
    while "simulating files":
        h = hashlib.sha256()
        hu = h.update
        for chunk in range(100):
            hu(data + bytes([next(byte)&255]))
        dig = h.hexdigest()
        print("Digest:",dig)

Somewhere between my code and yours is something that consumes all
that memory. Can you neuter the actual disk reading (replacing it with
constants, like this) and make a complete and shareable program that
leaks all that memory?
ChrisA
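
The loops above run until interrupted. A bounded variant that records the process's peak resident set size after each simulated file makes "stable memory usage" something that can be read off a list. This is a rough sketch, not part of the original post: `peak_rss` and `simulate` are hypothetical names, and it assumes a Unix system where the standard `resource` module is available (on Linux `ru_maxrss` is reported in kilobytes).

```python
import hashlib
import resource


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def simulate(files=20, chunks=10, chunk_size=1024 * 1024):
    """Hash `files` simulated files of `chunks` blocks each and
    return the last digest plus the peak RSS seen after every file."""
    data = b"*" * chunk_size
    readings = []
    digest = None
    for _ in range(files):
        h = hashlib.sha256()  # fresh hash object per simulated file
        for _ in range(chunks):
            h.update(data)
        digest = h.hexdigest()
        readings.append(peak_rss())
    return digest, readings


digest, readings = simulate(files=5, chunks=4, chunk_size=64 * 1024)
print(digest)
print(readings)  # a leak would show up as steadily increasing values
```

Because `ru_maxrss` is a high-water mark, the readings can only stay flat or rise; a flat tail after the first few files suggests the hashing loop itself is not accumulating memory.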
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-02-14 13:21 +1100 |
| Message-ID | <56bfe49e$0$1587$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #102888 |
On Sun, 14 Feb 2016 06:29 am, Paulo da Silva wrote:
> Hello all.
>
> I'm running in a very strange (for me at least) problem.
>
> def getHash(self):
>     bfsz=File.blksz
>     h=hashlib.sha256()
>     hu=h.update
>     with open(self.getPath(),'rb') as f:
>         f.seek(File.hdrsz) # Skip header
>         b=f.read(bfsz)
>         while len(b)>0:
>             hu(b)
>             b=f.read(bfsz)
>     fhash=h.digest()
>     return fhash

This is a good, and tricky, question! Unfortunately, this sort of
performance issue may depend on the specific details of your system.
You can start by telling us what version of Python you are running. You've
already said you're running on Kubuntu, which makes it Linux. Is that a
32-bit or 64-bit version?
Next, let's see if we can simplify the code and make it runnable by anyone,
in the spirit of http://www.sscce.org/

    import hashlib
    K = 1024
    M = 1024*K

    def get_hash(pathname, size):
        h = hashlib.sha256()
        with open(pathname, 'rb') as f:
            f.seek(4*K)
            b = f.read(size)
            while b:
                h.update(b)
                b = f.read(size)
        return h.digest()

Does this simplified version demonstrate the same problem?
What happens if you eliminate the actual hashing?

    def get_hash(pathname, size):
        with open(pathname, 'rb') as f:
            f.seek(4*K)
            b = f.read(size)
            while b:
                b = f.read(size)
        return "1234"*16

This may allow you to determine whether the problem lies in *reading* the
files or *hashing* the files.
Be warned: if you read from the same file over and over again, Linux will
cache that file, and your tests will not reflect the behaviour when you
read thousands of different files from disk rather than from memory cache.
What sort of media are you reading from?
- hard drive?
- flash drive or USB stick?
- solid state disk?
- something else?
They will all have different read characteristics.
What happens when you call f.read(size)? By default, Python uses the
following buffering strategy for binary files:
* Binary files are buffered in fixed-size chunks; the size of the buffer
is chosen using a heuristic trying to determine the underlying
device's "block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
On many systems, the buffer will typically be 4096 or 8192 bytes long.
See help(open).
That's your first clue that, perhaps, you should be reading in relatively
small blocks, more like 4K than 4MB. Sure enough, a quick bit of googling
shows that typically you should read from files in small-ish chunks, and
that trying to read in large chunks is often counter-productive:
https://duckduckgo.com/html/?q=file+read+buffer+size
The first three links all talk about optimal sizes being measured in small
multiples of 4K, not 40MB.
You can try to increase the system buffer, by changing the "open" line to:

    with open(pathname, 'rb', buffering=40*M) as f:

and see whether that helps.
By the way, do you need a cryptographic checksum? sha256 is expensive to
calculate. If all you are doing is trying to match files which could have
the same content, you could use a cheaper hash, like md5 or even crc32.
--
Steven
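
For comparing the cheaper checksums mentioned above, one option is to compute them side by side in a single pass over small blocks. This is a minimal sketch under assumptions: `file_checksums` and the 64 KB block size are illustrative choices, not something from the thread, and `hashlib.md5()` may be unavailable on builds that restrict it. `zlib.crc32` accepts a running value as its second argument, which is what allows it to be fed incrementally.

```python
import hashlib
import zlib


def file_checksums(path, blksz=64 * 1024):
    """Return (crc32, md5 hex digest, sha256 hex digest) for a file,
    reading it blksz bytes at a time."""
    crc = 0
    md5 = hashlib.md5()
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blksz), b""):
            crc = zlib.crc32(block, crc)  # running CRC over all blocks
            md5.update(block)
            sha.update(block)
    return crc & 0xFFFFFFFF, md5.hexdigest(), sha.hexdigest()
```

A 32-bit CRC has far fewer possible values than either digest, so if a collision could lead to data loss it is probably safer to treat a CRC match only as a reason to compare further, not as proof of equality.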
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-15 08:05 +0000 |
| Message-ID | <n9s0sp$u3i$1@gioia.aioe.org> |
| In reply to | #102899 |
At 02:21 on 14-02-2016, Steven D'Aprano wrote:
> On Sun, 14 Feb 2016 06:29 am, Paulo da Silva wrote:
...

Thanks Steven for your advice.

This is a small script to solve a specific problem. It will be used in the future to solve other similar problems, probably with small changes. When I found it eating memory, and it still ate memory after I fixed what I thought was the first reason for that, I suspected something less obvious. After all, it seems there is nothing wrong with it (see my other post).

> That's your first clue that, perhaps, you should be reading in relatively
> small blocks, more like 4K than 4MB. Sure enough, a quick bit of googling
> shows that typically you should read from files in small-ish chunks, and
> that trying to read in large chunks is often counter-productive:
>
> https://duckduckgo.com/html/?q=file+read+buffer+size
>
> The first three links all talk about optimal sizes being measured in small
> multiples of 4K, not 40MB.
>

I didn't know about this! Most of my files are about ~>30MB, so I chose 40MB to avoid Python loops. After all, Python should be able to optimize those things.

> You can try to increase the system buffer, by changing the "open" line to:
>
> with open(pathname, 'rb', buffering=40*M) as f:
>

This is another matter. One thing is the amount of data I request; another is choosing the "real" buffer size. (I didn't know about this argument - thanks.)

...
> By the way, do you need a cryptographic checksum? sha256 is expensive to
> calculate. If all you are doing is trying to match files which could have
> the same content, you could use a cheaper hash, like md5 or even crc32.

I don't know the probability of collision for each of them. The script has sha256 and md5 as options. For the failed execution I had chosen sha256. I didn't check whether it takes much more time. A collision might cause data loss. So ...

Thank you.
Paulo
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-14 07:04 +0000 |
| Message-ID | <n9p8u3$130l$1@gioia.aioe.org> |
| In reply to | #102888 |
I was unable to reproduce the situation using a simple program that just walks through all files>4K, with or without the seek, and computes their shasums. There were only some fluctuations of about 500MB in memory consumption.

I'll look at this when I get more time, taking into consideration the suggestions posted here.

For the time being, my work is done. With a small buffer size (16k) the results produced were correct and no memory was leaked!

If I can find any explanation (if not embarrassing :-) ), I'll post it here.

Thank you all.
Paulo
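
The simple program described above was not posted, so the following is only a guess at its shape. All names (`iter_big_files`, `sha256_after_header`, `HDRSZ`, `BLKSZ`) are hypothetical; the 4K header and the 16K buffer are the values mentioned in the thread.

```python
import hashlib
import os

HDRSZ = 4 * 1024   # header size mentioned in the thread
BLKSZ = 16 * 1024  # the small buffer size that behaved well


def iter_big_files(root, min_size=HDRSZ):
    """Yield paths of regular files under root that are larger than min_size."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and os.path.getsize(path) > min_size:
                yield path


def sha256_after_header(path, hdrsz=HDRSZ, blksz=BLKSZ):
    """SHA-256 of everything after the first hdrsz bytes, read blksz at a time."""
    h = hashlib.sha256()  # one fresh hash object per file
    with open(path, "rb") as f:
        f.seek(hdrsz)
        for block in iter(lambda: f.read(blksz), b""):
            h.update(block)
    return h.hexdigest()
```

Looping over `iter_big_files(some_directory)` and printing `sha256_after_header(path)` for each path would approximate the test run; passing `hdrsz=0` corresponds to the "without the seek" variant.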
| From | INADA Naoki <songofacandy@gmail.com> |
|---|---|
| Date | 2016-02-14 18:49 +0900 |
| Message-ID | <mailman.104.1455443395.22075.python-list@python.org> |
| In reply to | #102908 |
tracemalloc module may help you to investigate leaks.

2016/02/14 4:05 PM "Paulo da Silva" <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>:
> I was unable to reproduce the situation using a simple program just
> walking through all files>4K, with or without the seek, and computing
> their shasums.
> Only some fluctuations of about 500MB in memory consumption.
>
> I'll look at this when I get more time, taking in consideration the
> suggestions here posted.
>
> For the time being, my work is done. With a small buffer size (16k) the
> results produced were correct and no memory was leaked!
>
> If I can find any explanation (if not embarrassing :-) ), I'll post it
> here.
>
> Thank you all.
> Paulo
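
A minimal sketch of how tracemalloc might be applied here, assuming hypothetical helper names (`hash_blocks`, `allocation_growth`) that are not from the thread. It takes a snapshot before and after a unit of work and lists the source lines whose allocations changed the most. Note that tracemalloc only sees memory requested through Python's own allocators, so memory held by the operating system's file cache or allocated directly by a C library would not appear.

```python
import hashlib
import tracemalloc


def hash_blocks(blocks):
    """SHA-256 over an iterable of bytes blocks."""
    h = hashlib.sha256()
    for block in blocks:
        h.update(block)
    return h.digest()


def allocation_growth(work, limit=5):
    """Run work() between two snapshots and return its result together
    with the `limit` largest per-line allocation differences."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    result = work()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    return result, after.compare_to(before, "lineno")[:limit]


result, stats = allocation_growth(lambda: hash_blocks([b"a" * 1024] * 10))
for stat in stats:
    print(stat)  # each entry shows file, line, and size difference
```

Running the real per-file loop as `work` and seeing the same lines grow on every pass would point to the code that is holding on to memory.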
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-15 07:38 +0000 |
| Message-ID | <n9rv8t$rr3$1@gioia.aioe.org> |
| In reply to | #102909 |
At 09:49 on 14-02-2016, INADA Naoki wrote:
> tracemalloc module may help you to investigate leaks.
> 2016/02/14 4:05 PM "Paulo da Silva" <p_s_d_a_s_i_l_v_a_ns@netcabo.pt>:
>

Thanks. I didn't know it!
Paulo
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-15 02:21 +0000 |
| Message-ID | <n9rcn0$9pi$1@gioia.aioe.org> |
| In reply to | #102908 |
At 07:04 on 14-02-2016, Paulo da Silva wrote:
> I was unable to reproduce the situation using a simple program just
> walking through all files>4K, with or without the seek, and computing
> their shasums.
> Only some fluctuations of about 500MB in memory consumption.

Today I gave the program another try using a 40MB bfsz under the same circumstances, except for a previous reboot, and, surprisingly, it worked just fine. The fluctuations in memory were of the same magnitude as those of the simple program. No swapping at all!

Some history ...

The first time the problem occurred, I found an issue that I thought could cause that behavior. An equivalent statement for h=hashlib.sha256() was outside the files loop. I had put it in the arguments parser because the user could choose the algorithm to use, and instead of testing the option for each file I put it there. Apart from the memory leak, hashlib seemed to work fine. After the "digest" I started feeding it with the contents of another file.

1. Is it possible that the memory exhaustion caused some sort of problem that left the system in a state that caused gc to malfunction on the next runs?

2. The filesystem is btrfs. So, is it possible that some "fight" among btrfs, gc and my program made gc unable to free memory in time? This seems unlikely because I was only reading and the filesystem is mounted with noatime. However, I don't know whether btrfs does some organization work during reads. Anyway, I ran the failing tests at least 3 times, one of them updating hashlib with 8KB chunks and another with a 1MB bfsz. This last one ran until the end but used ~5GB of swap.

3. There is another small change I have made since then. Some (few) times hashlib was fed with empty data (zero length). That was fixed.

So far I have tried the program twice and it ran perfectly.

When I need to run it in the future, out of this confusion, and if the same problem occurs, I'll try to look at things more carefully.

Once more, thank you all.
Paulo
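
The effect of creating the hash object once, outside the files loop, can be illustrated with in-memory data. This is a sketch with made-up byte strings standing in for file contents. `digest()` reports the hash of everything fed so far and does not reset the object, so a shared object yields a digest of the concatenation of all files seen up to that point.

```python
import hashlib

first = b"contents of file one"   # stand-ins for two files' data
second = b"contents of file two"

# One hash object shared across "files", as in the earlier version described above.
shared = hashlib.sha256()
shared.update(first)
digest_one = shared.digest()
shared.update(second)             # state still includes `first`
digest_two_shared = shared.digest()

# A fresh hash object per file, as in the posted getHash().
fresh = hashlib.sha256()
fresh.update(second)
digest_two_fresh = fresh.digest()

print(digest_two_shared == digest_two_fresh)                       # False
print(digest_two_shared == hashlib.sha256(first + second).digest())  # True
```

As far as I know a SHA-256 object keeps only a small fixed-size state, so this reuse would not by itself explain growing memory, but it does make each file's digest depend on every file hashed before it.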
| From | Johannes Bauer <dfnsonfsduifb@gmx.de> |
|---|---|
| Date | 2016-02-15 09:12 +0100 |
| Message-ID | <n9s1a0$s5p$1@news.albasani.net> |
| In reply to | #102933 |
On 15.02.2016 03:21, Paulo da Silva wrote:
> So far I tried the program twice and it ran perfectly.

I think you measured your RAM consumption wrong.

Linux uses all free RAM as HDD cache. That's what is used in "buffers". That is, it's not "free", but it would be free if any process would sbrk(). My guess is that you only looked at the "free" number going down and concluded your program is eating your RAM. Which it wasn't.

Cheers
Johannes

--
>> Where exactly did you say you had predicted the quake?
> At least not publicly!
Ah, the latest and to this day most brilliant stunt of our great cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$1@speranza.aioe.org>
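
The distinction drawn above between "free" memory and memory used as cache can be checked from the fields of `/proc/meminfo` on Linux. The sketch below parses that format from a string so it runs anywhere; `parse_meminfo`, `reclaimable_kb` and the `SAMPLE` figures are all invented for illustration and are not measurements from the thread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def reclaimable_kb(info):
    """Free memory plus buffers and page cache, which the kernel
    can usually hand back to processes on demand."""
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


# Invented sample for a machine with about 2 GB of RAM.
SAMPLE = """\
MemTotal:        2048000 kB
MemFree:           40000 kB
Buffers:           60000 kB
Cached:           900000 kB
SwapTotal:       4096000 kB
SwapFree:        4096000 kB
"""

info = parse_meminfo(SAMPLE)
print(info["MemFree"], reclaimable_kb(info))
```

On a live system the text would come from reading `/proc/meminfo`. A low `MemFree` together with a large `Cached` value and untouched swap matches the benign situation described above, whereas shrinking cache plus falling `SwapFree` would indicate real memory pressure.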
| From | Paulo da Silva <p_s_d_a_s_i_l_v_a_ns@netcabo.pt> |
|---|---|
| Date | 2016-02-15 17:29 +0000 |
| Message-ID | <n9t1t1$plv$1@gioia.aioe.org> |
| In reply to | #102950 |
At 08:12 on 15-02-2016, Johannes Bauer wrote:
> On 15.02.2016 03:21, Paulo da Silva wrote:
>
>> So far I tried the program twice and it ran perfectly.
>
> I think you measured your RAM consumption wrong.
>
> Linux uses all free RAM as HDD cache. That's what is used in "buffers".
> That is, it's not "free", but it would be free if any process would
> sbrk(). My guess is that you only looked at the "free" number going down
> and concluded your program is eating your RAM. Which it wasn't.
>

No, for sure. I monitored (using atop) free, cache and swap. In general, because I only have 2GB, free memory is almost always a few tens of MB. The remaining "free" memory is in Cache. When Cache goes low it begins to swap out.

Paulo