| Started by | andrea.gavana@gmail.com |
|---|---|
| First post | 2015-11-17 05:26 -0800 |
| Last post | 2015-11-18 02:31 -0800 |
| Articles | 10 — 4 participants |
cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-17 05:26 -0800
Re: cPickle.load vs. file.read+cPickle.loads on large binary files Peter Otten <__peter__@web.de> - 2015-11-17 15:14 +0100
Re: cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-17 06:20 -0800
Re: cPickle.load vs. file.read+cPickle.loads on large binary files Chris Angelico <rosuav@gmail.com> - 2015-11-18 02:20 +1100
Re: cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-17 07:31 -0800
Re: cPickle.load vs. file.read+cPickle.loads on large binary files Peter Otten <__peter__@web.de> - 2015-11-17 16:57 +0100
Re: cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-17 08:31 -0800
Re: cPickle.load vs. file.read+cPickle.loads on large binary files Peter Otten <__peter__@web.de> - 2015-11-17 18:20 +0100
Re: cPickle.load vs. file.read+cPickle.loads on large binary files Nagy László Zsolt <gandalf@shopzeus.com> - 2015-11-18 10:00 +0100
Re: cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-18 02:31 -0800
| From | andrea.gavana@gmail.com |
|---|---|
| Date | 2015-11-17 05:26 -0800 |
| Subject | cPickle.load vs. file.read+cPickle.loads on large binary files |
| Message-ID | <463ad93c-0186-4911-9cd1-92d97b9dc87b@googlegroups.com> |
Hello List,
I am working with relatively humongous binary files (created via cPickle), and I stumbled across some unexpected (for me) performance differences between two approaches I use to load those files:
1. Simply use cPickle.load(fid)
2. Read the file as binary using file.read() and then use cPickle.loads on the resulting output
In the snippet below, the MakePickle function is a dummy function that generates a relatively big binary file with cPickle (WARNING: around 3 GB) in the current directory. I am using NumPy arrays to make the file big but my original data structure is much more complicated, and things like HDF5 or databases are currently not an option - I'd like to stay with pickles.
The ReadPickle function simply uses cPickle.load(fid) on the opened binary file, and on my PC it takes about 2.3 seconds (approach 1).
The ReadPlusLoads function reads the file using file.read() and then uses cPickle.loads on the resulting output (approach 2). On my PC, file.read() takes 15 seconds (!!!) and cPickle.loads only 1.5 seconds.
What baffles me is the time it takes to read the file using file.read(): is there any way to slurp it all in one go (somehow) into a string ready for cPickle.loads without that much of an overhead?
Note that all of this has been done on Windows 7 64bit with Python 2.7 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).
Thank you in advance for all suggestions :-) .
Andrea.
# Begin code

import os, sys
import time
import cPickle
import numpy


class Dummy(object):

    def __init__(self, name):
        self.name = name
        self.data = numpy.random.rand(200, 600, 10)


def MakePickle():
    # Builds dummy.pkl (WARNING: around 3 GB) in the current directory.
    num_objects = 300
    list_of_objects = []

    for index in xrange(num_objects):
        dummy = Dummy('dummy_%d'%index)
        list_of_objects.append(dummy)

    fid = open('dummy.pkl', 'wb')

    start = time.time()
    out = cPickle.dumps(list_of_objects, cPickle.HIGHEST_PROTOCOL)
    end = time.time()
    print 'cPickle.dumps time:', end-start

    start = end
    fid.write(out)
    end = time.time()
    print 'file.write time:', end-start
    fid.close()


def ReadPickle():
    # Approach 1: unpickle directly from the open file object.
    fid = open('dummy.pkl', 'rb')

    start = time.time()
    out = cPickle.load(fid)
    end = time.time()
    print 'cPickle.load time:', end-start
    fid.close()


def ReadPlusLoads():
    # Approach 2: read the whole file into a string, then unpickle it.
    start = time.time()
    fid = open('dummy.pkl', 'rb')
    strs = fid.read()
    fid.close()
    end = time.time()
    print 'file.read time:', end-start

    start = end
    out = cPickle.loads(strs)
    end = time.time()
    print 'cPickle.loads time:', end-start


if __name__ == '__main__':
    # Run MakePickle() once beforehand to create dummy.pkl.
    ReadPickle()
    ReadPlusLoads()

# End code
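
For readers on Python 3, where cPickle has been folded into pickle, the two approaches can be compared at a much smaller scale with a sketch like the one below. It is not Andrea's script: the helper names (make_pickle, time_load, time_read_plus_loads), the temporary directory, and the payload of plain byte strings in place of NumPy arrays are all assumptions made to keep the example small and self-contained.

```python
import os
import pickle
import tempfile
import time


def make_pickle(path, num_objects=50, size=100_000):
    """Write a list of random byte strings as one pickle (about 5 MB by default)."""
    payload = [os.urandom(size) for _ in range(num_objects)]
    with open(path, "wb") as fid:
        pickle.dump(payload, fid, pickle.HIGHEST_PROTOCOL)
    return payload


def time_load(path):
    """Approach 1: let pickle read from the open file object."""
    start = time.perf_counter()
    with open(path, "rb") as fid:
        out = pickle.load(fid)
    return out, time.perf_counter() - start


def time_read_plus_loads(path):
    """Approach 2: read the whole file, then unpickle from memory."""
    start = time.perf_counter()
    with open(path, "rb") as fid:
        raw = fid.read()
    read_time = time.perf_counter() - start

    start = time.perf_counter()
    out = pickle.loads(raw)
    return out, read_time, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dummy.pkl")
        make_pickle(path)
        _, t_load = time_load(path)
        _, t_read, t_loads = time_read_plus_loads(path)
        print("pickle.load time:  %.3f s" % t_load)
        print("file.read time:    %.3f s" % t_read)
        print("pickle.loads time: %.3f s" % t_loads)
```

At this size the numbers say little about the multi-gigabyte case; raising num_objects and size moves the sketch closer to the situation described in the post.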
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-11-17 15:14 +0100 |
| Message-ID | <mailman.387.1447769670.16136.python-list@python.org> |
| In reply to | #98922 |
andrea.gavana@gmail.com wrote:

> Hello List,
>
> I am working with relatively humongous binary files (created via
> cPickle), and I stumbled across some unexpected (for me) performance
> differences between two approaches I use to load those files:
>
> 1. Simply use cPickle.load(fid)
>
> 2. Read the file as binary using file.read() and then use cPickle.loads on
> the resulting output
>
> In the snippet below, the MakePickle function is a dummy function that
> generates a relatively big binary file with cPickle (WARNING: around 3 GB)
> in the current directory. I am using NumPy arrays to make the file big but
> my original data structure is much more complicated, and things like HDF5
> or databases are currently not an option - I'd like to stay with pickles.
>
> The ReadPickle function simply uses cPickle.load(fid) on the opened binary
> file, and on my PC it takes about 2.3 seconds (approach 1).
>
> The ReadPlusLoads function reads the file using file.read() and then use
> cPickle.loads on the resulting output (approach 2). On my PC, the
> file.read() process takes 15 seconds (!!!) and the cPickle.loads only 1.5
> seconds.
>
> What baffles me is the time it takes to read the file using file.read():
> is there any way to slurp it all in one go (somehow) into a string ready
> for cPickle.loads without that much of an overhead?
>
> Note that all of this has been done on Windows 7 64bit with Python 2.7
> 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).
>
> Thank you in advance for all suggestions :-) .
>
> Andrea.
>
> if __name__ == '__main__':
>     ReadPickle()
>     ReadPlusLoads()

Do you get roughly the same times when you execute ReadPlusLoads() before ReadPickle()?
| From | andrea.gavana@gmail.com |
|---|---|
| Date | 2015-11-17 06:20 -0800 |
| Message-ID | <54330891-6568-4469-93ae-7a7825961500@googlegroups.com> |
| In reply to | #98923 |
Hi Peter,

On Tuesday, November 17, 2015 at 3:14:57 PM UTC+1, Peter Otten wrote:
> Andrea Gavana wrote:
>
> > Hello List,
> >
> > I am working with relatively humongous binary files (created via
> > cPickle), and I stumbled across some unexpected (for me) performance
> > differences between two approaches I use to load those files:
> >
> > 1. Simply use cPickle.load(fid)
> >
> > 2. Read the file as binary using file.read() and then use cPickle.loads on
> > the resulting output
> >
> > In the snippet below, the MakePickle function is a dummy function that
> > generates a relatively big binary file with cPickle (WARNING: around 3 GB)
> > in the current directory. I am using NumPy arrays to make the file big but
> > my original data structure is much more complicated, and things like HDF5
> > or databases are currently not an option - I'd like to stay with pickles.
> >
> > The ReadPickle function simply uses cPickle.load(fid) on the opened binary
> > file, and on my PC it takes about 2.3 seconds (approach 1).
> >
> > The ReadPlusLoads function reads the file using file.read() and then use
> > cPickle.loads on the resulting output (approach 2). On my PC, the
> > file.read() process takes 15 seconds (!!!) and the cPickle.loads only 1.5
> > seconds.
> >
> > What baffles me is the time it takes to read the file using file.read():
> > is there any way to slurp it all in one go (somehow) into a string ready
> > for cPickle.loads without that much of an overhead?
> >
> > Note that all of this has been done on Windows 7 64bit with Python 2.7
> > 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).
> >
> > Thank you in advance for all suggestions :-) .
> >
> > Andrea.
> >
> > if __name__ == '__main__':
> >     ReadPickle()
> >     ReadPlusLoads()
>
> Do you get roughly the same times when you execute ReadPlusLoads() before
> ReadPickle()?

Thank you for your answer. I do get similar timings when I swap the two functions: still 15 seconds to read the file via file.read() and 2.4 seconds (more or less as before) via cPickle.load(fid).

I thought that the order of operations might be an issue but apparently that was not the case...

Andrea.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-11-18 02:20 +1100 |
| Message-ID | <mailman.392.1447773612.16136.python-list@python.org> |
| In reply to | #98924 |
On Wed, Nov 18, 2015 at 1:20 AM, <andrea.gavana@gmail.com> wrote:
> Thank you for your answer. I do get similar timings when I swap the two functions, and specifically still 15 seconds to read the file via file.read() and 2.4 seconds (more or less as before) via cPickle.load(fid).
>
> I thought that the order of operations might be an issue but apparently that was not the case...

What if you call one of them twice and then the other? Just trying to rule out any possibility that it's a caching problem.

On my Linux box, running 2.7.9 64-bit, the two operations take roughly the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to loads). Are you able to run this off a RAM disk or something?

Most curious.

ChrisA
| From | andrea.gavana@gmail.com |
|---|---|
| Date | 2015-11-17 07:31 -0800 |
| Message-ID | <420ec4e9-6af6-49bd-a9f4-8b47ef1f136e@googlegroups.com> |
| In reply to | #98928 |
Hi Chris,

On Tuesday, November 17, 2015 at 4:20:34 PM UTC+1, Chris Angelico wrote:
> On Wed, Nov 18, 2015 at 1:20 AM, Andrea Gavana wrote:
> > Thank you for your answer. I do get similar timings when I swap the two functions, and specifically still 15 seconds to read the file via file.read() and 2.4 seconds (more or less as before) via cPickle.load(fid).
> >
> > I thought that the order of operations might be an issue but apparently that was not the case...
>
> What if you call one of them twice and then the other? Just trying to
> rule out any possibility that it's a caching problem.
>
> On my Linux box, running 2.7.9 64-bit, the two operations take roughly
> the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to
> loads). Are you able to run this off a RAM disk or something?
>
> Most curious.

Thank you for taking the time to run my little script. I have now run it with multiple combinations of calls (twice the first then the other, then vice versa, then alternating between the two functions multiple times, then three times the second and once the first, ...) with no luck at all.

The file.read() line of code always takes at least 14 seconds (in all the trials I have done), while the cPickle.load call ranges between 2.3 and 2.5 seconds.

I am puzzled to no end... Might there be something funny with my C libraries that use fread? I'm just shooting in the dark. I have a standard Python installation on Windows, nothing fancy :-(

Andrea.
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-11-17 16:57 +0100 |
| Message-ID | <mailman.395.1447775861.16136.python-list@python.org> |
| In reply to | #98930 |
andrea.gavana@gmail.com wrote:
> Hi Chris,
>
> On Tuesday, November 17, 2015 at 4:20:34 PM UTC+1, Chris Angelico wrote:
>> On Wed, Nov 18, 2015 at 1:20 AM, Andrea Gavana wrote:
>> > Thank you for your answer. I do get similar timings when I swap the two
>> > functions, and specifically still 15 seconds to read the file via
>> > file.read() and 2.4 seconds (more or less as before) via
>> > cPickle.load(fid).
>> >
>> > I thought that the order of operations might be an issue but apparently
>> > that was not the case...
>>
>> What if you call one of them twice and then the other? Just trying to
>> rule out any possibility that it's a caching problem.
>>
>> On my Linux box, running 2.7.9 64-bit, the two operations take roughly
>> the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to
>> loads). Are you able to run this off a RAM disk or something?
>>
>> Most curious.
>
>
> Thank you for taking the time to run my little script. I have now run it
> with multiple combinations of calls (twice the first then the other, then
> viceversa, then alternate between the two functions multiple times, then
> three times the second and once the first, ...) with no luck at all.
>
> The file.read() line of code takes always at minimum 14 seconds (in all
> the trials I have done), while the cPickle.load call ranges between 2.3
> and 2.5 seconds.
>
> I am puzzled with no end... Might there be something funny with my C
> libraries that use fread? I'm just shooting in the dark. I have a standard
> Python installation on Windows, nothing fancy :-(
Perhaps there is a size threshold? You could experiment with different block
sizes in the following f.read() replacement:

import functools

def read_chunked(f, size=2**20):
    # Call f.read(size) repeatedly until it returns "" (end of file).
    read = functools.partial(f.read, size)
    return "".join(iter(read, ""))
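
On Python 3 the same idea needs a bytes sentinel, because a file opened in binary mode returns b"" at end of file and not "". A minimal sketch of that port follows, under the assumption that the file object is opened in binary mode; the io.BytesIO object in the usage lines merely stands in for a real file to keep the example self-contained.

```python
import functools
import io


def read_chunked(f, size=2**20):
    """Read a binary file object to the end in blocks of `size` bytes."""
    read = functools.partial(f.read, size)
    # iter(callable, sentinel) keeps calling read() until it returns b"" (EOF).
    return b"".join(iter(read, b""))


if __name__ == "__main__":
    data = bytes(range(256)) * 1000
    assert read_chunked(io.BytesIO(data), size=4096) == data
```

With the str sentinel "" left in place, the loop would never see its stop value on Python 3 and would not terminate, so the sentinel type has to match what f.read() returns.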
| From | andrea.gavana@gmail.com |
|---|---|
| Date | 2015-11-17 08:31 -0800 |
| Message-ID | <f9a98231-e6df-4557-8ca1-20d9644825ca@googlegroups.com> |
| In reply to | #98932 |
Hi Peter,

On Tuesday, November 17, 2015 at 4:57:57 PM UTC+1, Peter Otten wrote:
> Andrea Gavana wrote:
>
> > Hi Chris,
> >
> > On Tuesday, November 17, 2015 at 4:20:34 PM UTC+1, Chris Angelico wrote:
> >> On Wed, Nov 18, 2015 at 1:20 AM, Andrea Gavana wrote:
> >> > Thank you for your answer. I do get similar timings when I swap the two
> >> > functions, and specifically still 15 seconds to read the file via
> >> > file.read() and 2.4 seconds (more or less as before) via
> >> > cPickle.load(fid).
> >> >
> >> > I thought that the order of operations might be an issue but apparently
> >> > that was not the case...
> >>
> >> What if you call one of them twice and then the other? Just trying to
> >> rule out any possibility that it's a caching problem.
> >>
> >> On my Linux box, running 2.7.9 64-bit, the two operations take roughly
> >> the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to
> >> loads). Are you able to run this off a RAM disk or something?
> >>
> >> Most curious.
> >
> >
> > Thank you for taking the time to run my little script. I have now run it
> > with multiple combinations of calls (twice the first then the other, then
> > viceversa, then alternate between the two functions multiple times, then
> > three times the second and once the first, ...) with no luck at all.
> >
> > The file.read() line of code takes always at minimum 14 seconds (in all
> > the trials I have done), while the cPickle.load call ranges between 2.3
> > and 2.5 seconds.
> >
> > I am puzzled with no end... Might there be something funny with my C
> > libraries that use fread? I'm just shooting in the dark. I have a standard
> > Python installation on Windows, nothing fancy :-(
>
> Perhaps there is a size threshold? You could experiment with different block
> sizes in the following f.read() replacement:
>
> def read_chunked(f, size=2**20):
>     read = functools.partial(f.read, size)
>     return "".join(iter(read, ""))

Thank you for the suggestion. I have used the read_chunked function in my experiments now and I can report a nice improvement. I have tried various chunk sizes, from 2**10 to 2**31-1, and in general the optimum lies around size=2**22, although it is essentially flat from 2**20 up to 2**30, with some interesting spikes at 45 seconds for 2**14 and 2**15 (see table below).

Using your suggestion, I got it down to 3.4 seconds (on average). Still well above the 2.3-2.5 seconds of cPickle.load, but better.

What I find most puzzling is that a pure file.read() (or your read_chunked variation) should normally be much faster than a cPickle.load (which does so many more things than just reading a file), shouldn't it?

Timing table:

Size (power of 2)    Read Time (seconds)
10                    9.14
11                    8.59
12                    7.67
13                    5.70
14                   46.06
15                   45.00
16                   24.80
17                   14.23
18                    8.95
19                    5.58
20                    3.41
21                    3.39
22                    3.34
23                    3.39
24                    3.39
25                    3.42
26                    3.43
27                    3.44
28                    3.48
29                    3.59
30                    3.72
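
A sweep like the one in the timing table can be scripted. The sketch below is only an assumption about how such a measurement might be set up, not the script that produced the numbers above: it is written for Python 3, the function name sweep_chunk_sizes and the small temporary test file are hypothetical, and whatever it prints will depend on the machine, the file size, and the OS file cache.

```python
import functools
import os
import tempfile
import time


def read_chunked(f, size):
    """Read a binary file object to the end in blocks of `size` bytes."""
    return b"".join(iter(functools.partial(f.read, size), b""))


def sweep_chunk_sizes(path, exponents):
    """Return a list of (exponent, seconds) for reading `path` in 2**exponent blocks."""
    results = []
    for exp in exponents:
        start = time.perf_counter()
        with open(path, "rb") as fid:
            data = read_chunked(fid, 2**exp)
        elapsed = time.perf_counter() - start
        # Sanity check: every block size must return the whole file.
        assert len(data) == os.path.getsize(path)
        results.append((exp, elapsed))
    return results


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 2**20))  # 8 MiB of test data
    try:
        print("Size (power of 2)  Read time (seconds)")
        for exp, seconds in sweep_chunk_sizes(tmp.name, range(10, 24)):
            print("%17d  %19.4f" % (exp, seconds))
    finally:
        os.remove(tmp.name)
```

Repeating each block size several times and keeping the minimum would make spikes like the ones at 2**14 and 2**15 easier to tell apart from one-off noise.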
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-11-17 18:20 +0100 |
| Message-ID | <mailman.399.1447780855.16136.python-list@python.org> |
| In reply to | #98936 |
andrea.gavana@gmail.com wrote:

>> > I am puzzled with no end... Might there be something funny with my C
>> > libraries that use fread? I'm just shooting in the dark. I have a
>> > standard Python installation on Windows, nothing fancy :-(
>>
>> Perhaps there is a size threshold? You could experiment with different
>> block sizes in the following f.read() replacement:
>>
>> def read_chunked(f, size=2**20):
>>     read = functools.partial(f.read, size)
>>     return "".join(iter(read, ""))
>
>
> Thank you for the suggestion. I have used the read_chunked function in my
> experiments now and I can report a nice improvements - I have tried
> various chunk sizes, from 2**10 to 2**31-1, and in general the optimum
> lies around size=2**22, although it is essentially flat from 2**20 up to
> 2**30 - with some interesting spikes at 45 seconds for 2**14 and 2**15
> (see table below).
>
> Using your suggestion, I got it down to 3.4 seconds (on average). Still at
> least twice slower than cPickle.load, but better.
>
> What I find most puzzling is that a pure file.read() (or your read_chunked
> variation) should normally be much faster than a cPickle.load (which does
> so many more things than just reading a file), shouldn't it?

That would have been my expectation, too. I had a quick look into the fileobject.c source and didn't see anything that struck me as suspicious. I think you should file a bug report so that an expert can check if there is an underlying problem in Python or if it is a matter of the OS.

> Timing table:
>
> Size (power of 2)    Read Time (seconds)
> 10                    9.14
> 11                    8.59
> 12                    7.67
> 13                    5.70
> 14                   46.06
> 15                   45.00
> 16                   24.80
> 17                   14.23
> 18                    8.95
> 19                    5.58
> 20                    3.41
> 21                    3.39
> 22                    3.34
> 23                    3.39
> 24                    3.39
> 25                    3.42
> 26                    3.43
> 27                    3.44
> 28                    3.48
> 29                    3.59
> 30                    3.72
| From | Nagy László Zsolt <gandalf@shopzeus.com> |
|---|---|
| Date | 2015-11-18 10:00 +0100 |
| Message-ID | <mailman.406.1447837229.16136.python-list@python.org> |
| In reply to | #98930 |
> Perhaps there is a size threshold? You could experiment with different block
> sizes in the following f.read() replacement:
>
> def read_chunked(f, size=2**20):
>     read = functools.partial(f.read, size)
>     return "".join(iter(read, ""))
>

On the win32 platform, my experience is that the fastest way to read a binary file from disk is the mmap module. You should try that too.
| From | andrea.gavana@gmail.com |
|---|---|
| Date | 2015-11-18 02:31 -0800 |
| Message-ID | <f2a42caa-2cde-455b-83bb-3beef8e4c52f@googlegroups.com> |
| In reply to | #98952 |
Hi,
On Wednesday, November 18, 2015 at 10:00:43 AM UTC+1, Nagy László Zsolt wrote:
> > Perhaps there is a size threshold? You could experiment with different block
> > sizes in the following f.read() replacement:
> >
> > def read_chunked(f, size=2**20):
> >     read = functools.partial(f.read, size)
> >     return "".join(iter(read, ""))
> >
> Under win32 platform, my experience is that the fastest way to read
> binary file from disk is the mmap module. You should try that too.
Thank you for your suggestion. I have tried that now, and with my naive approach I have done this:
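
import mmap

# filename is the path to the pickle file
start = time.time()
fid = open(filename, 'r+b')
# Slicing the whole map copies the file contents into a string.
strs = mmap.mmap(fid.fileno(), 0, access=mmap.ACCESS_READ)[:]
end = time.time()
print 'mmap.read time:', end-start
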
And it takes about 2.7 seconds. Not a bad improvement :-) . Unfortunately, when the file is on a network drive, all the other approaches ran at around 25-30 seconds loading time, while the mmap one clocks at 110 seconds :-(
Andrea.
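
The mmap approach can also be written for Python 3 with context managers, so that both the map and the file handle are released once the bytes have been copied out. This is a sketch under assumptions, not the code that was timed above: the function name load_via_mmap and the small temporary pickle are hypothetical, and the file is opened with 'rb' on the assumption that a read-only handle is sufficient for ACCESS_READ.

```python
import mmap
import os
import pickle
import tempfile


def load_via_mmap(path):
    """Map the file read-only, copy it into a bytes object, then unpickle."""
    with open(path, "rb") as fid:
        with mmap.mmap(fid.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[:]  # slicing copies the mapped file into bytes
    return pickle.loads(raw)


if __name__ == "__main__":
    payload = {"name": "dummy_0", "data": list(range(1000))}
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        pickle.dump(payload, tmp, pickle.HIGHEST_PROTOCOL)
    try:
        assert load_via_mmap(tmp.name) == payload
    finally:
        os.remove(tmp.name)
```

Whether this beats a plain read is likely to depend on where the file lives, as the network drive timings in the post suggest.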