
cPickle.load vs. file.read+cPickle.loads on large binary files

Started by	andrea.gavana@gmail.com
First post	2015-11-17 05:26 -0800
Last post	2015-11-18 02:31 -0800
Articles	10 — 4 participants


  cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-17 05:26 -0800
    Re: cPickle.load vs. file.read+cPickle.loads on large binary files Peter Otten <__peter__@web.de> - 2015-11-17 15:14 +0100
      Re: cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-17 06:20 -0800
        Re: cPickle.load vs. file.read+cPickle.loads on large binary files Chris Angelico <rosuav@gmail.com> - 2015-11-18 02:20 +1100
          Re: cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-17 07:31 -0800
            Re: cPickle.load vs. file.read+cPickle.loads on large binary files Peter Otten <__peter__@web.de> - 2015-11-17 16:57 +0100
              Re: cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-17 08:31 -0800
                Re: cPickle.load vs. file.read+cPickle.loads on large binary files Peter Otten <__peter__@web.de> - 2015-11-17 18:20 +0100
            Re: cPickle.load vs. file.read+cPickle.loads on large binary files Nagy László Zsolt <gandalf@shopzeus.com> - 2015-11-18 10:00 +0100
              Re: cPickle.load vs. file.read+cPickle.loads on large binary files andrea.gavana@gmail.com - 2015-11-18 02:31 -0800

#98922 — cPickle.load vs. file.read+cPickle.loads on large binary files

From	andrea.gavana@gmail.com
Date	2015-11-17 05:26 -0800
Subject	cPickle.load vs. file.read+cPickle.loads on large binary files
Message-ID	<463ad93c-0186-4911-9cd1-92d97b9dc87b@googlegroups.com>

Hello List,

     I am working with relatively humongous binary files (created via cPickle), and I stumbled across some unexpected (for me) performance differences between two approaches I use to load those files:

1. Simply use cPickle.load(fid)

2. Read the file as binary using file.read() and then use cPickle.loads on the resulting output

In the snippet below, the MakePickle function is a dummy function that generates a relatively big binary file with cPickle (WARNING: around 3 GB) in the current directory. I am using NumPy arrays to make the file big but my original data structure is much more complicated, and things like HDF5 or databases are currently not an option - I'd like to stay with pickles.

The ReadPickle function simply uses cPickle.load(fid) on the opened binary file, and on my PC it takes about 2.3 seconds (approach 1).

The ReadPlusLoads function reads the file using file.read() and then uses cPickle.loads on the resulting output (approach 2). On my PC, the file.read() process takes 15 seconds (!!!) and the cPickle.loads only 1.5 seconds.

What baffles me is the time it takes to read the file using file.read(): is there any way to slurp it all in one go (somehow) into a string ready for cPickle.loads without that much of an overhead?

Note that all of this has been done on Windows 7 64bit with Python 2.7 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).

Thank you in advance for all suggestions :-) .

Andrea.


# Begin code

import os, sys
import time
import cPickle
import numpy


class Dummy(object):

    def __init__(self, name):

        self.name = name
        self.data = numpy.random.rand(200, 600, 10)


def MakePickle():

    num_objects = 300
    list_of_objects = []

    for index in xrange(num_objects):
        dummy = Dummy('dummy_%d'%index)
        list_of_objects.append(dummy)

    fid = open('dummy.pkl', 'wb')

    start = time.time()
    out = cPickle.dumps(list_of_objects, cPickle.HIGHEST_PROTOCOL)
    end = time.time()
    print 'cPickle.dumps time:', end-start
    start = end
    fid.write(out)
    end = time.time()
    print 'file.write time:', end-start
    fid.close()


def ReadPickle():

    fid = open('dummy.pkl', 'rb')

    start = time.time()
    out = cPickle.load(fid)
    end = time.time()
    print 'cPickle.load time:', end-start
    fid.close()


def ReadPlusLoads():

    start = time.time()
    fid = open('dummy.pkl', 'rb')
    strs = fid.read()
    fid.close()
    end = time.time()
    print 'file.read time:', end-start
    start = end
    out = cPickle.loads(strs)
    end = time.time()
    print 'cPickle.loads time:', end-start


if __name__ == '__main__':
    # MakePickle() must have been run once beforehand to create dummy.pkl (around 3 GB).
    ReadPickle()
    ReadPlusLoads()

# End code
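
A scaled-down sketch of the same comparison, for anyone who wants to try the measurement without writing a 3 GB file. It is a variant under stated assumptions, not the original script: it uses the Python 3 `pickle` module instead of `cPickle`, random `bytes` payloads instead of NumPy arrays, and the hypothetical names `make_pickle`, `time_load` and `time_read_plus_loads`. A file this small is unlikely to reproduce the 15-second read; only the structure of the measurement carries over.

```python
import os
import pickle
import tempfile
import time


def make_pickle(path, num_objects=50, payload_size=100000):
    """Write a list of random byte strings to `path` with the highest protocol."""
    objects = [os.urandom(payload_size) for _ in range(num_objects)]
    with open(path, "wb") as fid:
        pickle.dump(objects, fid, pickle.HIGHEST_PROTOCOL)
    return objects


def time_load(path):
    """Approach 1: let pickle read directly from the open file object."""
    start = time.perf_counter()
    with open(path, "rb") as fid:
        out = pickle.load(fid)
    return out, time.perf_counter() - start


def time_read_plus_loads(path):
    """Approach 2: read the whole file first, then unpickle from memory."""
    start = time.perf_counter()
    with open(path, "rb") as fid:
        data = fid.read()
    read_time = time.perf_counter() - start
    start = time.perf_counter()
    out = pickle.loads(data)
    return out, read_time, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dummy.pkl")
        make_pickle(path)
        print("pickle.load time:", time_load(path)[1])
        _, read_time, loads_time = time_read_plus_loads(path)
        print("file.read time:", read_time)
        print("pickle.loads time:", loads_time)
```

Raising `num_objects` and `payload_size` brings the file closer to the size discussed here, at the cost of disk space and run time.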


#98923

From	Peter Otten <__peter__@web.de>
Date	2015-11-17 15:14 +0100
Message-ID	<mailman.387.1447769670.16136.python-list@python.org>
In reply to	#98922

andrea.gavana@gmail.com wrote:

> Hello List,
> 
>      I am working with relatively humongous binary files (created via
>      cPickle), and I stumbled across some unexpected (for me) performance
>      differences between two approaches I use to load those files:
> 
> 1. Simply use cPickle.load(fid)
> 
> 2. Read the file as binary using file.read() and then use cPickle.loads on
> the resulting output
> 
> In the snippet below, the MakePickle function is a dummy function that
> generates a relatively big binary file with cPickle (WARNING: around 3 GB)
> in the current directory. I am using NumPy arrays to make the file big but
> my original data structure is much more complicated, and things like HDF5
> or databases are currently not an option - I'd like to stay with pickles.
> 
> The ReadPickle function simply uses cPickle.load(fid) on the opened binary
> file, and on my PC it takes about 2.3 seconds (approach 1).
> 
> The ReadPlusLoads function reads the file using file.read() and then use
> cPickle.loads on the resulting output (approach 2). On my PC, the
> file.read() process takes 15 seconds (!!!) and the cPickle.loads only 1.5
> seconds.
> 
> What baffles me is the time it takes to read the file using file.read():
> is there any way to slurp it all in one go (somehow) into a string ready
> for cPickle.loads without that much of an overhead?
> 
> Note that all of this has been done on Windows 7 64bit with Python 2.7
> 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).
> 
> Thank you in advance for all suggestions :-) .
> 
> Andrea.
> 
> if __name__ == '__main__':
>     ReadPickle()
>     ReadPlusLoads()

Do you get roughly the same times when you execute ReadPlusLoads() before
ReadPickle()?


#98924

From	andrea.gavana@gmail.com
Date	2015-11-17 06:20 -0800
Message-ID	<54330891-6568-4469-93ae-7a7825961500@googlegroups.com>
In reply to	#98923

Hi Peter,

On Tuesday, November 17, 2015 at 3:14:57 PM UTC+1, Peter Otten wrote:
> Andrea Gavana wrote:
> 
> > Hello List,
> > 
> >      I am working with relatively humongous binary files (created via
> >      cPickle), and I stumbled across some unexpected (for me) performance
> >      differences between two approaches I use to load those files:
> > 
> > 1. Simply use cPickle.load(fid)
> > 
> > 2. Read the file as binary using file.read() and then use cPickle.loads on
> > the resulting output
> > 
> > In the snippet below, the MakePickle function is a dummy function that
> > generates a relatively big binary file with cPickle (WARNING: around 3 GB)
> > in the current directory. I am using NumPy arrays to make the file big but
> > my original data structure is much more complicated, and things like HDF5
> > or databases are currently not an option - I'd like to stay with pickles.
> > 
> > The ReadPickle function simply uses cPickle.load(fid) on the opened binary
> > file, and on my PC it takes about 2.3 seconds (approach 1).
> > 
> > The ReadPlusLoads function reads the file using file.read() and then use
> > cPickle.loads on the resulting output (approach 2). On my PC, the
> > file.read() process takes 15 seconds (!!!) and the cPickle.loads only 1.5
> > seconds.
> > 
> > What baffles me is the time it takes to read the file using file.read():
> > is there any way to slurp it all in one go (somehow) into a string ready
> > for cPickle.loads without that much of an overhead?
> > 
> > Note that all of this has been done on Windows 7 64bit with Python 2.7
> > 64bit, with 16 cores and 100 GB RAM (so memory should not be a problem).
> > 
> > Thank you in advance for all suggestions :-) .
> > 
> > Andrea.
> > 
> > if __name__ == '__main__':
> >     ReadPickle()
> >     ReadPlusLoads()
> 
> Do you get roughly the same times when you execute ReadPlusLoads() before
> ReadPickle()?


Thank you for your answer. I do get similar timings when I swap the two functions, and specifically still 15 seconds to read the file via file.read() and 2.4 seconds (more or less as before) via cPickle.load(fid).

I thought that the order of operations might be an issue but apparently that was not the case...

Andrea.


#98928

From	Chris Angelico <rosuav@gmail.com>
Date	2015-11-18 02:20 +1100
Message-ID	<mailman.392.1447773612.16136.python-list@python.org>
In reply to	#98924

On Wed, Nov 18, 2015 at 1:20 AM,  <andrea.gavana@gmail.com> wrote:
> Thank you for your answer. I do get similar timings when I swap the two functions, and specifically still 15 seconds to read the file via file.read() and 2.4 seconds (more or less as before) via cPickle.load(fid).
>
> I thought that the order of operations might be an issue but apparently that was not the case...

What if you call one of them twice and then the other? Just trying to
rule out any possibility that it's a caching problem.

On my Linux box, running 2.7.9 64-bit, the two operations take roughly
the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to
loads). Are you able to run this off a RAM disk or something?

Most curious.

ChrisA
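
The "call one twice, then the other" check can be scripted so that every ordering is timed the same way. A minimal sketch, assuming Python 3; the helper name `time_in_order` is hypothetical, and the callables passed in would be the two reader functions from the original post.

```python
import time


def time_in_order(funcs, order):
    """Call funcs[name]() for each name in `order`; return (name, seconds) pairs."""
    results = []
    for name in order:
        start = time.perf_counter()
        funcs[name]()
        results.append((name, time.perf_counter() - start))
    return results
```

For example, `time_in_order({"load": ReadPickle, "read": ReadPlusLoads}, ["load", "load", "read"])` times the first approach twice before the second. If the second "load" is much faster than the first, OS file caching is a likely factor.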


#98930

From	andrea.gavana@gmail.com
Date	2015-11-17 07:31 -0800
Message-ID	<420ec4e9-6af6-49bd-a9f4-8b47ef1f136e@googlegroups.com>
In reply to	#98928

Hi Chris,

On Tuesday, November 17, 2015 at 4:20:34 PM UTC+1, Chris Angelico wrote:
> On Wed, Nov 18, 2015 at 1:20 AM,  Andrea Gavana wrote:
> > Thank you for your answer. I do get similar timings when I swap the two functions, and specifically still 15 seconds to read the file via file.read() and 2.4 seconds (more or less as before) via cPickle.load(fid).
> >
> > I thought that the order of operations might be an issue but apparently that was not the case...
> 
> What if you call one of them twice and then the other? Just trying to
> rule out any possibility that it's a caching problem.
> 
> On my Linux box, running 2.7.9 64-bit, the two operations take roughly
> the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to
> loads). Are you able to run this off a RAM disk or something?
> 
> Most curious.

Thank you for taking the time to run my little script. I have now run it with multiple combinations of calls (twice the first then the other, then vice versa, then alternating between the two functions multiple times, then three times the second and once the first, ...) with no luck at all.

The file.read() line of code always takes at least 14 seconds (in all the trials I have done), while the cPickle.load call ranges between 2.3 and 2.5 seconds.

I am puzzled to no end... Might there be something funny with my C libraries that use fread? I'm just shooting in the dark. I have a standard Python installation on Windows, nothing fancy :-(

Andrea.
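
One way to probe the fread hunch is to bypass the buffered layer altogether. The sketch below was not tried in this thread and is an assumption throughout: it targets Python 3, opens the file with `buffering=0` so that `open` returns a raw `io.FileIO`, and fills a preallocated `bytearray` through `readinto`. The name `read_preallocated` is hypothetical.

```python
import os


def read_preallocated(path):
    """Read a whole file into a bytearray sized up front, without buffered I/O."""
    size = os.path.getsize(path)
    buf = bytearray(size)
    view = memoryview(buf)
    filled = 0
    with open(path, "rb", buffering=0) as raw:  # raw io.FileIO, no buffer layer
        while filled < size:
            # A single call may return fewer bytes than requested, so loop.
            n = raw.readinto(view[filled:])
            if not n:  # EOF earlier than expected (for example, the file shrank)
                break
            filled += n
    view.release()
    return buf if filled == size else buf[:filled]
```

In Python 3, `pickle.loads` accepts bytes-like objects, so the returned `bytearray` can be passed to it directly. Whether this is any faster than `file.read()` on the Windows machine described here is an open question.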


#98932

From	Peter Otten <__peter__@web.de>
Date	2015-11-17 16:57 +0100
Message-ID	<mailman.395.1447775861.16136.python-list@python.org>
In reply to	#98930

andrea.gavana@gmail.com wrote:

> Hi Chris,
> 
> On Tuesday, November 17, 2015 at 4:20:34 PM UTC+1, Chris Angelico wrote:
>> On Wed, Nov 18, 2015 at 1:20 AM,  Andrea Gavana wrote:
>> > Thank you for your answer. I do get similar timings when I swap the two
>> > functions, and specifically still 15 seconds to read the file via
>> > file.read() and 2.4 seconds (more or less as before) via
>> > cPickle.load(fid).
>> >
>> > I thought that the order of operations might be an issue but apparently
>> > that was not the case...
>> 
>> What if you call one of them twice and then the other? Just trying to
>> rule out any possibility that it's a caching problem.
>> 
>> On my Linux box, running 2.7.9 64-bit, the two operations take roughly
>> the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to
>> loads). Are you able to run this off a RAM disk or something?
>> 
>> Most curious.
> 
> 
> Thank you for taking the time to run my little script. I have now run it
> with multiple combinations of calls (twice the first then the other, then
> viceversa, then alternate between the two functions multiple times, then
> three times the second and once the first, ...) with no luck at all.
> 
> The file.read() line of code takes always at minimum 14 seconds (in all
> the trials I have done), while the cPickle.load call ranges between 2.3
> and 2.5 seconds.
> 
> I am puzzled with no end... Might there be something funny with my C
> libraries that use fread? I'm just shooting in the dark. I have a standard
> Python installation on Windows, nothing fancy :-(

Perhaps there is a size threshold? You could experiment with different block 
sizes in the following f.read() replacement:

import functools

def read_chunked(f, size=2**20):
    # Call f.read(size) repeatedly until it returns "" (EOF), then join the chunks.
    read = functools.partial(f.read, size)
    return "".join(iter(read, ""))
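
For anyone running this on Python 3, where a file opened in binary mode returns `bytes`, both the sentinel and the join separator have to be `b""`. With `""` the `iter` call never sees its sentinel at EOF and loops forever. A minimal sketch under that assumption (the name `read_chunked_bytes` is hypothetical):

```python
import functools


def read_chunked_bytes(f, size=2**20):
    """Read a binary file object to EOF in `size`-byte chunks and join them."""
    read = functools.partial(f.read, size)
    # iter(callable, sentinel) keeps calling read() until it returns b"".
    return b"".join(iter(read, b""))
```

Usage is the same as for the function above: open the file with mode `"rb"` and pass the file object in.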


#98936

From	andrea.gavana@gmail.com
Date	2015-11-17 08:31 -0800
Message-ID	<f9a98231-e6df-4557-8ca1-20d9644825ca@googlegroups.com>
In reply to	#98932

Hi Peter,

On Tuesday, November 17, 2015 at 4:57:57 PM UTC+1, Peter Otten wrote:
> Andrea Gavana wrote:
> 
> > Hi Chris,
> > 
> > On Tuesday, November 17, 2015 at 4:20:34 PM UTC+1, Chris Angelico wrote:
> >> On Wed, Nov 18, 2015 at 1:20 AM,  Andrea Gavana wrote:
> >> > Thank you for your answer. I do get similar timings when I swap the two
> >> > functions, and specifically still 15 seconds to read the file via
> >> > file.read() and 2.4 seconds (more or less as before) via
> >> > cPickle.load(fid).
> >> >
> >> > I thought that the order of operations might be an issue but apparently
> >> > that was not the case...
> >> 
> >> What if you call one of them twice and then the other? Just trying to
> >> rule out any possibility that it's a caching problem.
> >> 
> >> On my Linux box, running 2.7.9 64-bit, the two operations take roughly
> >> the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to
> >> loads). Are you able to run this off a RAM disk or something?
> >> 
> >> Most curious.
> > 
> > 
> > Thank you for taking the time to run my little script. I have now run it
> > with multiple combinations of calls (twice the first then the other, then
> > viceversa, then alternate between the two functions multiple times, then
> > three times the second and once the first, ...) with no luck at all.
> > 
> > The file.read() line of code takes always at minimum 14 seconds (in all
> > the trials I have done), while the cPickle.load call ranges between 2.3
> > and 2.5 seconds.
> > 
> > I am puzzled with no end... Might there be something funny with my C
> > libraries that use fread? I'm just shooting in the dark. I have a standard
> > Python installation on Windows, nothing fancy :-(
> 
> Perhaps there is a size threshold? You could experiment with different block 
> sizes in the following f.read() replacement:
> 
> def read_chunked(f, size=2**20):
>     read = functools.partial(f.read, size)
>     return "".join(iter(read, ""))


Thank you for the suggestion. I have used the read_chunked function in my experiments now and I can report a nice improvement. I have tried various chunk sizes, from 2**10 to 2**31-1, and in general the optimum lies around size=2**22, although it is essentially flat from 2**20 up to 2**30, with some interesting spikes at 45 seconds for 2**14 and 2**15 (see table below).

Using your suggestion, I got it down to 3.4 seconds (on average). Still slower than cPickle.load (2.3 to 2.5 seconds), but better.

What I find most puzzling is that a pure file.read() (or your read_chunked variation) should normally be much faster than a cPickle.load (which does so many more things than just reading a file), shouldn't it?


Timing table:

Size (power of 2)	Read Time (seconds)
10	9.14
11	8.59
12	7.67
13	5.70
14	46.06
15	45.00
16	24.80
17	14.23
18	8.95
19	5.58
20	3.41
21	3.39
22	3.34
23	3.39
24	3.39
25	3.42
26	3.43
27	3.44
28	3.48
29	3.59
30	3.72
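
A table like the one above can be produced with a small sweep over chunk sizes. This is a sketch, not the script that generated these numbers: it assumes Python 3, redefines a `bytes`-based chunked reader so the block stands alone, and uses the hypothetical name `sweep_chunk_sizes`.

```python
import functools
import time


def read_chunked_bytes(f, size):
    """Read a binary file object to EOF in `size`-byte chunks and join them."""
    read = functools.partial(f.read, size)
    return b"".join(iter(read, b""))


def sweep_chunk_sizes(path, exponents=range(10, 31)):
    """Return [(exponent, seconds)] for reading `path` in 2**exponent-byte chunks."""
    timings = []
    for exponent in exponents:
        start = time.perf_counter()
        with open(path, "rb") as fid:
            read_chunked_bytes(fid, 2 ** exponent)
        timings.append((exponent, time.perf_counter() - start))
    return timings
```

Each entry is a single measurement, so repeating the sweep a few times helps separate real effects (such as the spikes at 2**14 and 2**15) from noise and OS caching.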


#98938

From	Peter Otten <__peter__@web.de>
Date	2015-11-17 18:20 +0100
Message-ID	<mailman.399.1447780855.16136.python-list@python.org>
In reply to	#98936

andrea.gavana@gmail.com wrote:

>> > I am puzzled with no end... Might there be something funny with my C
>> > libraries that use fread? I'm just shooting in the dark. I have a
>> > standard Python installation on Windows, nothing fancy :-(
>> 
>> Perhaps there is a size threshold? You could experiment with different
>> block sizes in the following f.read() replacement:
>> 
>> def read_chunked(f, size=2**20):
>>     read = functools.partial(f.read, size)
>>     return "".join(iter(read, ""))
> 
> 
> Thank you for the suggestion. I have used the read_chunked function in my
> experiments now and I can report a nice improvements - I have tried
> various chunk sizes, from 2**10 to 2**31-1, and in general the optimum
> lies around size=2**22, although it is essentially flat from 2**20 up to
> 2**30 - with some interesting spikes at 45 seconds for 2**14 and 2**15
> (see table below).
> 
> Using your suggestion, I got it down to 3.4 seconds (on average). Still at
> least twice slower than cPickle.load, but better.
> 
> What I find most puzzling is that a pure file.read() (or your read_chunked
> variation) should normally be much faster than a cPickle.load (which does
> so many more things than just reading a file), shouldn't it?

That would have been my expectation, too. 

I had a quick look into the fileobject.c source and didn't see anything that 
struck me as suspicious.

I think you should file a bug report so that an expert can check if there is 
an underlying problem in Python or if it is a matter of the OS. 

> Timing table:
> 
> Size (power of 2)	Read Time (seconds)
> 10	9.14
> 11	8.59
> 12	7.67
> 13	5.70
> 14	46.06
> 15	45.00
> 16	24.80
> 17	14.23
> 18	8.95
> 19	5.58
> 20	3.41
> 21	3.39
> 22	3.34
> 23	3.39
> 24	3.39
> 25	3.42
> 26	3.43
> 27	3.44
> 28	3.48
> 29	3.59
> 30	3.72


#98952

From	Nagy László Zsolt <gandalf@shopzeus.com>
Date	2015-11-18 10:00 +0100
Message-ID	<mailman.406.1447837229.16136.python-list@python.org>
In reply to	#98930

> Perhaps there is a size threshold? You could experiment with different block 
> sizes in the following f.read() replacement:
>
> def read_chunked(f, size=2**20):
>     read = functools.partial(f.read, size)
>     return "".join(iter(read, ""))
>
On the win32 platform, in my experience the fastest way to read a
binary file from disk is the mmap module. You should try that too.


#98956

From	andrea.gavana@gmail.com
Date	2015-11-18 02:31 -0800
Message-ID	<f2a42caa-2cde-455b-83bb-3beef8e4c52f@googlegroups.com>
In reply to	#98952

Hi,

On Wednesday, November 18, 2015 at 10:00:43 AM UTC+1, Nagy László Zsolt wrote:
> > Perhaps there is a size threshold? You could experiment with different block 
> > sizes in the following f.read() replacement:
> >
> > def read_chunked(f, size=2**20):
> >     read = functools.partial(f.read, size)
> >     return "".join(iter(read, ""))
> >
> Under win32 platform, my experience is that the fastest way to read
> binary file from disk is the mmap module. You should try that too.

Thank you for your suggestion. I have tried that now, and with my naive approach I have done this:

    start = time.time()
    fid = open(filename, 'r+b')
    strs = mmap.mmap(fid.fileno(), 0, access=mmap.ACCESS_READ)[:]
    end = time.time()
    print 'mmap.read time:', end-start

And it takes about 2.7 seconds. Not a bad improvement :-) . Unfortunately, when the file is on a network drive, all the other approaches ran at around 25-30 seconds loading time, while the mmap one clocks at 110 seconds :-(

Andrea.
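
The mmap approach above can be wrapped so that the file and the mapping are both closed once the data has been copied out. A minimal sketch, assuming Python 3 (`pickle` rather than `cPickle`, and `mmap` objects usable as context managers); the name `load_via_mmap` is hypothetical. Opening with mode `"rb"` is enough here because the mapping is requested with `ACCESS_READ`.

```python
import mmap
import pickle


def load_via_mmap(path):
    """Map the file read-only, copy it into a bytes object, then unpickle."""
    with open(path, "rb") as fid:
        # Length 0 maps the whole file; this raises ValueError for an empty file.
        with mmap.mmap(fid.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            data = mm[:]  # full in-memory copy, as in the snippet above
    return pickle.loads(data)
```

The slice `mm[:]` still copies the whole file into memory, so this does not reduce peak memory use; it only changes how the bytes get there.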
