Re: cPickle.load vs. file.read+cPickle.loads on large binary files

Newsgroups comp.lang.python
Date Tue, 17 Nov 2015 08:31:41 -0800 (PST)
Message-ID <f9a98231-e6df-4557-8ca1-20d9644825ca@googlegroups.com>
Subject Re: cPickle.load vs. file.read+cPickle.loads on large binary files
From andrea.gavana@gmail.com



Hi Peter,

On Tuesday, November 17, 2015 at 4:57:57 PM UTC+1, Peter Otten wrote:
> Andrea Gavana wrote:
> 
> > Hi Chris,
> > 
> > On Tuesday, November 17, 2015 at 4:20:34 PM UTC+1, Chris Angelico wrote:
> >> On Wed, Nov 18, 2015 at 1:20 AM,  Andrea Gavana wrote:
> >> > Thank you for your answer. I do get similar timings when I swap the two
> >> > functions, and specifically still 15 seconds to read the file via
> >> > file.read() and 2.4 seconds (more or less as before) via
> >> > cPickle.load(fid).
> >> >
> >> > I thought that the order of operations might be an issue but apparently
> >> > that was not the case...
> >> 
> >> What if you call one of them twice and then the other? Just trying to
> >> rule out any possibility that it's a caching problem.
> >> 
> >> On my Linux box, running 2.7.9 64-bit, the two operations take roughly
> >> the same amount of time (1.8 seconds for load vs 1s to read and 0.8 to
> >> loads). Are you able to run this off a RAM disk or something?
> >> 
> >> Most curious.
> > 
> > 
> > Thank you for taking the time to run my little script. I have now run it
> > with multiple combinations of calls (twice the first then the other, then
> > viceversa, then alternate between the two functions multiple times, then
> > three times the second and once the first, ...) with no luck at all.
> > 
> > The file.read() line of code takes always at minimum 14 seconds (in all
> > the trials I have done), while the cPickle.load call ranges between 2.3
> > and 2.5 seconds.
> > 
> > I am puzzled with no end... Might there be something funny with my C
> > libraries that use fread? I'm just shooting in the dark. I have a standard
> > Python installation on Windows, nothing fancy :-(
> 
> Perhaps there is a size threshold? You could experiment with different block 
> sizes in the following f.read() replacement:
> 
> import functools
> 
> def read_chunked(f, size=2**20):
>     # Call f.read(size) repeatedly until it returns "" (end of file).
>     read = functools.partial(f.read, size)
>     return "".join(iter(read, ""))


Thank you for the suggestion. I have now used the read_chunked function in my experiments and can report a nice improvement. I tried various chunk sizes, from 2**10 to 2**31-1, and in general the optimum lies around size=2**22, although the timing is essentially flat from 2**20 up to 2**30. There are some interesting spikes at about 45 seconds for 2**14 and 2**15 (see table below).

Using your suggestion, I got it down to 3.4 seconds (on average). That is still slower than cPickle.load (2.3 to 2.5 seconds), but better.

What I find most puzzling is that a pure file.read() (or your read_chunked variation) should normally be much faster than cPickle.load (which does so much more than just reading a file), shouldn't it?
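
For anyone who wants to repeat the comparison, here is a minimal sketch of the two loading strategies side by side. It is not the script used for the timings in this thread: it is written for Python 3, where the pickle module takes the place of cPickle, and the helper names (load_direct, load_via_read, timed, make_sample_pickle) as well as the small sample file are illustrative assumptions.

```python
import os
import pickle
import tempfile
import time


def load_direct(path):
    # Let pickle pull its data from the open file object.
    with open(path, "rb") as f:
        return pickle.load(f)


def load_via_read(path):
    # Read the whole file into memory first, then unpickle the bytes.
    with open(path, "rb") as f:
        data = f.read()
    return pickle.loads(data)


def timed(func, path):
    # Return (result, elapsed seconds) for a single call of func(path).
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start


def make_sample_pickle(n_items=100000):
    # Hypothetical stand-in for the large binary file discussed here.
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(list(range(n_items)), f, protocol=pickle.HIGHEST_PROTOCOL)
    return path


if __name__ == "__main__":
    path = make_sample_pickle()
    try:
        for func in (load_direct, load_via_read):
            _, elapsed = timed(func, path)
            print("%-14s %.4f s" % (func.__name__, elapsed))
    finally:
        os.remove(path)
```

A single run per function is noisy, so repeating each call a few times and alternating the order (as was done in the earlier messages) gives a fairer picture.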


Timing table:

Chunk size exponent n (size=2**n)	Read time (seconds)
10	9.14
11	8.59
12	7.67
13	5.70
14	46.06
15	45.00
16	24.80
17	14.23
18	8.95
19	5.58
20	3.41
21	3.39
22	3.34
23	3.39
24	3.39
25	3.42
26	3.43
27	3.44
28	3.48
29	3.59
30	3.72
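
A sweep like the one in the table could be set up roughly as follows. This is only a sketch under stated assumptions: it targets Python 3, so read_chunked uses the bytes sentinel b"" instead of "", and the sweep function and the 8 MiB temporary file are hypothetical stand-ins for the real script and data file.

```python
import functools
import os
import tempfile
import time


def read_chunked(f, size=2**20):
    # Python 3 variant of the function quoted above: binary reads return
    # bytes, so the end-of-file sentinel is b"" rather than "".
    read = functools.partial(f.read, size)
    return b"".join(iter(read, b""))


def sweep(path, exponents):
    # Return a list of (n, seconds) pairs, one per chunk size 2**n.
    results = []
    for n in exponents:
        with open(path, "rb") as f:
            start = time.perf_counter()
            read_chunked(f, 2**n)
            results.append((n, time.perf_counter() - start))
    return results


if __name__ == "__main__":
    # Hypothetical 8 MiB test file; substitute the real file to reproduce
    # measurements at a meaningful scale.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(8 * 2**20))
    try:
        for n, seconds in sweep(path, range(10, 24)):
            print("%d\t%.4f" % (n, seconds))
    finally:
        os.remove(path)
```

Reopening the file for every chunk size keeps each measurement starting from offset zero, although the operating system's file cache will still favour the later runs.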
