
How pickle helps in reading huge files?

Started by	Harsh Jha <harshjha2006@gmail.com>
First post	2013-10-15 23:55 -0700
Last post	2013-10-16 23:09 +0200
Articles	9 — 9 participants


  How pickle helps in reading huge files? Harsh Jha <harshjha2006@gmail.com> - 2013-10-15 23:55 -0700
    Re: How pickle helps in reading huge files? Stephane Wirtel <stephane@wirtel.be> - 2013-10-16 09:05 +0200
      Re: How pickle helps in reading huge files? rusi <rustompmody@gmail.com> - 2013-10-16 01:51 -0700
        Re: How pickle helps in reading huge files? Chris Angelico <rosuav@gmail.com> - 2013-10-16 20:09 +1100
    Re: How pickle helps in reading huge files? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-16 08:39 +0100
    Re: How pickle helps in reading huge files? Roy Smith <roy@panix.com> - 2013-10-16 08:29 -0400
    Re: How pickle helps in reading huge files? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-10-16 13:32 -0400
    Re: How pickle helps in reading huge files? Peter Cacioppi <peter.cacioppi@gmail.com> - 2013-10-16 14:04 -0700
      Re: How pickle helps in reading huge files? Irmen de Jong <irmen.NOSPAM@xs4all.nl> - 2013-10-16 23:09 +0200

#56871 — How pickle helps in reading huge files?

From	Harsh Jha <harshjha2006@gmail.com>
Date	2013-10-15 23:55 -0700
Subject	How pickle helps in reading huge files?
Message-ID	<0044bfd0-f07f-4f7b-b976-5df034b6fec6@googlegroups.com>

I have a huge CSV file that I want to read from again and again. Is it useful to pickle it, keep the pickle, and then unpickle it whenever I need the data? Is that faster than simply opening the file again and again? Please explain why.

Thank you.


#56872

From	Stephane Wirtel <stephane@wirtel.be>
Date	2013-10-16 09:05 +0200
Message-ID	<mailman.1107.1381907510.18130.python-list@python.org>
In reply to	#56871

Keep it in memory 

> On 16 oct. 2013, at 08:55 AM, Harsh Jha <harshjha2006@gmail.com> wrote:
> 
> I've a huge csv file and I want to read stuff from it again and again. Is it useful to pickle it and keep and then unpickle it whenever I need to use that data? Is it faster that accessing that file simply by opening it again and again? Please explain, why?
> 
> Thank you.


#56874

From	rusi <rustompmody@gmail.com>
Date	2013-10-16 01:51 -0700
Message-ID	<81e53ed7-cc3e-437d-966d-9c1d79dc8c9f@googlegroups.com>
In reply to	#56872

On Wednesday, October 16, 2013 12:35:42 PM UTC+5:30, Stéphane Wirtel wrote:
> Keep it in memory 

That's a strange answer given that the OP says his file is huge.
Of course, 'huge' may not really be huge; that depends on the hardware he's using.


#56875

From	Chris Angelico <rosuav@gmail.com>
Date	2013-10-16 20:09 +1100
Message-ID	<mailman.1109.1381914565.18130.python-list@python.org>
In reply to	#56874

On Wed, Oct 16, 2013 at 7:51 PM, rusi <rustompmody@gmail.com> wrote:
> On Wednesday, October 16, 2013 12:35:42 PM UTC+5:30, Stéphane Wirtel wrote:
>> Keep it in memory
>
> Thats a strange answer given that the OP says his file is huge.
> Of course 'huge' may not really be huge -- that really depends on the h/w he's using.

Most people's idea of a big file is one that has a few thousand lines
in it. That may be pretty huge in terms of manual work, but it'd fit
inside memory easily enough. And even if it really is bigger than
memory, chances are you can use your page file and still keep it in
"memory" - and that's generally the easiest, if perhaps not the most
efficient, solution.

ChrisA


#56873

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-16 08:39 +0100
Message-ID	<mailman.1108.1381909219.18130.python-list@python.org>
In reply to	#56871

On 16/10/2013 07:55, Harsh Jha wrote:
> I've a huge csv file and I want to read stuff from it again and again. Is it useful to pickle it and keep and then unpickle it whenever I need to use that data? Is it faster that accessing that file simply by opening it again and again? Please explain, why?
>
> Thank you.
>

What's your definition of huge?  Maybe it would be effective to pickle 
and unpickle but until you try it, perhaps with a relatively small data 
sample, how can you know?  Why can't you leave the file open and keep 
iterating over the contents?
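
A minimal sketch of the "leave the file open" idea, assuming Python 3 and a CSV whose first column is numeric. The function name column_sums and the column layout are illustrative assumptions, not anything from the original post. It rewinds one open handle with seek(0) instead of reopening the file for each pass:

```python
import csv


def column_sums(path, passes=2):
    """Sum the first column once per pass, reusing a single open file handle."""
    totals = []
    with open(path, newline="") as f:
        for _ in range(passes):
            f.seek(0)  # rewind to the start instead of reopening the file
            # A fresh csv.reader per pass; blank rows are skipped.
            totals.append(sum(float(row[0]) for row in csv.reader(f) if row))
    return totals
```

Note that this still re-parses every row on each pass; it only saves the cost of reopening the file.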

-- 
Roses are red,
Violets are blue,
Most poems rhyme,
But this one doesn't.

Mark Lawrence


#56882

From	Roy Smith <roy@panix.com>
Date	2013-10-16 08:29 -0400
Message-ID	<roy-0A1DFD.08294616102013@news.panix.com>
In reply to	#56871

In article <0044bfd0-f07f-4f7b-b976-5df034b6fec6@googlegroups.com>,
 Harsh Jha <harshjha2006@gmail.com> wrote:

> I've a huge csv file and I want to read stuff from it again and again. Is it 
> useful to pickle it and keep and then unpickle it whenever I need to use that 
> data? Is it faster that accessing that file simply by opening it again and 
> again? Please explain, why?
> 
> Thank you.

It can be.  I did a project a bunch of years ago which involved reading 
(and parsing) SNMP MIBs before you could do any work.  Startup took 
something like 10-20 seconds.  If I pre-parsed the MIBs and wrote out 
the data structures as pickles, I could cut startup time to a couple of 
seconds.

But, that's because the parsing I was doing was pretty complicated.  
Parsing a CSV file is much easier, so I wouldn't expect you to have much 
improvement reading a pickle file vs. reading the original CSV.

The bottom line is, you should try it.  Pickling a data structure is 
about one line of code (not counting the 'import cPickle').  Try it and 
see what happens.  Time how long it takes to read the original file, and 
how long it takes to read the pickle.  Let us know your results.

Also, let us know what "huge" means.  1000 rows?  A million?  100 
million?
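
A rough sketch of the experiment suggested above, assuming Python 3 (where cPickle was folded into the pickle module). The helper names and the file paths passed in are hypothetical. It parses the CSV once, writes the parsed rows as a pickle, and times reading each form back:

```python
import csv
import pickle
import time


def load_csv(path):
    """Parse the CSV into a list of rows (each row is a list of strings)."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def save_pickle(rows, path):
    """Write the already-parsed rows out as a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Read the rows back from the pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def compare(csv_path, pickle_path):
    """Time one CSV parse against one unpickle of the same data."""
    rows, csv_seconds = timed(load_csv, csv_path)
    save_pickle(rows, pickle_path)
    restored, pickle_seconds = timed(load_pickle, pickle_path)
    return {
        "csv_seconds": csv_seconds,
        "pickle_seconds": pickle_seconds,
        "rows_match": rows == restored,
    }
```

A single run is noisy, and the operating system's file cache will favour whichever file was touched most recently, so repeat the measurement a few times before drawing conclusions. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.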


#56887

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2013-10-16 13:32 -0400
Message-ID	<mailman.1115.1381944762.18130.python-list@python.org>
In reply to	#56871

On Tue, 15 Oct 2013 23:55:26 -0700 (PDT), Harsh Jha
<harshjha2006@gmail.com> declaimed the following:

>I've a huge csv file and I want to read stuff from it again and again. Is it useful to pickle it and keep and then unpickle it whenever I need to use that data? Is it faster that accessing that file simply by opening it again and again? Please explain, why?
>
	As others mention, what is "huge"?

	Does it get updated often? How extensive are updates?

	I suspect I'd use the CSV module to parse it into an SQLite3 database,
then use the database for the repetitive access. NOTE: I've never used
pickle -- but for stuff that is coming in as simple CSV I'd suspect the
parsing (even including the various int()/float() wrapping of numeric
fields) can't be much slower than the object creation/unwrapping used by
pickle; SQLite3 should let you leave the data in numeric formats without
the translation penalty on each use.
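
	A sketch of that approach, assuming Python 3 and a hypothetical three-column file (name, count, price) with a header row. The table name, column names, and function names are invented for illustration. The numeric fields are converted once at load time and then stored as INTEGER and REAL:

```python
import csv
import sqlite3


def csv_to_sqlite(csv_path, db_path):
    """Load a (name, count, price) CSV into an SQLite table and return the connection.

    Calling this twice on the same database appends the rows a second time.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (name TEXT, count INTEGER, price REAL)"
    )
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to be present)
        conn.executemany(
            "INSERT INTO items VALUES (?, ?, ?)",
            ((name, int(count), float(price)) for name, count, price in reader),
        )
    conn.commit()
    return conn


def total_value(conn):
    """Repeated access goes through SQL; no per-use int()/float() conversion."""
    return conn.execute("SELECT SUM(count * price) FROM items").fetchone()[0]
```

	Passing ":memory:" as db_path keeps the database in RAM; a file path keeps it on disk between runs.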

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/


#56897

From	Peter Cacioppi <peter.cacioppi@gmail.com>
Date	2013-10-16 14:04 -0700
Message-ID	<7e49229c-4dc7-43f3-8785-b72c1ef30018@googlegroups.com>
In reply to	#56871

On Tuesday, October 15, 2013 11:55:26 PM UTC-7, Harsh Jha wrote:
> I've a huge csv file and I want to read stuff from it again and again. Is it useful to pickle it and keep and then unpickle it whenever I need to use that data? Is it faster that accessing that file simply by opening it again and again? Please explain, why?
>
> Thank you.

I'm surprised no one else mentioned a fairly typical pattern for this sort of situation: the compromise between "read from disk" and "read from memory" is "implement a cache".

I've had lots of good experiences hand rolling simple caches, especially if there is an application specific access pattern.

Python has nice implementations of things like tuple and dictionary which make caching fairly easy compared to other languages.
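
A minimal hand-rolled cache along these lines, assuming Python 3 and lookups keyed on the first CSV column. The class name RowCache and the access pattern are illustrative assumptions. Rows already seen are served from a dictionary; only misses scan the file:

```python
import csv


class RowCache:
    """Look up CSV rows by the value in their first column, caching results in a dict."""

    def __init__(self, csv_path):
        self.csv_path = csv_path
        self._cache = {}
        self.disk_reads = 0  # number of times the file had to be scanned

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        row = self._scan_file(key)
        self._cache[key] = row  # None is cached too, so repeated misses stay cheap
        return row

    def _scan_file(self, key):
        self.disk_reads += 1
        with open(self.csv_path, newline="") as f:
            for row in csv.reader(f):
                if row and row[0] == key:
                    return tuple(row)
        return None
```

This cache grows without bound; if the set of keys is large, some eviction policy (or functools.lru_cache around a lookup function) would be needed.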


#56898

From	Irmen de Jong <irmen.NOSPAM@xs4all.nl>
Date	2013-10-16 23:09 +0200
Message-ID	<525f008a$0$15895$e4fe514c@news.xs4all.nl>
In reply to	#56897

On 16-10-2013 23:04, Peter Cacioppi wrote:
> On Tuesday, October 15, 2013 11:55:26 PM UTC-7, Harsh Jha wrote:
>> I've a huge csv file and I want to read stuff from it again and again. Is it useful
>> to pickle it and keep and then unpickle it whenever I need to use that data? Is it
>> faster that accessing that file simply by opening it again and again? Please
>> explain, why?
>>
>> Thank you.
> 
> Surprising no-one else mentioned a fairly typical pattern for this sort of situation
> - the compromise between "read from disk" and "read from memory" is "implement a
> cache".

...or: use memory mapped I/O. Just let the OS deal with the 'caching' of memory pages.
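
A small sketch of the memory-mapped approach, assuming Python 3 and a non-empty file (mapping an empty file raises an error on some platforms). The function name and the search-by-substring use case are assumptions made for illustration. The file is mapped read-only and searched as if it were one big bytes object, while the OS decides which pages stay in RAM:

```python
import mmap


def first_line_containing(path, needle):
    """Return the first line (as bytes, without the newline) containing needle, or None."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            if pos == -1:
                return None
            # rfind returns -1 when there is no earlier newline, so start becomes 0.
            start = mm.rfind(b"\n", 0, pos) + 1
            end = mm.find(b"\n", pos)
            if end == -1:
                end = len(mm)  # last line has no trailing newline
            return mm[start:end]
```

The needle is matched against raw bytes, so this does not understand CSV quoting; it only shows the OS-managed paging being used in place of an explicit cache.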

Irmen
