Groups > comp.lang.python > #31933 > unrolled thread

Fast forward-backward (write-read)

Started by: Virgil Stokes <vs@it.uu.se>
First post: 2012-10-23 16:31 +0200
Last post: 2012-10-25 11:00 +0200
Articles: 8 (4 participants)


Contents

  Fast forward-backward (write-read) Virgil Stokes <vs@it.uu.se> - 2012-10-23 16:31 +0200
    Re: Fast forward-backward (write-read) Paul Rubin <no.email@nospam.invalid> - 2012-10-23 09:17 -0700
      Re: Fast forward-backward (write-read) Paul Rubin <no.email@nospam.invalid> - 2012-10-23 09:22 -0700
      Re: Fast forward-backward (write-read) Tim Chase <python.list@tim.thechases.com> - 2012-10-23 11:53 -0500
        Re: Fast forward-backward (write-read) Paul Rubin <no.email@nospam.invalid> - 2012-10-23 09:58 -0700
      Re: Fast forward-backward (write-read) Virgil Stokes <vs@it.uu.se> - 2012-10-23 19:06 +0200
    Re: Fast forward-backward (write-read) rusi <rustompmody@gmail.com> - 2012-10-24 08:11 -0700
      Re: Fast forward-backward (write-read) Virgil Stokes <vs@it.uu.se> - 2012-10-25 11:00 +0200

#31933 — Fast forward-backward (write-read)

From: Virgil Stokes <vs@it.uu.se>
Date: 2012-10-23 16:31 +0200
Subject: Fast forward-backward (write-read)
Message-ID: <mailman.2667.1351003942.27098.python-list@python.org>

I am working with some rather large data files (>100GB) that contain time series 
data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform 
various types of processing on these data (e.g. moving median, moving average, 
and Kalman-filter, Kalman-smoother) in a sequential manner and only a small 
number of these data need be stored in RAM when being processed. When performing 
Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an 
external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These 
are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0). 
Thus, I will need to input these variables saved to an external file from the 
forward pass, in reverse order --- from last written to first written.

Finally, to my question --- What is a fast way to write these variables to an 
external file and then read them in backwards?


#31937

From: Paul Rubin <no.email@nospam.invalid>
Date: 2012-10-23 09:17 -0700
Message-ID: <7xehkpxiw0.fsf@ruckus.brouhaha.com>
In reply to: #31933

Virgil Stokes <vs@it.uu.se> writes:
> Finally, to my question --- What is a fast way to write these
> variables to an external file and then read them in backwards?

Seeking backwards in files works, but the performance hit is
significant.  There is also a performance hit to scanning pointers
backwards in memory, due to cache misses.  If it's something
you're just running a few times, seeking backwards is the simplest
approach.  If you're really trying to optimize the thing, you might
buffer up large chunks (like 1 MB) before writing.  If you're writing
once and reading multiple times, you might reverse the order of records
within the chunks during the writing phase.
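
A minimal sketch of the chunked approach described above, using only the standard library. The record layout (11 float64 values per time step), the chunk size, and the function names `write_forward` and `read_backward` are assumptions for illustration; they are not taken from the thread. Records are written forward through a 1 MiB buffer, then read back one chunk at a time, starting from the end of the file.

```python
import os
import struct
import tempfile

# Hypothetical record layout: 11 little-endian float64 values per time step.
RECORD = struct.Struct("<11d")
CHUNK_RECORDS = 4096  # records per read; an assumed value, tune for your disk


def write_forward(path, records):
    """Write fixed-size binary records in forward (time) order."""
    with open(path, "wb", buffering=1 << 20) as f:  # 1 MiB write buffer
        for rec in records:
            f.write(RECORD.pack(*rec))


def read_backward(path, chunk_records=CHUNK_RECORDS):
    """Yield records from last written to first, one chunk-sized read at a time."""
    size = os.path.getsize(path)
    if size % RECORD.size:
        raise ValueError("file size is not a multiple of the record size")
    remaining = size // RECORD.size
    with open(path, "rb") as f:
        while remaining > 0:
            n = min(chunk_records, remaining)
            remaining -= n
            f.seek(remaining * RECORD.size)   # start of the next chunk from the end
            block = f.read(n * RECORD.size)
            # Walk the chunk from its last record to its first.
            for i in range(n - 1, -1, -1):
                yield RECORD.unpack_from(block, i * RECORD.size)


# Small demonstration with 10 records and a deliberately tiny chunk size.
demo_path = os.path.join(tempfile.mkdtemp(), "forward_pass.bin")
demo = [tuple(float(k) for _ in range(11)) for k in range(10)]
write_forward(demo_path, demo)
backward = list(read_backward(demo_path, chunk_records=3))
```

Because every record has the same size, the offset of record k is simply k times the record size, so no index is needed. Each seek is followed by one large sequential read, which keeps the number of backward seeks small.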

You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.

Of course (it should go without saying), you want to dump in a
binary format rather than converting to decimal.


#31938

From: Paul Rubin <no.email@nospam.invalid>
Date: 2012-10-23 09:22 -0700
Message-ID: <7xmwzdkvk2.fsf@ruckus.brouhaha.com>
In reply to: #31937

Paul Rubin <no.email@nospam.invalid> writes:
> Seeking backwards in files works, but the performance hit is
> significant.  There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction.  If it's something
> you're just running a few times, seeking backwards the simplest
> approach. 

Oh yes, I should have mentioned, it may be simpler and perhaps a little
bit faster to use mmap rather than seeking.
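
A minimal sketch of the mmap variant, assuming the same hypothetical fixed-size record layout (11 float64 values per record); the function name `read_backward_mmap` is likewise made up for illustration. The file is mapped read-only and records are unpacked directly from the mapping, from the last record to the first, leaving the paging to the operating system.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: 11 little-endian float64 values per time step.
RECORD = struct.Struct("<11d")


def read_backward_mmap(path):
    """Yield fixed-size records from last to first via a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises an error.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = len(mm) // RECORD.size
            for k in range(count - 1, -1, -1):
                yield RECORD.unpack_from(mm, k * RECORD.size)


# Small demonstration: write 5 records forward, then read them backward.
demo_path = os.path.join(tempfile.mkdtemp(), "forward_pass.bin")
with open(demo_path, "wb") as f:
    for k in range(5):
        f.write(RECORD.pack(*[float(k)] * 11))

first_fields = [rec[0] for rec in read_backward_mmap(demo_path)]
```

Whether this is faster than explicit seeks depends on the platform and the file size, so it is worth timing both on the real data.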


#31943

From: Tim Chase <python.list@tim.thechases.com>
Date: 2012-10-23 11:53 -0500
Message-ID: <mailman.2676.1351011158.27098.python-list@python.org>
In reply to: #31937

On 10/23/12 11:17, Paul Rubin wrote:
> Virgil Stokes <vs@it.uu.se> writes:
>> Finally, to my question --- What is a fast way to write these
>> variables to an external file and then read them in backwards?
> 
> Seeking backwards in files works, but the performance hit is
> significant.  There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction.  If it's something
> you're just running a few times, seeking backwards the simplest
> approach.  If you're really trying to optimize the thing, you might
> buffer up large chunks (like 1 MB) before writing.  If you're writing
> once and reading multiple times, you might reverse the order of records
> within the chunks during the writing phase.

I agree with Paul here.  It's been a while since I did it, and my
dataset was small enough (and passed through only once) that I just
let it run.  Writing larger chunks is definitely a good way to go.

> You're of course taking a performance bath from writing the program in
> Python to begin with (unless using scipy/numpy or the like), enough that
> it might dominate any effects of how the files are written.

I find that the I/O almost always overwhelms the actual
processing.

> Of course (it should go without saying) that you want to dump in a
> binary format rather than converting to decimal.

Again, the conversion to/from decimal hasn't been a great cost in my
experience, as it's overwhelmed by the I/O cost of shoveling the
data to/from disk.

-tkc


#31944

From: Paul Rubin <no.email@nospam.invalid>
Date: 2012-10-23 09:58 -0700
Message-ID: <7xliexktvl.fsf@ruckus.brouhaha.com>
In reply to: #31943

Tim Chase <python.list@tim.thechases.com> writes:
> Again, the conversion to/from decimal hasn't been a great cost in my
> experience, as it's overwhelmed by the I/O cost of shoveling the
> data to/from disk.

I've found that CPU costs both for processing and conversion are
significant.  Also, using a binary format makes the file a lot smaller,
which decreases the I/O cost as well as eliminating the conversion cost.
And, the conversion can introduce precision loss, another thing to be
avoided.  The famous "butterfly effect" was serendipitously discovered
that way.
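
A small illustration of the size and precision points above, using only the standard library. The example value and the format strings are arbitrary choices for the demonstration, not anything from the thread.

```python
import struct

x = 0.1 + 0.2  # stored as the float64 nearest to 0.30000000000000004

# Binary round trip: always 8 bytes, and the value comes back bit-for-bit.
packed = struct.pack("<d", x)
restored = struct.unpack("<d", packed)[0]

# Decimal round trip with a short format: low-order digits are lost.
text_short = "%.6f" % x
lossy = float(text_short)

# Decimal round trip with 17 significant digits: exact for float64,
# but the text is longer than the 8-byte binary form.
text_full = "%.17g" % x
exact = float(text_full)

print(len(packed), len(text_short), len(text_full))
print(restored == x, lossy == x, exact == x)
```

A decimal format can be made lossless by writing 17 significant digits, but then each value takes more than twice the space of its binary form and still has to be parsed on the way back in.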


#31946

From: Virgil Stokes <vs@it.uu.se>
Date: 2012-10-23 19:06 +0200
Message-ID: <mailman.2678.1351013559.27098.python-list@python.org>
In reply to: #31937

On 23-Oct-2012 18:17, Paul Rubin wrote:
> Virgil Stokes <vs@it.uu.se> writes:
>> Finally, to my question --- What is a fast way to write these
>> variables to an external file and then read them in backwards?
> Seeking backwards in files works, but the performance hit is
> significant.  There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction.  If it's something
> you're just running a few times, seeking backwards the simplest
> approach.  If you're really trying to optimize the thing, you might
> buffer up large chunks (like 1 MB) before writing.  If you're writing
> once and reading multiple times, you might reverse the order of records
> within the chunks during the writing phase.
I am writing (forward) once and reading (backward) once.
>
> You're of course taking a performance bath from writing the program in
> Python to begin with (unless using scipy/numpy or the like), enough that
> it might dominate any effects of how the files are written.
I am currently using SciPy/NumPy.
>
> Of course (it should go without saying) that you want to dump in a
> binary format rather than converting to decimal.
Yes, I am doing this (but thanks for "underlining" it!)

Thanks Paul :-)


#32049

From: rusi <rustompmody@gmail.com>
Date: 2012-10-24 08:11 -0700
Message-ID: <069780e5-43ee-48de-9257-fc0f16ec39d7@s9g2000pbh.googlegroups.com>
In reply to: #31933

On Oct 23, 7:52 pm, Virgil Stokes <v...@it.uu.se> wrote:
> I am working with some rather large data files (>100GB) that contain time series
> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
> various types of processing on these data (e.g. moving median, moving average,
> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
> number of these data need be stored in RAM when being processed. When performing
> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
> Thus, I will need to input these variables saved to an external file from the
> forward pass, in reverse order --- from last written to first written.
>
> Finally, to my question --- What is a fast way to write these variables to an
> external file and then read them in backwards?

Have you tried gdbm/bsddb? They are meant for this kind of thing (I believe).
They probably need to be installed separately on Windows; they work on Linux.
If I were you, I'd try them out with the giant data on Linux and see if the
problem is solved, then see how to install them on Windows.


#32097

From: Virgil Stokes <vs@it.uu.se>
Date: 2012-10-25 11:00 +0200
Message-ID: <mailman.2816.1351155659.27098.python-list@python.org>
In reply to: #32049

On 24-Oct-2012 17:11, rusi wrote:
> On Oct 23, 7:52 pm, Virgil Stokes <v...@it.uu.se> wrote:
>> I am working with some rather large data files (>100GB) that contain time series
>> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
>> various types of processing on these data (e.g. moving median, moving average,
>> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
>> number of these data need be stored in RAM when being processed. When performing
>> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
>> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
>> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
>> Thus, I will need to input these variables saved to an external file from the
>> forward pass, in reverse order --- from last written to first written.
>>
>> Finally, to my question --- What is a fast way to write these variables to an
>> external file and then read them in backwards?
> Have you tried gdbm/bsddbm? They are meant for such (I believe).
> Probably needs to be installed for windows; works for linux.
> If I were you I'd try out with the giant data on linux and see if the
> problem is solved, then see how to install for windows
Thanks Rusi :-)
