| Started by | Virgil Stokes <vs@it.uu.se> |
|---|---|
| First post | 2012-10-23 16:31 +0200 |
| Last post | 2012-10-25 11:00 +0200 |
| Articles | 8 (4 participants) |
Fast forward-backward (write-read) Virgil Stokes <vs@it.uu.se> - 2012-10-23 16:31 +0200
Re: Fast forward-backward (write-read) Paul Rubin <no.email@nospam.invalid> - 2012-10-23 09:17 -0700
Re: Fast forward-backward (write-read) Paul Rubin <no.email@nospam.invalid> - 2012-10-23 09:22 -0700
Re: Fast forward-backward (write-read) Tim Chase <python.list@tim.thechases.com> - 2012-10-23 11:53 -0500
Re: Fast forward-backward (write-read) Paul Rubin <no.email@nospam.invalid> - 2012-10-23 09:58 -0700
Re: Fast forward-backward (write-read) Virgil Stokes <vs@it.uu.se> - 2012-10-23 19:06 +0200
Re: Fast forward-backward (write-read) rusi <rustompmody@gmail.com> - 2012-10-24 08:11 -0700
Re: Fast forward-backward (write-read) Virgil Stokes <vs@it.uu.se> - 2012-10-25 11:00 +0200
| From | Virgil Stokes <vs@it.uu.se> |
|---|---|
| Date | 2012-10-23 16:31 +0200 |
| Subject | Fast forward-backward (write-read) |
| Message-ID | <mailman.2667.1351003942.27098.python-list@python.org> |
I am working with some rather large data files (>100GB) that contain time series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform various types of processing on these data (e.g. moving median, moving average, and Kalman-filter, Kalman-smoother) in a sequential manner and only a small number of these data need be stored in RAM when being processed. When performing Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these variables saved to an external file from the forward pass, in reverse order --- from last written to first written.

Finally, to my question --- What is a fast way to write these variables to an external file and then read them in backwards?
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-10-23 09:17 -0700 |
| Message-ID | <7xehkpxiw0.fsf@ruckus.brouhaha.com> |
| In reply to | #31933 |
Virgil Stokes <vs@it.uu.se> writes:

> Finally, to my question --- What is a fast way to write these
> variables to an external file and then read them in backwards?

Seeking backwards in files works, but the performance hit is significant. There is also a performance hit to scanning pointers backwards in memory, due to cache misprediction. If it's something you're just running a few times, seeking backwards is the simplest approach. If you're really trying to optimize the thing, you might buffer up large chunks (like 1 MB) before writing. If you're writing once and reading multiple times, you might reverse the order of records within the chunks during the writing phase.

You're of course taking a performance bath from writing the program in Python to begin with (unless using scipy/numpy or the like), enough that it might dominate any effects of how the files are written.

Of course (it should go without saying), you want to dump in a binary format rather than converting to decimal.
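A minimal sketch of the seek-backwards approach with buffered binary writes, as suggested above. The record layout (11 doubles per time step), the chunk size, and the function names are assumptions for illustration, not details from the thread; only the 1 MB write buffer comes from the post.

```python
import os
import struct

# Hypothetical record layout: 11 double-precision floats per time step.
RECORD = struct.Struct("<11d")
CHUNK_RECORDS = 4096  # records fetched per backward seek (assumed; tune it)


def write_forward(path, records):
    """Write fixed-size binary records; a 1 MB buffer batches the writes."""
    with open(path, "wb", buffering=1 << 20) as f:
        for rec in records:
            f.write(RECORD.pack(*rec))


def read_backward(path):
    """Yield records from last written to first, one chunk per seek."""
    n_total = os.path.getsize(path) // RECORD.size
    with open(path, "rb") as f:
        end = n_total
        while end > 0:
            start = max(0, end - CHUNK_RECORDS)
            f.seek(start * RECORD.size)
            chunk = f.read((end - start) * RECORD.size)
            # Walk the chunk from its last record back to its first.
            for i in range(end - start - 1, -1, -1):
                yield RECORD.unpack_from(chunk, i * RECORD.size)
            end = start
```

Because every record has the same size, the offset of record k is simply `k * RECORD.size`, so no index file is needed. Reading a whole chunk per seek keeps the number of backward seeks small; whether that matters in practice would have to be measured on the actual data.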
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-10-23 09:22 -0700 |
| Message-ID | <7xmwzdkvk2.fsf@ruckus.brouhaha.com> |
| In reply to | #31937 |
Paul Rubin <no.email@nospam.invalid> writes:

> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach.

Oh yes, I should have mentioned, it may be simpler and perhaps a little bit faster to use mmap rather than seeking.
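The mmap variant could look roughly like this. It is a sketch under the same assumed layout of 11 doubles per record; `read_backward_mmap` is a hypothetical name. It relies on the standard `mmap` and `struct` modules only.

```python
import mmap
import struct

RECORD = struct.Struct("<11d")  # hypothetical layout, not from the thread


def read_backward_mmap(path):
    """Yield fixed-size records from last to first via a read-only mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this raises ValueError if it is empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n = len(mm) // RECORD.size
            for i in range(n - 1, -1, -1):
                yield RECORD.unpack_from(mm, i * RECORD.size)
```

The operating system pages data in on demand, so the file does not need to fit in RAM. Mapping a file larger than 100 GB does require a 64-bit process, since a 32-bit address space is too small.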
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2012-10-23 11:53 -0500 |
| Message-ID | <mailman.2676.1351011158.27098.python-list@python.org> |
| In reply to | #31937 |
On 10/23/12 11:17, Paul Rubin wrote:

> Virgil Stokes <vs@it.uu.se> writes:
>> Finally, to my question --- What is a fast way to write these
>> variables to an external file and then read them in backwards?
>
> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach. If you're really trying to optimize the thing, you might
> buffer up large chunks (like 1 MB) before writing. If you're writing
> once and reading multiple times, you might reverse the order of records
> within the chunks during the writing phase.

I agree with Paul here. It's been a while since I did it, and my dataset was small enough (and passed through only once) that I just let it run. Writing larger chunks is definitely a good way to go.

> You're of course taking a performance bath from writing the program in
> Python to begin with (unless using scipy/numpy or the like), enough that
> it might dominate any effects of how the files are written.

I find that the I/O almost always overwhelms the actual processing.

> Of course (it should go without saying), you want to dump in a
> binary format rather than converting to decimal.

Again, the conversion to/from decimal hasn't been a great cost in my experience, as it's overwhelmed by the I/O cost of shoveling the data to/from disk.

-tkc
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-10-23 09:58 -0700 |
| Message-ID | <7xliexktvl.fsf@ruckus.brouhaha.com> |
| In reply to | #31943 |
Tim Chase <python.list@tim.thechases.com> writes:

> Again, the conversion to/from decimal hasn't been a great cost in my
> experience, as it's overwhelmed by the I/O cost of shoveling the
> data to/from disk.

I've found that CPU costs for both processing and conversion are significant. Also, using a binary format makes the file a lot smaller, which decreases the I/O cost as well as eliminating the conversion cost. And the conversion can introduce precision loss, another thing to be avoided. The famous "butterfly effect" was serendipitously discovered that way.
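The size and precision points can be illustrated with a single value. This is only a sketch: the `%.6g` format stands in for whatever decimal format a program might use and is an assumption, not something from the thread.

```python
import struct

value = 2.0 / 3.0  # an arbitrary double with a long decimal expansion

# Binary: always 8 bytes, and the round trip is bit-for-bit exact.
packed = struct.pack("<d", value)
binary_roundtrip = struct.unpack("<d", packed)[0]

# Decimal text with 6 significant digits (assumed format): compact,
# but the low-order digits are discarded on the way out.
text_short = "%.6g" % value
short_roundtrip = float(text_short)

# Decimal text via repr(): round-trips exactly in Python 3,
# at the cost of more bytes than the packed form.
text_full = repr(value)
full_roundtrip = float(text_full)

print(len(packed), len(text_short), len(text_full))
print(binary_roundtrip == value, short_roundtrip == value, full_roundtrip == value)
```

A double can need up to 17 significant decimal digits to survive a text round trip, so lossless text is generally larger than the 8-byte binary form, before counting separators and newlines.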
| From | Virgil Stokes <vs@it.uu.se> |
|---|---|
| Date | 2012-10-23 19:06 +0200 |
| Message-ID | <mailman.2678.1351013559.27098.python-list@python.org> |
| In reply to | #31937 |
On 23-Oct-2012 18:17, Paul Rubin wrote:

> Virgil Stokes <vs@it.uu.se> writes:
>> Finally, to my question --- What is a fast way to write these
>> variables to an external file and then read them in backwards?
> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach. If you're really trying to optimize the thing, you might
> buffer up large chunks (like 1 MB) before writing. If you're writing
> once and reading multiple times, you might reverse the order of records
> within the chunks during the writing phase.

I am writing (forward) once and reading (backward) once.

> You're of course taking a performance bath from writing the program in
> Python to begin with (unless using scipy/numpy or the like), enough that
> it might dominate any effects of how the files are written.

I am currently using SciPy/NumPy.

> Of course (it should go without saying), you want to dump in a
> binary format rather than converting to decimal.

Yes, I am doing this (but thanks for "underlining" it!)

Thanks Paul :-)
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2012-10-24 08:11 -0700 |
| Message-ID | <069780e5-43ee-48de-9257-fc0f16ec39d7@s9g2000pbh.googlegroups.com> |
| In reply to | #31933 |
On Oct 23, 7:52 pm, Virgil Stokes <v...@it.uu.se> wrote:

> I am working with some rather large data files (>100GB) that contain time series
> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
> various types of processing on these data (e.g. moving median, moving average,
> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
> number of these data need be stored in RAM when being processed. When performing
> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
> Thus, I will need to input these variables saved to an external file from the
> forward pass, in reverse order --- from last written to first written.
>
> Finally, to my question --- What is a fast way to write these variables to an
> external file and then read them in backwards?

Have you tried gdbm/bsddb? They are meant for this kind of thing (I believe). They probably need to be installed separately on Windows; they work on Linux. If I were you I'd try it out with the giant data on Linux and see if the problem is solved, then see how to install it on Windows.
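For comparison, a key-value version might look like the sketch below. It swaps in the standard library's generic `dbm` interface (which picks whatever backend is installed, gdbm included) rather than calling gdbm or bsddb directly. The key scheme, the `count` entry, the record layout, and the function names are all assumptions for illustration.

```python
import dbm
import struct

RECORD = struct.Struct("<11d")  # hypothetical layout, not from the thread


def store_forward(db_path, records):
    """Store each record under its step index, plus a final record count."""
    with dbm.open(db_path, "n") as db:  # "n": always create a new database
        n = 0
        for rec in records:
            db[str(n)] = RECORD.pack(*rec)
            n += 1
        db["count"] = str(n)


def load_backward(db_path):
    """Yield records from the highest step index down to 0."""
    with dbm.open(db_path, "r") as db:
        n = int(db["count"].decode())
        for k in range(n - 1, -1, -1):
            yield RECORD.unpack(db[str(k)])
```

Since the records are fixed-size and only ever read in strict reverse order, a flat binary file already gives direct access by offset; whether a database adds anything here beyond overhead would need to be measured.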
| From | Virgil Stokes <vs@it.uu.se> |
|---|---|
| Date | 2012-10-25 11:00 +0200 |
| Message-ID | <mailman.2816.1351155659.27098.python-list@python.org> |
| In reply to | #32049 |
On 24-Oct-2012 17:11, rusi wrote:

> On Oct 23, 7:52 pm, Virgil Stokes <v...@it.uu.se> wrote:
>> I am working with some rather large data files (>100GB) that contain time series
>> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
>> various types of processing on these data (e.g. moving median, moving average,
>> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
>> number of these data need be stored in RAM when being processed. When performing
>> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
>> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
>> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
>> Thus, I will need to input these variables saved to an external file from the
>> forward pass, in reverse order --- from last written to first written.
>>
>> Finally, to my question --- What is a fast way to write these variables to an
>> external file and then read them in backwards?
> Have you tried gdbm/bsddb? They are meant for this kind of thing (I believe).
> They probably need to be installed separately on Windows; they work on Linux.
> If I were you I'd try it out with the giant data on Linux and see if the
> problem is solved, then see how to install it on Windows.

Thanks Rusi :-)