Groups > comp.lang.python > #21544 > unrolled thread
| Started by | MRAB <python@mrabarnett.plus.com> |
|---|---|
| First post | 2012-03-12 20:31 +0000 |
| Last post | 2012-03-21 17:32 +0100 |
| Articles | 5 — 4 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Fast file data retrieval? MRAB <python@mrabarnett.plus.com> - 2012-03-12 20:31 +0000
Re: Fast file data retrieval? Jon Clements <joncle@googlemail.com> - 2012-03-12 20:38 -0700
Re: Fast file data retrieval? Jon Clements <joncle@googlemail.com> - 2012-03-12 20:38 -0700
Re: Fast file data retrieval? Jorgen Grahn <grahn+nntp@snipabacken.se> - 2012-03-13 20:44 +0000
Re: Fast file data retrieval? Stefan Behnel <stefan_ml@behnel.de> - 2012-03-21 17:32 +0100
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-03-12 20:31 +0000 |
| Subject | Re: Fast file data retrieval? |
| Message-ID | <mailman.592.1331584145.3037.python-list@python.org> |
On 12/03/2012 19:39, Virgil Stokes wrote:
> I have a rather large ASCII file that is structured as follows
>
> header line
> 9 nonblank lines with alphanumeric data
> header line
> 9 nonblank lines with alphanumeric data
> ...
> ...
> ...
> header line
> 9 nonblank lines with alphanumeric data
> EOF
>
> where, a data set contains 10 lines (header + 9 nonblank) and there can
> be several thousand data sets in a single file. In addition, *each
> header has a unique ID code*.
>
> Is there a fast method for the retrieval of a data set from this large
> file given its ID code?
>
Probably the best solution is to put it into a database. Have a look at
the sqlite3 module.

Alternatively, you could scan the file, recording the ID and the file
offset in a dict so that, given an ID, you can seek directly to that
file position.
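
The offset-index idea above could look roughly like the following. This is only a sketch under stated assumptions: the thread never says where the ID sits in the header line, so the code assumes it is the first whitespace-separated token, and the names `build_index` and `read_data_set` as well as the sample IDs are made up for illustration.

```python
import os
import tempfile

LINES_PER_SET = 10  # header + 9 data lines, as described in the question


def build_index(path):
    """Scan the file once and map each header ID to its byte offset."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            # Assumption: the ID is the first whitespace-separated token.
            index[header.split()[0].decode("ascii")] = offset
            for _ in range(LINES_PER_SET - 1):
                f.readline()  # skip the 9 data lines of this set
    return index


def read_data_set(path, index, set_id):
    """Seek directly to a data set and return its 10 lines."""
    with open(path, "rb") as f:
        f.seek(index[set_id])
        return [f.readline().decode("ascii").rstrip("\r\n")
                for _ in range(LINES_PER_SET)]


# Small demonstration on a made-up file with three data sets.
sample = "".join(
    "%s header\n" % set_id
    + "".join("%s data %d\n" % (set_id, n) for n in range(1, 10))
    for set_id in ("A1", "B2", "C3")
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

index = build_index(tmp.name)
lines = read_data_set(tmp.name, index, "B2")
os.remove(tmp.name)
print(lines[0])
```

The file is opened in binary mode so that the offsets recorded by `tell()` are plain byte positions that `seek()` can reuse. The index has to be rebuilt whenever the file changes.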
| From | Jon Clements <joncle@googlemail.com> |
|---|---|
| Date | 2012-03-12 20:38 -0700 |
| Message-ID | <mailman.599.1331609909.3037.python-list@python.org> |
| In reply to | #21544 |
On Monday, 12 March 2012 20:31:35 UTC, MRAB wrote:
> On 12/03/2012 19:39, Virgil Stokes wrote:
> > I have a rather large ASCII file that is structured as follows
> >
> > header line
> > 9 nonblank lines with alphanumeric data
> > header line
> > 9 nonblank lines with alphanumeric data
> > ...
> > ...
> > ...
> > header line
> > 9 nonblank lines with alphanumeric data
> > EOF
> >
> > where, a data set contains 10 lines (header + 9 nonblank) and there can
> > be several thousand data sets in a single file. In addition, *each
> > header has a unique ID code*.
> >
> > Is there a fast method for the retrieval of a data set from this large
> > file given its ID code?
> >
> Probably the best solution is to put it into a database. Have a look at
> the sqlite3 module.
>
> Alternatively, you could scan the file, recording the ID and the file
> offset in a dict so that, given an ID, you can seek directly to that
> file position.

I would have a look at either bsddb, Tokyo (or Kyoto) Cabinet or hamsterdb.

If it's really going to get large and needs a full blown server, maybe
MongoDB/redis/hadoop...
| From | Jon Clements <joncle@googlemail.com> |
|---|---|
| Date | 2012-03-12 20:38 -0700 |
| Message-ID | <8469277.2076.1331609905709.JavaMail.geo-discussion-forums@vbai14> |
| In reply to | #21544 |
On Monday, 12 March 2012 20:31:35 UTC, MRAB wrote:
> On 12/03/2012 19:39, Virgil Stokes wrote:
> > I have a rather large ASCII file that is structured as follows
> >
> > header line
> > 9 nonblank lines with alphanumeric data
> > header line
> > 9 nonblank lines with alphanumeric data
> > ...
> > ...
> > ...
> > header line
> > 9 nonblank lines with alphanumeric data
> > EOF
> >
> > where, a data set contains 10 lines (header + 9 nonblank) and there can
> > be several thousand data sets in a single file. In addition, *each
> > header has a unique ID code*.
> >
> > Is there a fast method for the retrieval of a data set from this large
> > file given its ID code?
> >
> Probably the best solution is to put it into a database. Have a look at
> the sqlite3 module.
>
> Alternatively, you could scan the file, recording the ID and the file
> offset in a dict so that, given an ID, you can seek directly to that
> file position.

I would have a look at either bsddb, Tokyo (or Kyoto) Cabinet or hamsterdb.

If it's really going to get large and needs a full blown server, maybe
MongoDB/redis/hadoop...
| From | Jorgen Grahn <grahn+nntp@snipabacken.se> |
|---|---|
| Date | 2012-03-13 20:44 +0000 |
| Message-ID | <slrnjlvcdk.1ls.grahn+nntp@frailea.sa.invalid> |
| In reply to | #21544 |
On Mon, 2012-03-12, MRAB wrote:
> On 12/03/2012 19:39, Virgil Stokes wrote:
>> I have a rather large ASCII file that is structured as follows
>>
>> header line
>> 9 nonblank lines with alphanumeric data
>> header line
>> 9 nonblank lines with alphanumeric data
>> ...
>> ...
>> ...
>> header line
>> 9 nonblank lines with alphanumeric data
>> EOF
>>
>> where, a data set contains 10 lines (header + 9 nonblank) and there can
>> be several thousand data sets in a single file. In addition, *each
>> header has a unique ID code*.
>>
>> Is there a fast method for the retrieval of a data set from this large
>> file given its ID code?

[Responding here since the original is not available on my server.]

It depends on what you want to do. Access a few of the entries (what
you call data sets) from your program? Process all of them? How fast
do you need it to be?

> Probably the best solution is to put it into a database. Have a look at
> the sqlite3 module.

Some people like to use databases for everything, others never use
them. I'm in the latter crowd, so to me this sounds as overkill, and
possibly impractical. What if he has to keep the text file around? A
database on disk would mean duplicating the data. A database in
memory would not offer any benefits over a hash.

> Alternatively, you could scan the file, recording the ID and the file
> offset in a dict so that, given an ID, you can seek directly to that
> file position.

Mmapping the file (the mmap module) is another option. But I wonder
if this really would improve things.

"Several thousand" entries is not much these days. If a line is 80
characters, 5000 entries would take ~3MB of memory. The time to move
this from disk to a Python list of 9-tuples of strings would be almost
only disk I/O.

I think he should try to do it the dumb way first: read everything into
memory once.

/Jorgen

-- 
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .
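
For the "dumb way" suggested above, a minimal sketch might read the whole file once into a dict keyed by ID, with each value being the 9-tuple of data lines. As before, this assumes the ID is the first whitespace-separated token of the header line, which the thread does not confirm; `load_data_sets` and the sample data are hypothetical names used only for illustration.

```python
import io
from itertools import islice


def load_data_sets(lines, lines_per_set=10):
    """Return {id: 9-tuple of data lines} from any iterable of text lines."""
    data_sets = {}
    it = iter(lines)
    while True:
        # Take the next header + 9 data lines in one go.
        chunk = [line.rstrip("\r\n") for line in islice(it, lines_per_set)]
        if not chunk or not chunk[0].strip():
            break  # end of input (or a trailing blank line)
        header, body = chunk[0], tuple(chunk[1:])
        # Assumption: the ID is the first whitespace-separated token.
        data_sets[header.split()[0]] = body
    return data_sets


# Demonstration on made-up in-memory data; a real call would pass an open file.
sample = io.StringIO("".join(
    "%s header\n" % set_id
    + "".join("%s data %d\n" % (set_id, n) for n in range(1, 10))
    for set_id in ("A1", "B2", "C3")
))
data_sets = load_data_sets(sample)
print(data_sets["B2"][0])
```

With a real file the call would be something like `with open(path) as f: data_sets = load_data_sets(f)`. At the sizes estimated above (a few megabytes) the whole dict fits comfortably in memory, and every lookup afterwards is a plain dict access.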
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2012-03-21 17:32 +0100 |
| Message-ID | <mailman.863.1332347577.3037.python-list@python.org> |
| In reply to | #21583 |
Jorgen Grahn, 13.03.2012 21:44:
> On Mon, 2012-03-12, MRAB wrote:
>> Probably the best solution is to put it into a database. Have a look at
>> the sqlite3 module.
>
> Some people like to use databases for everything, others never use
> them. I'm in the latter crowd, so to me this sounds as overkill

Well, there's databases and databases. I agree that the complexity of a SQL
database is likely unnecessary here since a key-value database (any of the
dbm modules) appears to be sufficient from what the OP wrote.

Stefan
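
A rough sketch of the key-value approach, using the `dbm` package from the Python 3 standard library (which picks whichever dbm backend is available). The layout is an assumption: the ID is taken to be the first whitespace-separated token of the header line, and each whole data set is stored as one value. `build_store` and `fetch` are hypothetical helper names, not part of any library.

```python
import dbm
import os
import tempfile
from itertools import islice


def build_store(lines, db_path, lines_per_set=10):
    """Write each data set into a dbm file, keyed by its header ID."""
    it = iter(lines)
    with dbm.open(db_path, "n") as db:  # "n": always create a new, empty database
        while True:
            chunk = list(islice(it, lines_per_set))
            if not chunk or not chunk[0].strip():
                break  # end of input (or a trailing blank line)
            # Assumption: the ID is the first whitespace-separated token.
            set_id = chunk[0].split()[0]
            db[set_id.encode("ascii")] = "".join(chunk).encode("ascii")


def fetch(db_path, set_id):
    """Look up one data set by ID and return its lines."""
    with dbm.open(db_path, "r") as db:
        return db[set_id.encode("ascii")].decode("ascii").splitlines()


# Demonstration on made-up data in a temporary directory.
sample = []
for set_id in ("A1", "B2", "C3"):
    sample.append("%s header\n" % set_id)
    sample.extend("%s data %d\n" % (set_id, n) for n in range(1, 10))

with tempfile.TemporaryDirectory() as tmpdir:
    db_path = os.path.join(tmpdir, "datasets")
    build_store(sample, db_path)
    fetched = fetch(db_path, "C3")
print(fetched[0])
```

Unlike the in-memory variants, the store survives between runs, at the cost of keeping a second copy of the data on disk, which is the trade-off raised earlier in the thread.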