Re: Fast file data retrieval?

Started by	MRAB <python@mrabarnett.plus.com>
First post	2012-03-12 20:31 +0000
Last post	2012-03-21 17:32 +0100
Articles	5 — 4 participants

This discussion started before the indexed window, so earlier articles aren't shown. The article labeled "Started by" above is the oldest one visible, not the original post.

  Re: Fast file data retrieval? MRAB <python@mrabarnett.plus.com> - 2012-03-12 20:31 +0000
    Re: Fast file data retrieval? Jon Clements <joncle@googlemail.com> - 2012-03-12 20:38 -0700
    Re: Fast file data retrieval? Jon Clements <joncle@googlemail.com> - 2012-03-12 20:38 -0700
    Re: Fast file data retrieval? Jorgen Grahn <grahn+nntp@snipabacken.se> - 2012-03-13 20:44 +0000
      Re: Fast file data retrieval? Stefan Behnel <stefan_ml@behnel.de> - 2012-03-21 17:32 +0100

#21544 — Re: Fast file data retrieval?

From	MRAB <python@mrabarnett.plus.com>
Date	2012-03-12 20:31 +0000
Subject	Re: Fast file data retrieval?
Message-ID	<mailman.592.1331584145.3037.python-list@python.org>

On 12/03/2012 19:39, Virgil Stokes wrote:
> I have a rather large ASCII file that is structured as follows
>
> header line
> 9 nonblank lines with alphanumeric data
> header line
> 9 nonblank lines with alphanumeric data
> ...
> ...
> ...
> header line
> 9 nonblank lines with alphanumeric data
> EOF
>
> where, a data set contains 10 lines (header + 9 nonblank) and there can
> be several thousand data sets in a single file. In addition, *each header
> has a unique ID code*.
>
> Is there a fast method for the retrieval of a data set from this large
> file given its ID code?
>
Probably the best solution is to put it into a database. Have a look at
the sqlite3 module.
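
A minimal sketch of the sqlite3 route, under assumptions the original question does not confirm: the ID is taken to be the first whitespace-separated token of each header line, the file has no blank lines, and the names `build_db`, `fetch` and the `datasets` table are made up for illustration. Each 10-line data set is stored as one text value keyed by its ID.

```python
import sqlite3

LINES_PER_SET = 10  # header + 9 data lines, as described in the question


def build_db(text_path, db_path):
    """Load every data set into an SQLite table keyed by its ID."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets (id TEXT PRIMARY KEY, body TEXT)"
    )
    with open(text_path, "r", encoding="ascii") as f:
        lines = f.readlines()
    rows = []
    for start in range(0, len(lines), LINES_PER_SET):
        block = lines[start:start + LINES_PER_SET]
        set_id = block[0].split()[0]  # ASSUMPTION: ID is the first header token
        rows.append((set_id, "".join(block)))
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO datasets VALUES (?, ?)", rows)
    return conn


def fetch(conn, set_id):
    """Return the 10 lines of one data set, or None if the ID is unknown."""
    row = conn.execute(
        "SELECT body FROM datasets WHERE id = ?", (set_id,)
    ).fetchone()
    return row[0].splitlines() if row else None
```

Passing ":memory:" as `db_path` keeps the table in RAM; a file name makes the index persistent between runs, at the cost of duplicating the data on disk.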

Alternatively, you could scan the file, recording the ID and the file
offset in a dict so that, given an ID, you can seek directly to that
file position.
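
The offset-dict idea could look roughly like this. It is only a sketch: it assumes the ID is the first whitespace-separated token of each header line and that every data set is exactly 10 lines with no blank lines in between; `build_index` and `read_set` are hypothetical names.

```python
LINES_PER_SET = 10  # header + 9 data lines, as described in the question


def build_index(path):
    """Scan the file once and map each ID to the byte offset of its header."""
    index = {}
    # Binary mode so that tell() and seek() work with plain byte offsets.
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.readline()
            if not header:  # end of file
                break
            # ASSUMPTION: the ID is the first token of the header line.
            index[header.split()[0].decode("ascii")] = offset
            for _ in range(LINES_PER_SET - 1):  # skip the 9 data lines
                f.readline()
    return index


def read_set(path, index, set_id):
    """Seek straight to a data set and return its 10 lines as strings."""
    with open(path, "rb") as f:
        f.seek(index[set_id])
        return [
            f.readline().decode("ascii").rstrip("\r\n")
            for _ in range(LINES_PER_SET)
        ]
```

The scan is done once; after that each lookup costs one dict access, one seek and ten readline calls, and only the IDs and offsets are held in memory.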

#21550

From	Jon Clements <joncle@googlemail.com>
Date	2012-03-12 20:38 -0700
Message-ID	<mailman.599.1331609909.3037.python-list@python.org>
In reply to	#21544

On Monday, 12 March 2012 20:31:35 UTC, MRAB  wrote:
> On 12/03/2012 19:39, Virgil Stokes wrote:
> > I have a rather large ASCII file that is structured as follows
> >
> > header line
> > 9 nonblank lines with alphanumeric data
> > header line
> > 9 nonblank lines with alphanumeric data
> > ...
> > ...
> > ...
> > header line
> > 9 nonblank lines with alphanumeric data
> > EOF
> >
> > where, a data set contains 10 lines (header + 9 nonblank) and there can
> > be several thousand data sets in a single file. In addition, *each header
> > has a unique ID code*.
> >
> > Is there a fast method for the retrieval of a data set from this large
> > file given its ID code?
> >
> Probably the best solution is to put it into a database. Have a look at
> the sqlite3 module.
> 
> Alternatively, you could scan the file, recording the ID and the file
> offset in a dict so that, given an ID, you can seek directly to that
> file position.

I would have a look at either bsddb, Tokyo (or Kyoto) Cabinet or hamsterdb. If it's really going to get large and needs a full-blown server, maybe MongoDB/Redis/Hadoop...

#21583

From	Jorgen Grahn <grahn+nntp@snipabacken.se>
Date	2012-03-13 20:44 +0000
Message-ID	<slrnjlvcdk.1ls.grahn+nntp@frailea.sa.invalid>
In reply to	#21544

On Mon, 2012-03-12, MRAB wrote:
> On 12/03/2012 19:39, Virgil Stokes wrote:
>> I have a rather large ASCII file that is structured as follows
>>
>> header line
>> 9 nonblank lines with alphanumeric data
>> header line
>> 9 nonblank lines with alphanumeric data
>> ...
>> ...
>> ...
>> header line
>> 9 nonblank lines with alphanumeric data
>> EOF
>>
>> where, a data set contains 10 lines (header + 9 nonblank) and there can
>> be several thousand data sets in a single file. In addition, *each header
>> has a unique ID code*.
>>
>> Is there a fast method for the retrieval of a data set from this large
>> file given its ID code?

[Responding here since the original is not available on my server.]

It depends on what you want to do. Access a few of the entries (what
you call data sets) from your program? Process all of them?  How fast
do you need it to be?

> Probably the best solution is to put it into a database. Have a look at
> the sqlite3 module.

Some people like to use databases for everything, others never use
them. I'm in the latter crowd, so to me this sounds like overkill, and
possibly impractical. What if he has to keep the text file around? A
database on disk would mean duplicating the data. A database in memory
would not offer any benefits over a hash.

> Alternatively, you could scan the file, recording the ID and the file
> offset in a dict so that, given an ID, you can seek directly to that
> file position.

Mmapping the file (the mmap module) is another option.
But I wonder if this really would improve things.
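
For comparison, a rough sketch of what the mmap variant might look like. It assumes the ID is the first whitespace-separated token of a header line and that no data line begins with that same token; `find_set` is a hypothetical name.

```python
import mmap


def find_set(path, set_id, lines_per_set=10):
    """Search a memory-mapped file for the header that starts with set_id."""
    needle = set_id.encode("ascii")
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(needle)
        while pos != -1:
            # Only accept a match that sits at the start of a line.
            if pos == 0 or mm[pos - 1:pos] == b"\n":
                mm.seek(pos)
                lines = [mm.readline() for _ in range(lines_per_set)]
                # ASSUMPTION: the ID is the whole first token of the header,
                # so "ID1" must not match a header starting with "ID10".
                if lines[0].split()[0] == needle:
                    return [
                        ln.decode("ascii").rstrip("\r\n") for ln in lines
                    ]
            pos = mm.find(needle, pos + 1)
    return None
```

This still searches linearly on every lookup; it only leaves the paging to the operating system, which is why it may not be an improvement over simply reading the file.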

"Several thousand" entries is not much these days. If a line is 80
characters, 5000 entries would take ~3MB of memory. The time to move
this from disk to a Python list of 9-tuples of strings would be almost
entirely disk I/O.

I think he should try to do it the dumb way first: read everything
into memory once.
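
A sketch of the dumb way, assuming the ID is the first whitespace-separated token of each header line and that the file contains nothing but complete 10-line entries; `load_all` is a hypothetical name.

```python
def load_all(path, lines_per_set=10):
    """Read the whole file and return a dict mapping ID -> 9-tuple of lines."""
    with open(path, "r", encoding="ascii") as f:
        lines = f.read().splitlines()
    table = {}
    for start in range(0, len(lines), lines_per_set):
        header, *data = lines[start:start + lines_per_set]
        # ASSUMPTION: the ID is the first token of the header line.
        table[header.split()[0]] = tuple(data)
    return table
```

After the one-time load, `table[some_id]` is an ordinary dict lookup.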

/Jorgen

-- 
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .

#21990

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2012-03-21 17:32 +0100
Message-ID	<mailman.863.1332347577.3037.python-list@python.org>
In reply to	#21583

Jorgen Grahn, 13.03.2012 21:44:
> On Mon, 2012-03-12, MRAB wrote:
>> Probably the best solution is to put it into a database. Have a look at
>> the sqlite3 module.
> 
> Some people like to use databases for everything, others never use
> them. I'm in the latter crowd, so to me this sounds like overkill

Well, there's databases and databases. I agree that the complexity of a SQL
database is likely unnecessary here since a key-value database (any of the
dbm modules) appears to be sufficient from what the OP wrote.
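
A minimal sketch of the key-value approach using the standard `dbm` package of Python 3 (in Python 2 the generic entry point was `anydbm`). It assumes the ID is the first whitespace-separated token of each header line; `build_dbm` and `lookup` are hypothetical names.

```python
import dbm


def build_dbm(text_path, db_path, lines_per_set=10):
    """Store each 10-line data set in a dbm file under its ID."""
    with open(text_path, "r", encoding="ascii") as f:
        lines = f.readlines()
    with dbm.open(db_path, "n") as db:  # "n": always create a new, empty db
        for start in range(0, len(lines), lines_per_set):
            block = lines[start:start + lines_per_set]
            # ASSUMPTION: the ID is the first token of the header line.
            key = block[0].split()[0].encode("ascii")
            db[key] = "".join(block).encode("ascii")


def lookup(db_path, set_id):
    """Return the lines of one data set, or None if the ID is not stored."""
    with dbm.open(db_path, "r") as db:
        try:
            value = db[set_id.encode("ascii")]
        except KeyError:
            return None
    return value.decode("ascii").splitlines()
```

Which backend `dbm.open` picks depends on what is installed, so the file names it creates on disk can differ between systems.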

Stefan
