Groups > comp.sources.testers > #4 > unrolled thread

filesystem content tracking

Started by	Ivan Shmakov <oneingray@gmail.com>
First post	2012-03-14 21:53 +0700
Last post	2012-03-22 11:41 +0700
Articles	4 — 1 participant

Back to article view | Back to comp.sources.testers

  filesystem content tracking Ivan Shmakov <oneingray@gmail.com> - 2012-03-14 21:53 +0700
    Re: filesystem content tracking Ivan Shmakov <oneingray@gmail.com> - 2012-03-19 18:03 +0700
      Re: filesystem content tracking Ivan Shmakov <oneingray@gmail.com> - 2012-03-20 23:37 +0700
      Re: filesystem content tracking Ivan Shmakov <oneingray@gmail.com> - 2012-03-22 11:41 +0700

#4 — filesystem content tracking

From	Ivan Shmakov <oneingray@gmail.com>
Date	2012-03-14 21:53 +0700
Subject	filesystem content tracking
Message-ID	<86aa3j9quj.fsf@gray.siamics.net>

	[Cross-posting to news:comp.unix.shell, for the related
	questions are raised there from time to time.]

    More than occasionally, there's a need to create and maintain an
    "inventory" of the filesystem contents, to facilitate backup, or to
    maintain a history of file addition and removal, or for other
    purposes.  Below, I describe my current view on a possible design
    for a tool aiming to solve such a task (tentantively named FCCS, for
    Filesystem Content Checking (Control?) System), and relate my
    experiences with implementing and using of the versions of this tool
    relying on SQLite and a custom ASN.1-based file format for history
    storage.


    First try: SQLite

	Somewhat recently, I've given this task one more try.  My first
	approach was to use a SQLite database to record the current
	state and the history of filesystem changes.  The basic "units"
	of the filesystem state being tracked are:

	* the results of the stat(2) system call;

	* the message digests computed over (regular) file contents.

	Each time a "filesystem scan" is performed, the software records
	a "session", which binds filenames to stat(2) results and
	contents' digests.  Should there be a "previous" session, its
	data is "linked" to the newly created one, and is then used to
	avoid computing the digests for the unchanged files.

	Specifically, the digest is computed if the stat(2) results
	obtained for the file (sans the st_atime and st_dev fields) are
	/not/ bound to any (filename, digest) pair within the current or
	"previous" session.

	On the other hand, if there /is/ such a binding, the digest may
	be recomputed anyway if the binding is itself too old.  For this
	reason, the bindings are themselves timestamped.  (And so are
	the sessions.)

	(For anyone wishing to try this version, its sources could be
	found at [1, 2].)


    Trying a bunch of ASN.1-based files instead

	Unfortunately, despite my attempts to implement a
	space-efficient database schema, the volume of the resulting
	database is not only enormous by itself (about 360 bytes per
	file being tracked for the first session) but also grows rather
	rapidly (about 86 bytes per file being tracked per session.)

	Also, as the database grows, the addition of new records begins
	to take growing amounts of time.  For the third session of the
	database tracking about 300000 files, the rate could be as slow
	as a few regular files per second on a fairly decent hardware.
	Indeed, such a session may take a whole day to complete!

	This, combined with that there seem to be no free software
	extensions for transparent SQLite database file compression,
	prompted me to change my strategy.

	Namely, I've decided that instead of using a single database
	file for all the sessions, I'd use one binary "log" file per
	session, while allowing the tool to both read the files
	containing previous sessions (so to compute digests only for the
	new and changed files) and to /reference/ them from within the
	resulting file (so to conserve space.)

	As there's no need anymore to allow concurrent access to a
	single file (files containing previously recorded sessions can
	safely be examined while the new file is being written), the
	in-file database indices are becoming less relevant.

	Additionally, as the files in question are to be read mainly
	sequentially, it becomes feasible to compress them even when
	they're still used.  Also, as there're distinct files for
	distinct sessions, it's easy to compress those no longer needed,
	or move them to some cheaper storage (slow HDD's and DVD+R's.)

	Now, to the numbers.  For the test run, I've processed 47151
	files (41285 regular, 2.6 GiB in total), and it took only 6:21
	to produce the corresponding 251978 records (10 MiB in total.)
	The time is comparable to raw sha1sum(1) (about 4:20), and the
	space demands are roughly 222 bytes per file on average (compare
	to the SQLite-based version's 360 above.)

I: file_bind=251978, file_rec=251977, digest=251976
132.12user 13.32system 6:21.35elapsed 38%CPU (0avgtext+0avgdata 233008maxresident)k
4096210inputs+8outputs (4major+14613minor)pagefaults 0swaps

	Unfortunately, the use of the previously recorded sessions is
	not implemented at this moment, so I don't have the space
	requirements for the "incremental update" case at hand, but I
	expect them to be times lower than those for the SQLite-based
	version.  Combined with the ability to easily compress or move
	away the older sessions, I hope this may finally get the tool to
	a usable and useful state.

	(For anyone wishing to try this version, its sources could also
	be found at [1, 2], under the fccs-2012-03-asn.1 branch.)


    Miscellaneous facts

	The software in question is written in Perl and was primarily
	tested on Debian GNU/Linux 6.0 "Squeeze" systems.  It depends on
	a few Perl packages, such as Digest::SHA and UUID, and also on
	either DBD::SQLite and DBI (SQLite-based version), or
	Convert::ASN1 (ASN.1-based one.)

	It's my intention to make the software available under the
	GNU General Public License version 3 or later, and the related
	data schemata will probably be released into the public domain.


    References

[1] http://gray.am-1.org/~ivan/archives/git/gitweb.cgi?p=fc-2012.git
[2] http://gray.am-1.org/~ivan/archives/git/fc-2012.git/

-- 
FSF associate member #7257

[toc] | [next] | [standalone]

#5

From	Ivan Shmakov <oneingray@gmail.com>
Date	2012-03-19 18:03 +0700
Message-ID	<86fwd43lai.fsf@gray.siamics.net>
In reply to	#4

>>>>> Ivan Shmakov <oneingray@gmail.com> writes:

	[Cross-posting to news:comp.software.testing, too.]

[...]

 > Now, to the numbers.  For the test run, I've processed 47151 files
 > (41285 regular, 2.6 GiB in total), and it took only 6:21 to produce
 > the corresponding 251978 records (10 MiB in total.)

	More precisely, 251978 (or, rather, 251979) is the number of
	/object definition/ records only.

	The distribution of the record types is as follows:

 251979 objectDef
  48253     blob
  47192     filename
  47151     fileBind
  42883     solidStat
  42883     fileRecord
  23615     digest
      1     uuid
      1     digestKind
  47151 objectConfirm
   2559 filePadding
    381 timestamp
      1 sessionHeader
      1 sessionTrailer

	(Please also note that for the testing purposes, the block
	padding size was set to 4 KiB, which is a rather low value.  I
	guess that the reasonable default for the released version of
	the code would be around 128 KiB, and the number of filePadding
	records will be proportionally lower.)

 > The time is comparable to raw sha1sum(1) (about 4:20), and the space
 > demands are roughly 222 bytes per file on average (compare to the
 > SQLite-based version's 360 above.)

 > I: file_bind=251978, file_rec=251977, digest=251976
 > 132.12user 13.32system 6:21.35elapsed 38%CPU (0avgtext+0avgdata 233008maxresident)k
 > 4096210inputs+8outputs (4major+14613minor)pagefaults 0swaps

 > Unfortunately, the use of the previously recorded sessions is not
 > implemented at this moment, so I don't have the space requirements
 > for the "incremental update" case at hand, but I expect them to be
 > times lower than those for the SQLite-based version.  Combined with
 > the ability to easily compress or move away the older sessions, I
 > hope this may finally get the tool to a usable and useful state.

	And here they are:

I: loaded, id-start=0, id-next=251979
I: session
...
I: file_bind=251978, file_rec=251977, confirmed
128.16user 4.96system 2:37.34elapsed 84%CPU (0avgtext+0avgdata 2020096maxresident)k
35824inputs+0outputs (0major+127316minor)pagefaults 0swaps

	This time, only 207 digests were computed (preasumably for the
	new and changed files in the set), and most of the time was
	seemingly spent reading the "previous" session.

	The resulting file is less than 774 KiB, contains 61858 records
	total, 47393 of whose are "objectConfirm" ones (whose number
	should be equal to the number of files processed.)  The average
	space consumption is thus less than 17 bytes per file, which is
	a huge win when compared to the 86 bytes per file per session in
	the previous, SQLite-based version.

	The distribution of the record types in this "incremental" file
	is as follows:

  47393 objectConfirm
  14218 objectDef
   6578     fileRecord
   6578     fileBind
    369     blob
    272     solidStat
    243     filename
    177     digest
      1     file
    193 filePadding
     51 timestamp
      1 objectMapDef
      1 sessionHeader
      1 sessionTrailer

	The objectDef records here should be mainly fileRecord and
	fileBind ones, due to the changes in Unix access times (a-times)
	of the files processed.  Should a-time recording be disabled (or
	the filesystem be mounted with a-time updates disabled), I
	expect much fewer of these records.

 > (For anyone wishing to try this version, its sources could also be
 > found at [1, 2], under the fccs-2012-03-asn.1 branch.)

	... Though I'm yet to upload the recent changes there.

[...]

 > [1] http://gray.am-1.org/~ivan/archives/git/gitweb.cgi?p=fc-2012.git
 > [2] http://gray.am-1.org/~ivan/archives/git/fc-2012.git/

-- 
FSF associate member #7257

[toc] | [prev] | [next] | [standalone]

#6

From	Ivan Shmakov <oneingray@gmail.com>
Date	2012-03-20 23:37 +0700
Message-ID	<86wr6fz0s5.fsf@gray.siamics.net>
In reply to	#5

	Just a couple more tests.


    Case A: 37 GiB, 1.58 MiN file archive

$ df -- /mnt 
Filesystem           1K-blocks      Used Available Use% Mounted on
XXXX                  41274541  38857603    320938 100% /mnt
$ df -i -- /mnt 
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
XXXX                 5242880 1661012 3581868   32% /mnt
$ find /mnt -xdev -print0 \
      | LC_ALL=C /usr/bin/time fc-scan -o fc-asn > fc-asn.log 2>&1 
$ tail -n3 < fc-asn.log 
I: file_bind=9817000, file_rec=9816999, digest=9816998
5145.29user 340.74system 3:23:05elapsed 45%CPU (0avgtext+0avgdata 15393312maxresident)k
61162914inputs+0outputs (4768major+1013779minor)pagefaults 0swaps
$ 

	The resulting ASN.1 (BER) "log" file and the program's own text
	log ("debug output") are as follows:

410017823 Mar 20 03:50 fc-asn
292657597 Mar 20 03:50 fc-asn.log

	Which gives roughly 247 bytes and 5.91 records per filesystem's
	inode (or 10.6 bytes and 0.26 records per KiB), at 6.9 MiB and
	303 inodes processed per 1 second CPU time.


    Case B: 15.9 GiB file archive

$ df -- /mnt 
Filesystem           1K-blocks      Used Available Use% Mounted on
XXXX                  24770940  16678072   6834580  71% /mnt
$ find /mnt -xdev -print0 \
      | LC_ALL=C /usr/bin/time fc-scan -o fc-asn > fc-asn.log 2>&1 
$ tail -n3 < fc-asn.log 
I: file_bind=919966, file_rec=919965, undigestible
494.76user 28.93system 20:31.95elapsed 42%CPU (0avgtext+0avgdata 1495824maxresident)k
7575336inputs+0outputs (4major+93549minor)pagefaults 0swaps
$ 

	The resulting ASN.1 (BER) "log" file and the program's own text
	log ("debug output") are as follows:

38083266 Mar 19 23:42 fc-asn
24922215 Mar 19 23:42 fc-asn.log

	Which gives roughly 2.28 bytes and 0.06 records per KiB, at
	31.1 MiB processed per 1 second CPU time.

-- 
FSF associate member #7257

[toc] | [prev] | [next] | [standalone]

#7

From	Ivan Shmakov <oneingray@gmail.com>
Date	2012-03-22 11:41 +0700
Message-ID	<86fwd1xn6f.fsf@gray.siamics.net>
In reply to	#5

>>>>> Ivan Shmakov <oneingray@gmail.com> writes:
>>>>> Ivan Shmakov <oneingray@gmail.com> writes:

[...]

 >> Unfortunately, the use of the previously recorded sessions is not
 >> implemented at this moment, so I don't have the space requirements
 >> for the "incremental update" case at hand, but I expect them to be
 >> times lower than those for the SQLite-based version.  Combined with
 >> the ability to easily compress or move away the older sessions, I
 >> hope this may finally get the tool to a usable and useful state.

 > And here they are:

[...]

	One thing to note, however, is that there's no proper support
	for "incrementally updating" a session which is in turn an
	"incremental update."  IOW, if one records session A, and then
	records an "update" B, it's not currently possible to use both A
	and B combined as the basis for a further update C.

	As a work-around, the C session may be recorded as an update to
	A instead.  (Thus getting all the files changed or added since A
	rehashed for both B, C, and all the subsequent sessions based on
	A.)  Like:

    A --> B --> C --> D         currently not implemented;

    A                           possible, though not as time and space
    |--> B                      efficient.
    |--> C
    `--> D

	The support for such a mode of operation shouldn't be that hard
	to implement, but I wish to make some clean-ups to the code
	before that.  (In particular: split the code into separate Perl
	modules, commit the "dump" utility source to the Git repository,
	provide a CPAN package, allow for a-times and other not that
	important metadata to be omitted from the record.)

 >> (For anyone wishing to try this version, its sources could also be
 >> found at [1, 2], under the fccs-2012-03-asn.1 branch.)

 > ... Though I'm yet to upload the recent changes there.

	Done.

 >> [1] http://gray.am-1.org/~ivan/archives/git/gitweb.cgi?p=fc-2012.git
 >> [2] http://gray.am-1.org/~ivan/archives/git/fc-2012.git/

-- 
FSF associate member #7257

[toc] | [prev] | [standalone]

csiph-web

filesystem content tracking

Contents

#4 — filesystem content tracking

#5

#6

#7