Directory Caching, suggestions and comments?

Date Thu, 15 May 2014 15:34:50 -0400
Subject Directory Caching, suggestions and comments?
From Benjamin Schollnick <benjamin@schollnick.net>
To python-list <python-list@python.org>
Newsgroups comp.lang.python
Message-ID <mailman.10047.1400182501.18130.python-list@python.org> (permalink)

Folks,

I am going to be using this code as part of a web system, and I would love
any feedback, comments and criticism.

Just as a side note, the code is not completely PEP 8 compliant.  I know
that; I use a slightly laxer setting in pylint, but I'm working my way up
to it...

I am using scandir from benhoyt to speed up the directory listings, and
data collection.
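
A minimal sketch of the kind of listing scandir makes cheap, assuming
Python 3.6 or later, where the same API ships in the standard library as
os.scandir(). The helper name list_entries is hypothetical and not part of
the module below.

```python
import os


def list_entries(path):
    """Return (name, is_dir, size, mtime) tuples for each entry in path."""
    entries = []
    # os.scandir() yields DirEntry objects; on most platforms is_dir()
    # needs no extra system call for regular entries.
    with os.scandir(path) as it:
        for entry in it:
            # follow_symlinks=False mirrors the lstat() call used below.
            st = entry.stat(follow_symlinks=False)
            entries.append(
                (entry.name, entry.is_dir(), st.st_size, st.st_mtime))
    return entries
```

Calling list_entries(".") returns one tuple per entry in the current
directory, in whatever order the operating system reports them.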

The module is here as well,
https://dl.dropboxusercontent.com/u/241415/misc/directory_caching_v1.py

I had considered using OrderedDicts, but I really didn't see how that would
help the system.

I'm not completely happy with the return_sort_* functions, since each one
returns a tuple of two separate lists (files and directories).  One goal
was to try to keep everything in the dictionary, but I couldn't think of a
better method.
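
One possible direction, sketched under assumptions rather than offered as
the answer: a single helper that takes the sort key as a parameter and
returns a dictionary with "files" and "dirs" entries. The name
return_sorted and its parameters are hypothetical; the field names
("filename", "directoryname", "raw_st_mtime") are the ones used in the
module below.

```python
def return_sorted(cache_entry, file_key, dir_key=None, reverse=False):
    """Sort one cached directory entry by the given data field(s).

    cache_entry is a dict with "files" and "dirs" sub-dicts, each mapping
    a name to its data dict. dir_key defaults to file_key.
    """
    if dir_key is None:
        dir_key = file_key
    return {
        "files": sorted(cache_entry["files"].values(),
                        key=lambda data: data[file_key],
                        reverse=reverse),
        "dirs": sorted(cache_entry["dirs"].values(),
                       key=lambda data: data[dir_key],
                       reverse=reverse),
    }


# Small made-up entry, only to show the call shape.
entry = {
    "files": {
        "b.txt": {"filename": "b.txt", "raw_st_mtime": 10},
        "a.txt": {"filename": "a.txt", "raw_st_mtime": 20},
    },
    "dirs": {
        "z": {"directoryname": "z", "raw_st_mtime": 5},
    },
}
by_name = return_sorted(entry, "filename", "directoryname")
by_mtime = return_sorted(entry, "raw_st_mtime", reverse=True)
```

This would replace the three near-identical return_sort_* methods with one
function, at the cost of callers having to know the field names.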

So any suggestions are welcome.

     - Benjamin

----
"""
    Directory Caching system.

    Used to cache & speed up directory listing.

Prereqs -

    Scandir - https://github.com/benhoyt/scandir

    scandir is a module which provides a generator version of
    os.listdir() that also exposes the extra file information the
    operating system returns when you iterate a directory.

    Generally 2-3 (or more) times faster than the standard library.
    (It's quite noticeable!)
"""
import os
import os.path
import re
from stat import ST_MODE, ST_INO, ST_DEV, ST_NLINK, ST_UID, ST_GID, \
                    ST_SIZE, ST_ATIME, ST_MTIME, ST_CTIME

import time
import scandir

plugin_name = "dir_cache"

#####################################################
class CachedDirectory(object):
    """
    For example:

        To be added shortly.

    """
    def __init__(self):
        self.files_to_ignore = ['.ds_store', '.htaccess']
        self.root_path = None
            # This is the path in the OS that is being examined
            #    (e.g. /Volumes/Users/username/)
        self.directory_cache = {}


    def _scan_directory_list(self, scan_directory):
        """
            Scan the directory "scan_directory", and save it to the
            self.directory_cache dictionary.

            Low Level function, intended to be used by the populate
            function.
        """
        scan_directory = os.path.abspath(scan_directory)
        directories = {}
        files = {}
        self.directory_cache[scan_directory.strip().lower()] = {}
        self.directory_cache[scan_directory.strip().lower()]\
                ["number_dirs"] = 0
        self.directory_cache[scan_directory.strip().lower()]\
                ["number_files"] = 0
        for x in scandir.scandir(scan_directory):
            st = x.lstat()
            data = {}
            data["fq_filename"] = \
                    os.path.realpath(scan_directory).lower() + \
                    os.sep+x.name.strip().lower()
            data["parentdirectory"] = os.sep.join(\
                    os.path.split(scan_directory)[0:-1])
            data["st_mode"] = st[ST_MODE]
            data["st_inode"] = st[ST_INO]
            data["st_dev"] = st[ST_DEV]
            data["st_nlink"] = st[ST_NLINK]
            data["st_uid"] = st[ST_UID]
            data["st_gid"] = st[ST_GID]
            data["compressed"] = st[ST_SIZE]
            data["st_size"] = st[ST_SIZE]       #10
            data["st_atime"] = st[ST_ATIME]     #11
            data["raw_st_mtime"] = st[ST_MTIME] #12
            data["st_mtime"] = time.asctime(time.localtime(st[ST_MTIME]))
            data["st_ctime"] = st[ST_CTIME]
            if x.name.strip().lower() not in self.files_to_ignore:
                if x.is_dir():
                    self.directory_cache[scan_directory.strip().lower()]\
                        ["number_dirs"] += 1
                    data["archivefilename"] = ""
                    data["filename"] = ""
                    data["directoryname"] = x.name.strip().lower()
                    data["dot_extension"] = ".dir"
                    data["file_extension"] = "dir"
                    directories[x.name.lower().strip()] = True
                    self._scan_directory_list(data["fq_filename"])
                    data["number_files"] = self.directory_cache\
                        [data["fq_filename"]]["number_files"]
                    data["number_dirs"] = self.directory_cache\
                        [data["fq_filename"]]["number_dirs"]
                    directories[x.name.lower().strip()] = data
                else:
                    self.directory_cache[scan_directory.strip().lower()]\
                        ["number_files"] += 1
                    data["archivefilename"] = ""
                    data["filename"] = x.name.strip().lower()
                    data["directoryname"] = scan_directory
                    data["dot_extension"] = os.path.splitext\
                        (x.name)[1].lower()
                    data["file_extension"] = os.path.splitext\
                        (x.name)[1][1:].lower()
                    files[x.name.lower().strip()] = data
        self.directory_cache[scan_directory.strip().lower()]\
                ["files"] = files
        self.directory_cache[scan_directory.strip().lower()]\
                ["dirs"] = directories
        self.directory_cache[scan_directory.strip().lower()]\
                ["last_scanned_time"] = time.time()
        return

    def directory_in_cache(self, scan_directory):
        """
            Pass the target directory

            Will return True if the directory is already cached
            Will return False if the directory is not already cached
        """
        scan_directory = os.path.realpath(scan_directory).lower().strip()
        return scan_directory in self.directory_cache

    def directory_changed(self, scan_directory):
        """
            Pass the target directory as scan_directory.

            Will return True if the directory has changed,
            or does not exist in cache.

            Returns False, if the directory exists in cache, and
            has not changed since the last read.

            This relies on the directory's Modified Time actually
            being updated since the last update.
        """
        if self.directory_in_cache(scan_directory):
            scan_directory = \
                    os.path.realpath(scan_directory).lower().strip()
            st = os.stat(scan_directory)
            return st[ST_MTIME] > self.directory_cache[scan_directory]\
                    ["last_scanned_time"]
        else:
            return True

    def smart_read(self, scan_directory):
        """
        This is a wrapper around the Read and changed functions.

        The scan_directory is passed in, converted to a normalized form,
        and then checked to see if it exists in the cache.

        If it doesn't exist (or is expired), then it is read.

        If it already exists *AND* has not expired, it is not
        updated.

        Net effect, this will ensure the directory is in cache, and
        up to date.
        """
        scan_directory = os.path.realpath(scan_directory).lower().strip()
        if self.directory_changed(scan_directory):
            self._scan_directory_list(scan_directory)


    def return_sort_name(self, scan_directory, reverse=False):
        """
        Return sorted list(s) from the Directory Cache for the
        Scanned directory, sorted by name.

        Returns a tuple of 2 lists of data, T[0] - Files, and T[1] - Directories
        which contain the data from the cached directory.
        """
        scan_directory = os.path.realpath(scan_directory).lower().strip()
        files = self.directory_cache[scan_directory]["files"]
        dirs = self.directory_cache[scan_directory]["dirs"]
        sorted_files = sorted(files.items(),
                              key=lambda t: t[1]["filename"],
                              reverse=reverse)
        sorted_dirs = sorted(dirs.items(),
                             key=lambda t: t[1]["directoryname"],
                             reverse=reverse)
        return (sorted_files, sorted_dirs)

    def return_sort_lmod(self, scan_directory, reverse=False):
        """
        Return sorted list(s) from the Directory Cache for the
        Scanned directory, sorted by Last Modified.

        Returns a tuple of 2 lists of data, T[0] - Files, and T[1] - Directories
        which contain the data from the cached directory.
        """
        scan_directory = os.path.realpath(scan_directory).lower().strip()
        files = self.directory_cache[scan_directory]["files"]
        dirs = self.directory_cache[scan_directory]["dirs"]
        sorted_files = sorted(files.items(),
                              key=lambda t: t[1]["raw_st_mtime"],
                              reverse=reverse)
        sorted_dirs = sorted(dirs.items(),
                             key=lambda t: t[1]["raw_st_mtime"],
                             reverse=reverse)
        return (sorted_files, sorted_dirs)

    def return_sort_ctime(self, scan_directory, reverse=False):
        """
        Return sorted list(s) from the Directory Cache for the
        Scanned directory, sorted by st_ctime (creation time on Windows,
        inode change time on Unix).

        Returns a tuple of 2 lists of data, T[0] - Files, and T[1] - Directories
        which contain the data from the cached directory.
        """
        scan_directory = os.path.realpath(scan_directory).lower().strip()
        files = self.directory_cache[scan_directory]["files"]
        dirs = self.directory_cache[scan_directory]["dirs"]
        sorted_files = sorted(files.items(),
                              key=lambda t: t[1]["st_ctime"],
                              reverse=reverse)
        sorted_dirs = sorted(dirs.items(),
                             key=lambda t: t[1]["st_ctime"],
                             reverse=reverse)
        return (sorted_files, sorted_dirs)
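
The change-detection idea behind directory_changed and smart_read can be
sketched standalone as follows. This is a simplified variation, not the
module's method: it stores the directory's own mtime at scan time and
compares against that, rather than against the wall-clock time of the last
scan. The class name MtimeCache is hypothetical, and like the original it
relies on the directory's modified time actually being updated.

```python
import os


class MtimeCache:
    """Cache directory listings, rescanning only when the mtime changes."""

    def __init__(self):
        # Maps a normalized path to (mtime_at_scan, sorted list of names).
        self._entries = {}

    def listing(self, path):
        path = os.path.realpath(path)
        mtime = os.stat(path).st_mtime
        cached = self._entries.get(path)
        # Rescan when the path is unknown or its mtime differs from the
        # value recorded at the previous scan.
        if cached is None or cached[0] != mtime:
            cached = (mtime, sorted(os.listdir(path)))
            self._entries[path] = cached
        return cached[1]
```

Repeated calls to listing() on an unchanged directory return the same
cached list object without calling os.listdir() again.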
