Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed2a.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.000 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'example:': 0.03; '"""': 0.07; 'attribute': 0.07; 'cache': 0.07; 'method.': 0.07; 'nicely': 0.07; 'rename': 0.07; 'welcome.': 0.07; '(best': 0.09; 'assigning': 0.09; 'check,': 0.09; 'filename': 0.09; 'filenames': 0.09; 'function,': 0.09; 'function:': 0.09; 'iterate': 0.09; 'objects,': 0.09; 'stale': 0.09; 'url:github': 0.09; 'window.': 0.09; 'cc:addr:python-list': 0.11; 'def': 0.12; "wouldn't": 0.14; 'changes': 0.15; '"in"': 0.16; '@property': 0.16; 'attributes.': 0.16; 'benjamin': 0.16; 'brackets.': 0.16; 'cached': 0.16; 'caches': 0.16; 'class:': 0.16; 'defined.': 0.16; 'dictionary.': 0.16; 'elsewhere,': 0.16; 'from:addr:cs': 0.16; 'from:addr:zip.com.au': 0.16; 'from:name:cameron simpson': 0.16; 'ignoring': 0.16; 'in-memory': 0.16; 'insensitive': 0.16; 'iterating': 0.16; 'mangled': 0.16; 'message-id:@cskk.homeip.net': 0.16; 'mtime': 0.16; 'naming': 0.16; 'objects.': 0.16; 'optional': 0.16; 'personally,': 0.16; 'portable': 0.16; 'readable': 0.16; 'received:cskk.homeip.net': 0.16; 'received:homeip.net': 0.16; 'simpson': 0.16; 'skip:> 50': 0.16; 'stable.': 0.16; 'stuff.': 0.16; 'term.': 0.16; 'tuples,': 0.16; 'um.': 0.16; 'unix,': 0.16; 'write,': 0.16; 'wrote:': 0.18; 'code.': 0.18; 'library': 0.18; 'bit': 0.19; 'module': 0.19; 'result.': 0.19; 'typing': 0.19; 'usability': 0.19; 'later': 0.20; 'input': 0.22; 'memory': 0.22; 'platforms': 0.22; 'example': 0.22; 'import': 0.22; 'saying': 0.22; 'cc:addr:python.org': 0.22; 'header:User-Agent:1': 0.23; 'browsers': 0.24; 'convenient': 0.24; 'directory.': 0.24; 'earlier': 0.24; 'cheers,': 0.24; '(or': 0.24; 'cc:2**0': 0.24; 'class.': 0.26; 'define': 0.26; 'skip:" 30': 0.26; 'pass': 0.26; 'least': 0.26; 'skip:" 20': 0.27; 'skip:_ 20': 0.27; 'header:In- Reply-To:1': 0.27; 'point': 0.28; 'function': 0.29; 'correct': 0.29; 'points': 0.29; 'unix': 0.29; 'needed.': 0.30; 'strongly': 0.30; "i'm": 0.30; 'code': 0.31; 'comments': 0.31; 'that.': 0.31; 'too.': 0.31; 'usually': 0.31; 'clock': 0.31; 'everywhere': 0.31; 'indentation': 0.31; 'lot.': 0.31; 'modified,': 0.31; 'names.': 0.31; 'object.': 0.31; 'though.': 0.31; 'file': 0.32; 'class': 0.32; 'this.': 0.32; 'probably': 0.32; '>from': 0.68; 'below.': 0.71; 'therefore': 0.72; 'obvious': 0.74; 'special': 0.74; 'goal': 0.75; 'yourself': 0.78; 'low': 0.83; '"not': 0.84; 'avoids': 0.84; 'collection.': 0.84; 'comment.': 0.84; 'complaint': 0.84; 'everywhere.': 0.84; 'feedback,': 0.84; 'listings,': 0.84; 'received:192.168.15': 0.84; 'remark': 0.84; 'remarks': 0.84; 'remembering': 0.84; 'reply-to:addr:python.org': 0.84; 'short,': 0.84; 'stat': 0.84; 'underneath': 0.84; 'witness': 0.84; 'your:': 0.84; 'habit': 0.91; 'prone': 0.91; 'these.': 0.91; 'directly.': 0.95; 'serious': 0.97 Date: Fri, 16 May 2014 10:25:20 +1000 From: Cameron Simpson To: Benjamin Schollnick Subject: Re: Directory Caching, suggestions and comments? MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) References: Cc: python-list X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list Reply-To: python-list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 277 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1400199941 news.xs4all.nl 2965 [2001:888:2000:d::a6]:56080 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:71634 On 15May2014 15:34, Benjamin Schollnick wrote: >I am going to be using this code as part of a web system, and I would love >any feedback, comments and criticism. [...] >I am using scandir from benhoyt to speed up the directory listings, and >data collection. [...] >I had considered using OrderedDicts, but I really didn't see how that would >help the system. > >I'm not completely happy with the return_sort_* functions, since they >return two different tuples, one goal was to try to keep everything in the >dictionary, but I couldn't think of a better method. > >So any suggestions are welcome. Firstly, I'm with ChrisA on the whole: let the filesystem/OS do the stat caching. It _is_ actually slower to rely on this than an in-memory cache such as yours, but at least it is reliable because you will then have exactly the same file info state as any other user of the OS. Stale caches are a usability PITA for the end user; witness the continued presence of "Shift-Reload" in most browsers because browser (and proxy) caches get stale. His remark about relying on the mtime of a directory is correct also, and not just because of clock changes. On UNIX, the mtime of a directory changes _only_ when a filename is added/removed from the directory. To the point, it will not change if a file in the directory is just modified, and therefore your cache will become stale (and stay stale) in that circumstance. Of course, you wouldn't even be bothering with scandir if you were not concerned about relying of the filesystem/OS. So, it is your call about remembering this stuff. I would include a real-time staleness in addition to the mtime check, eg if the mtime is updated _or_ the system time says it is more than N seconds since the last directory scan, with N being smallish. Most of my other remarks have to do with style and implementation. >Preqs - > Scandir - https://github.com/benhoyt/scandir > scandir is a module which provides a generator version of > os.listdir() that also exposes the extra file information the > operating system returns when you iterate a directory. Personally, my habit is to use os.listdir() and accept the memory cost for two reasons: it is simple and the list is stable. If you iterate over a directory (eg via the C library readdir() facility) to my mind there is scope for the directory changing underneath you. A bit like iterating over a dictionary which is in active use by other code. Using os.listdir() minimises that window. So I'd be going: names = list(scandir(directory_path)) to get a snapshot, and then working with that list. >from stat import ST_MODE, ST_INO, ST_DEV, ST_NLINK, ST_UID, ST_GID, \ > ST_SIZE, ST_ATIME, ST_MTIME, ST_CTIME You don't need these. os.lstat() and os.stat() return stat objects, and they have convenient .st_mtime etc attributes. So all your: data["st_nlink"] = st[ST_NLINK] stuff can become: data["st_nlink"] = st.st_nlink I'd also be making your "data" object just an object, and assigning directory to an attribute on it, thus: data.st_nlink = st.st_nlink Much more readable. You could also avoid the entire issue of manually copying each st.st_blah attribute by just going: data.st = st Then just reach for data.st.st_mtime (or whatever) as needed. Short, quick, avoids needing to worry about optional stat fields - just keep the original stat result. >import time >import scandir > >plugin_name = "dir_cache" > >##################################################### >class CachedDirectory(object): > """ > For example: > > To be added shortly. It is _well_ work sketching this part out. Since this is just a cache, an example should be quite short. I find it enormously convenient to sketch some of the use cases here - it helps keep the required method names nicely defined. i.e. sketch a little code that you need to write, then write methods to support the code. You get a much better match that way. I often revisit classes when I use them, because I write some code using the class and think "that was very painful, how would I _like_ to have been able to use this"? > """ > def __init__(self): > self.files_to_ignore = ['.ds_store', '.htaccess'] > self.root_path = None You don't seem to use .root_path? > # This is the path in the OS that is being examined > # (e.g. /Volumes/Users/username/) > self.directory_cache = {} This is a CachedDirectory class. I would have just called this .cache - it will make all the code using it less wordy, and hopefully more readable. Unless there may be caches of other stuff too. This is almost entirely a matter of personal style though. > def _scan_directory_list(self, scan_directory): > """ > Scan the directory "scan_directory", and save it to the > self.directory_cache dictionary. > > Low Level function, intended to be used by the populate >function. > """ > scan_directory = os.path.abspath(scan_directory) > directories = {} > files = {} > self.directory_cache[scan_directory.strip().lower()] = {} > self.directory_cache[scan_directory.strip().lower()]["number_dirs"] = 0 You write "scan_directory.strip().lower()". It if probably worth going: norm_directory = scan_directory.strip().lower() and saying "norm_directory" throughout. Also, you say: self.directory_cache[scan_directory.strip().lower()] You can also bind: cache = self.directory_cache[norm_directory] right after binding "norm_directory", and say "cache" throughout instead of the longer term. It is naming the same object. >self.directory_cache[scan_directory.strip().lower()]["number_files"] = 0 > for x in scandir.scandir(scan_directory): > st = x.lstat() > data = {} As mentioned earlier, you could usefully define a tiny class: class Dirent(object): pass and go: data = Dirent() which will allow you to replace: > data["fq_filename"] = os.path.realpath(scan_directory).lower() + \ > os.sep+x.name.strip().lower() with: data.fq_filename = os.path.realpath( os.path.join(norm_directory, x.name.strip().lower()) ) Note also the use of os.path.join: it is portable across OSes (eg UNIX versus Windows). Also, well worth defining: norm_name = x.name.strip().lower() and just saying "norm_name" throughout. More readable, less prone to typing accidents. Um. A serious point here: why do you go .strip().lower() to all your filenames and pathnames? On a real UNIX system that will usually mean a different object. Many other platforms may have case insensitive names, but will still complaint about names with and without whitespace as different objects. I would even expect os.path.realname to be unreliable if handed such mangled names. If you are sanitising some input I would recommend doing that elsewhere, and not binding some much special knowledge into this class. Maybe you have good reasons for that. At the least those reasons need to be expressed in comments (best in the opening docstring), because otherwise I would recommend strongly to make the cache class as naive about names as possible (i.e. do not mangle them). > data["parentdirectory"] = os.sep.join(\ > os.path.split(scan_directory)[0:-1]) Use os.path.dirname. > data["st_mode"] = st[ST_MODE] > data["st_inode"] = st[ST_INO] > data["st_dev"] = st[ST_DEV] > data["st_nlink"] = st[ST_NLINK] > data["st_uid"] = st[ST_UID] > data["st_gid"] = st[ST_GID] > data["compressed"] = st[ST_SIZE] "compressed" is worth a comment. > data["st_size"] = st[ST_SIZE] #10 > data["st_atime"] = st[ST_ATIME] #11 > data["raw_st_mtime"] = st[ST_MTIME] #12 > data["st_mtime"] = time.asctime(time.localtime(st[ST_MTIME])) I strongly recommend against this. Keep "st_mtime" (or data.st.st_mtime per my earlier suggestion) as the raw mtime. Any human readable transcription of the time should get a name like "human_st_mtime". Better still, if you make a Dirent class for "data", you can give it a property like: @property def human_st_mtime(self): return time.asctime(time.localtime(self.st.st_mtime)) and just use ".human.st.mtime" like any other attribute later when needed. Note: no brackets. > data["st_ctime"] = st[ST_CTIME] > if not x.name.strip().lower() in self.files_to_ignore: Personally I phrase this as "a not in b" instead of "not a in b". Also, if you really are ignoring it, you could go: if norm_name in self.files_to_ignore: # do nothing else with this entry continue and save yourself some indentation below. Not everyone likes this. > def directory_in_cache(self, scan_directory): > """ > Pass the target directory > > Will return True if the directory is already cached > Will return False if the directory is not already cached > """ > scan_directory = os.path.realpath(scan_directory).lower().strip() Probably worth making normalisation function: def norm_name(name): return name.lower().strip() becuase you do this a lot. Using a function ensures you do it the same way everywhere and makes it obvious that it is the same thing with the same purpose everywhere. > return scan_directory in self.directory_cache.keys() You can just say: return scan_directory in self.directory_cache and for added points you can rename "directory_in_cache" to "__contains__". That way from outside (or inside) you use "in" directly. Example: if "foo" in self: Cheers, Cameron Simpson