From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: Directory Caching, suggestions and comments?
Date: Fri, 16 May 2014 05:49:04 +1000

On Fri, May 16, 2014 at 5:34 AM, Benjamin Schollnick wrote:
> Just as a side note, I'm not completely PEP 8. I know that, I use a
> slightly laxer setting in pylint, but I'm working my way up to it...
>
> I am using scandir from benhoyt to speed up the directory listings, and data
> collection.

First comment: You're running headlong into the two hardest problems in
computing - cache invalidation, and naming things. (And off-by-one
errors.)

More specifically, and leaving aside the naming issue as you're aware of
it, you have to cope with all sorts of messes of stale cache data. For
instance, you stat a directory and depend on its mtime - you can't
depend on that always being up-to-date, AND you can't rely on the clock
not shifting.
(What happens, for instance, if the server's onboard clock gains time at
a notable rate, and a firewall misconfiguration is blocking NTP - and
then you fix the firewall and the clock suddenly jumps backward by a few
hours? Yep. Happened to me. Well, I think the clock jumped maybe half an
hour, but it could easily have been a lot more.)

What platform are you running this on? On all my Linux systems, the file
system caches stat() info for me. That has never been a problem, because
the FS knows when it needs to update/flush that cache. All I need to
know is that having spare RAM means performance improves :) I can do a
"sudo find / -name ..." and it chugs and chugs, and then I do it again
and it's fast.

Windows, not so nice, but I'd still look at OS or FS caching where
possible. (Other platforms I don't personally use, so I don't know
whether or not they have good caching.)

ChrisA
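
To make the mtime point concrete, here is a minimal sketch of a directory-listing cache that tolerates a clock jumping backward. It assumes Python 3.6+ and the standard-library os.scandir() rather than the benhoyt scandir package mentioned in the quoted text; DirListingCache and listing() are hypothetical names invented for this illustration, not anything from the original code. The one design choice it shows is comparing the stored mtime for equality instead of asking "is the directory newer than my cache?", since a changed directory can end up with an older mtime after the clock is corrected.

```python
import os


class DirListingCache:
    """Hypothetical helper: cache the entry names of each directory."""

    def __init__(self):
        # path -> (st_mtime_ns seen just before the scan, sorted entry names)
        self._entries = {}

    def listing(self, path):
        # One stat() per lookup; cheap when the OS already caches stat info.
        mtime_ns = os.stat(path).st_mtime_ns
        cached = self._entries.get(path)
        # Reuse the cache only on an exact match. A "newer than" test would
        # keep serving stale names if the clock had jumped backward.
        if cached is not None and cached[0] == mtime_ns:
            return cached[1]
        with os.scandir(path) as it:
            names = sorted(entry.name for entry in it)
        # The mtime was read before scanning, so a change that races with
        # the scan causes an extra rescan later, not a stale result.
        self._entries[path] = (mtime_ns, names)
        return names
```

Calling cache.listing(path) twice in a row should return the cached list on the second call as long as the directory's mtime is unchanged. This still inherits the other weaknesses of mtime: two changes inside the filesystem's timestamp granularity can go unnoticed, and editing a file's contents does not touch the parent directory's mtime at all.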