From: Michael Hrivnak
Newsgroups: comp.lang.python
Subject: Re: How to safely maintain a status file
Date: Mon, 9 Jul 2012 16:47:22 -0400
Cc: python-list@python.org

Please consider batching this data and doing larger writes. Thrashing
the hard drive is not a good plan for performance or hardware
longevity. For example, crawl an entire FQDN and then write out the
results in one operation. If your job fails in the middle and you
have to start that FQDN over, no big deal. If that's too big of a
chunk for your purposes, perhaps break each FQDN up into top-level
directories and crawl each of those in one operation before writing to
disk.

There are existing solutions for managing job queues, so you can
choose what you like. If you're unfamiliar, maybe start by looking at
celery.

Michael

On Mon, Jul 9, 2012 at 1:52 AM, Plumo wrote:
>> What are you keeping in this status file that needs to be saved
>> several times per second? Depending on what type of state you're
>> storing and how persistent it needs to be, there may be a better way
>> to store it.
>>
>> Michael
>
> This is for a threaded web crawler. I want to cache what URL's are
> currently in the queue so if terminated the crawler can continue next
> time from the same point.
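
To make the batching suggestion concrete, here is a rough sketch of "crawl one FQDN, then write once". It is only an illustration under some assumptions: the names group_by_host, save_status and crawl_in_batches are made up for this example, crawl_host is a hypothetical callback standing in for the real crawler, the status is assumed to fit in a small JSON file, and it uses os.replace, which needs Python 3.3 or later. The write goes to a temporary file first and is then renamed over the old status file, so a crash mid-write should leave the previous copy intact.

```python
import json
import os
import tempfile
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URLs by hostname so each host can be crawled as one batch."""
    batches = defaultdict(list)
    for url in urls:
        batches[urlsplit(url).hostname].append(url)
    return dict(batches)


def save_status(path, status):
    """Write the whole status in one operation, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(status, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def crawl_in_batches(urls, status_path, crawl_host):
    """Crawl one host at a time; write the status file once per finished host."""
    pending = group_by_host(urls)
    for host in list(pending):
        crawl_host(host, pending[host])  # hypothetical callback doing the real crawl
        del pending[host]
        save_status(status_path, {"pending": pending})
```

If the job dies partway through a host, that host is still listed under "pending" in the last saved file, so the next run starts that FQDN over, which matches the "no big deal" trade-off described above. A threaded crawler would need to hand finished hosts to a single writer (or hold a lock around save_status); that part is left out here.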