From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: python distributed computing
Date: Wed, 3 Aug 2011 16:51:20 +0100

On Wed, Aug 3, 2011 at 4:37 PM, julien godin wrote:
> I have a HUUUUUGE(everything is relative) amount of data coming from UDP/514
> ( good old syslog), and one process alone could not handle them ( it's
> full of regex, comparison and counting. all this at about 100K lines of data
> per seconds and each line is about 1500o long. I tried to handle it on one
> process, 100% core time in htop. )
> So i said to myself : why not distribute the computing ?

You could brute-force it by forking out to multiple computers, but
that would entail network traffic, which might defeat the purpose. If
you have a multi-core CPU or multi-CPU computer, a better option would
be to look into the multiprocessing module for some simple ways to
divide the work between cores/CPUs.

Otherwise, there's a few things to try. First and most important
thing: Optimize your algorithms! Can you do less work and still get
the same result? For instance, can you replace the regex with two or
three simpler string methods?
Profile your code to find out where most of the time is being spent,
and see if you can recode those parts.

Second: Try PyPy or Cython for higher-performance Python code.

And third, if you still can't get the performance you need, consider
changing languages. Python isn't the best language for fast execution
- it's good for fast coding. Rewrite some or all of your code in C,
Pike, COBOL, raw assembly language... okay, maybe not those last two!
Since you profiled your code up in step 1, you'll know which parts are
the best candidates for C code.

I think there are quite a few options better than forking across
computers; although distributed computing IS a lot of fun.

Chris Angelico
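
To make two of the suggestions above concrete (the multiprocessing module, and doing a cheap string test before reaching for a regex), here is a minimal sketch. It assumes the syslog lines have already been read off the socket into a list. The `FAILED_LOGIN` pattern, the function names `count_chunk`, `chunked` and `count_parallel`, and the sample log lines are all hypothetical, since the original poster's actual regexes and counting rules are not shown in the thread.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Hypothetical rule: the real patterns are not given in the thread.
FAILED_LOGIN = re.compile(r"Failed password for (\S+)")


def count_chunk(lines):
    """Count failed logins per user in one chunk of syslog lines."""
    counts = Counter()
    for line in lines:
        # Cheap substring test first; the regex only runs on likely matches.
        if "Failed password" not in line:
            continue
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def chunked(lines, size):
    """Yield successive slices of `lines`, each at most `size` long."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


def count_parallel(lines, workers=4, chunk_size=10000):
    """Spread the chunks over `workers` processes and merge the counts."""
    total = Counter()
    with Pool(processes=workers) as pool:
        # Chunk order does not matter because the partial counts are summed.
        for partial in pool.imap_unordered(count_chunk,
                                           chunked(lines, chunk_size)):
            total.update(partial)
    return total


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample = [
        "Aug  3 16:37:01 host sshd[101]: Failed password for alice from 10.0.0.1 port 22 ssh2",
        "Aug  3 16:37:02 host sshd[102]: Accepted password for bob from 10.0.0.2 port 22 ssh2",
        "Aug  3 16:37:03 host sshd[103]: Failed password for alice from 10.0.0.1 port 22 ssh2",
        "Aug  3 16:37:04 host sshd[104]: Failed password for carol from 10.0.0.3 port 22 ssh2",
    ]
    print(count_parallel(sample, workers=2, chunk_size=2))
```

Sending chunks rather than single lines keeps the inter-process pickling overhead small relative to the work done per task. Whether this actually beats a single optimized process depends on the real workload, so profile both before committing to either.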