From: Oscar Benjamin
Newsgroups: comp.lang.python
Subject: Re: looping and searching in numpy array
Date: Mon, 14 Mar 2016 15:22:30 +0000

On 10 March 2016 at 13:02, Peter Otten <__peter__@web.de> wrote:
> Heli wrote:
>
>> I need to loop over a numpy array and then do the following search. The
>> following is taking almost 60(s) for an array (npArray1 and npArray2 in
>> the example below) with around 300K values.
>>
>>
>> for id in np.nditer(npArray1):
>>     newId=(np.where(npArray2==id))[0][0]

What are the dtypes of the arrays? And what are the typical sizes of
each of them? It can have a big effect on what makes a good solution to
the problem.

>> Is there any way I can make the above faster? I need to run the script
>> above on much bigger arrays (50M). Please note that my two numpy arrays in
>> the lines above, npArray1 and npArray2 are not necessarily the same size,
>> but they are both 1d.
>
> You mean you are looking for the index of the first occurrence in npArray2
> for every value of npArray1?
>
> I don't know how to do this in numpy (I'm not an expert), but even basic
> Python might be acceptable:

I'm not sure that numpy has any particular function that can be of use
here. Your approach below looks good though.

> lookup = {}
> for i, v in enumerate(npArray2):
>     if v not in lookup:
>         lookup[v] = i

Looking at this I wondered if there was a way to avoid the double hash
table lookup and realised it's the first time I've ever considered a
use for setdefault:

    for i, v in enumerate(npArray2):
        lookup.setdefault(v, i)

Another option would be to use this same algorithm in Cython. Then you
can access the ndarray data pointer directly and loop over it in C.
This is the kind of scenario where that sort of thing can be well
worth doing.

--
Oscar
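
For reference, the lookup-table idea discussed in this message can be put together as a small runnable sketch. This is only an illustration under assumptions: the sample arrays are made up, the helper name first_index_lookup is hypothetical, and converting with .tolist() (so that plain Python ints are hashed rather than numpy scalars) is an addition not taken from the thread.

```python
import numpy as np


def first_index_lookup(values):
    """Map each distinct value to the index of its first occurrence.

    Hypothetical helper name; 'values' is assumed to be a 1d numpy array.
    """
    lookup = {}
    # .tolist() yields plain Python scalars (an assumption of this sketch,
    # not something the thread prescribes).
    for i, v in enumerate(values.tolist()):
        # setdefault only stores i if v is not already a key, so the
        # first index seen for each value is the one that is kept.
        lookup.setdefault(v, i)
    return lookup


# Made-up sample data standing in for the poster's arrays.
npArray1 = np.array([30, 10, 20, 10])
npArray2 = np.array([10, 20, 10, 30, 20])

lookup = first_index_lookup(npArray2)

# For every value in npArray1, find its first index in npArray2.
# A value missing from npArray2 raises KeyError here.
new_ids = np.array([lookup[v] for v in npArray1.tolist()])
print(new_ids)
```

The table is built in one pass over npArray2 and each query is then a single dict lookup, instead of a full np.where scan of npArray2 for every element of npArray1.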