looping and searching in numpy array

Started by	Heli <hemla21@gmail.com>
First post	2016-03-10 03:43 -0800
Last post	2016-03-14 15:22 +0000
Articles	9 — 6 participants

  looping and searching in numpy array Heli <hemla21@gmail.com> - 2016-03-10 03:43 -0800
    Re: looping and searching in numpy array Peter Otten <__peter__@web.de> - 2016-03-10 14:02 +0100
      Re: looping and searching in numpy array Heli <hemla21@gmail.com> - 2016-03-10 08:48 -0800
        Re: looping and searching in numpy array Heli <hemla21@gmail.com> - 2016-03-10 08:50 -0800
        RE: looping and searching in numpy array Albert-Jan Roskam <sjeik_appie@hotmail.com> - 2016-03-13 13:51 +0000
        RE: looping and searching in numpy array Albert-Jan Roskam <sjeik_appie@hotmail.com> - 2016-03-13 15:43 +0000
    Re: looping and searching in numpy array Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-10 13:22 +0000
    Re: looping and searching in numpy array srinivas devaki <mr.eightnoteight@gmail.com> - 2016-03-14 10:19 +0530
    Re: looping and searching in numpy array Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-03-14 15:22 +0000

#104501 — looping and searching in numpy array

From	Heli <hemla21@gmail.com>
Date	2016-03-10 03:43 -0800
Subject	looping and searching in numpy array
Message-ID	<77bd470b-cc05-4117-9ed1-6309d7a5633a@googlegroups.com>

Dear all, 

I need to loop over a numpy array and do the following search for each element. The code below takes almost 60 s for arrays (npArray1 and npArray2 in the example) with around 300K values.


for id in np.nditer(npArray1):
    newId = np.where(npArray2 == id)[0][0]


Is there any way I can make the above faster? I need to run the script on much bigger arrays (50M). Please note that my two numpy arrays, npArray1 and npArray2, are not necessarily the same size, but they are both 1d.


Thanks a lot for your help,

#104511

From	Peter Otten <__peter__@web.de>
Date	2016-03-10 14:02 +0100
Message-ID	<mailman.126.1457614964.15725.python-list@python.org>
In reply to	#104501

Heli wrote:

> Dear all,
> 
> I need to loop over a numpy array and then do the following search. The
> following is taking almost 60(s) for an array (npArray1 and npArray2 in
> the example below) with around 300K values.
> 
> 
> for id in np.nditer(npArray1):
>                   
>        newId=(np.where(npArray2==id))[0][0]
> 
> 
> Is there anyway I can make the above faster? I need to run the script
> above on much bigger arrays (50M). Please note that my two numpy arrays in
> the lines above, npArray1 and npArray2  are not necessarily the same size,
> but they are both 1d.

You mean you are looking for the index of the first occurrence in npArray2
for every value of npArray1?

I don't know how to do this in numpy (I'm not an expert), but even basic 
Python might be acceptable:

lookup = {}
for i, v in enumerate(npArray2):
    if v not in lookup:
        lookup[v] = i

for v in npArray1:
    print(lookup.get(v, "<not found>"))

That way you iterate once (in Python) instead of 2*len(npArray1) times (in 
C) over npArray2.
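
A self-contained sketch of the dict-based lookup described above, for anyone who wants to try it on toy data. The function name first_indices, the missing=-1 sentinel and the sample arrays are illustrative assumptions, not part of the original post; tolist() is used on the assumption that iterating plain Python values is generally cheaper than iterating NumPy scalars.

```python
import numpy as np

def first_indices(haystack, needles, missing=-1):
    """For each value in needles, return the index of its first occurrence in haystack."""
    lookup = {}
    # One pass over haystack: remember only the first index seen for each value
    for i, v in enumerate(haystack.tolist()):
        if v not in lookup:
            lookup[v] = i
    # One dict lookup per needle; values absent from haystack map to `missing`
    return [lookup.get(v, missing) for v in needles.tolist()]

npArray2 = np.array([7, 3, 7, 5, 3])   # array being searched (toy data)
npArray1 = np.array([3, 5, 9, 7])      # values to look up (toy data)
print(first_indices(npArray2, npArray1))  # [1, 3, -1, 0]
```

The work is one pass over each array plus constant-time dict operations, rather than one full scan of npArray2 per element of npArray1.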

#104532

From	Heli <hemla21@gmail.com>
Date	2016-03-10 08:48 -0800
Message-ID	<0fca7583-c051-48bd-a9be-5e501722fc77@googlegroups.com>
In reply to	#104511

Dear Peter, 

Thanks for your reply. This really helped. It reduces the script time from 61 s to 2 s.

I am still very interested in knowing the correct numpy way to do this, but until then your fix works great.

Thanks a lot,
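
On the question of a numpy way: one possible approach, sketched here as an assumption rather than something proposed in this thread, combines np.unique(..., return_index=True), which returns the sorted distinct values together with the index of the first occurrence of each, with np.searchsorted to locate every query value. The function name first_indices_np and the missing=-1 sentinel are hypothetical, and the sketch assumes a non-empty npArray2.

```python
import numpy as np

def first_indices_np(haystack, needles, missing=-1):
    # Sorted distinct values of haystack and the index of each one's first occurrence
    values, first_idx = np.unique(haystack, return_index=True)
    # Binary search: position where each needle would sit among the sorted values
    pos = np.searchsorted(values, needles)
    # Needles larger than every value land at len(values); clip them into range
    pos = np.clip(pos, 0, len(values) - 1)
    # A needle is present only if the value at its position matches exactly
    found = values[pos] == needles
    return np.where(found, first_idx[pos], missing)

npArray2 = np.array([7, 3, 7, 5, 3])   # array being searched (toy data)
npArray1 = np.array([3, 5, 9, 7])      # values to look up (toy data)
print(first_indices_np(npArray2, npArray1).tolist())  # [1, 3, -1, 0]
```

This sorts npArray2 once and then does a binary search per query, so the cost is roughly O((n + m) log n) instead of O(n * m). Whether it beats the dict version depends on dtype and array sizes, so it is worth timing on real data.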

#104533

From	Heli <hemla21@gmail.com>
Date	2016-03-10 08:50 -0800
Message-ID	<c0f7b500-20ea-4ff1-ba9c-f707a4b8a957@googlegroups.com>
In reply to	#104532

And yes, I am looking for the index of the first occurrence in npArray2
for every value of npArray1.

#104767

From	Albert-Jan Roskam <sjeik_appie@hotmail.com>
Date	2016-03-13 13:51 +0000
Message-ID	<mailman.69.1457877150.12893.python-list@python.org>
In reply to	#104532


Hi, I suppose you have seen this already (in particular the first link): http://numpy-discussion.10968.n7.nabble.com/Implementing-a-quot-find-first-quot-style-function-td33085.html I don't think it's part of numpy yet.
Albert-Jan

#104769

From	Albert-Jan Roskam <sjeik_appie@hotmail.com>
Date	2016-03-13 15:43 +0000
Message-ID	<mailman.70.1457883867.12893.python-list@python.org>
In reply to	#104532


sorry, the correct url is: http://numpy-discussion.10968.n7.nabble.com/Implementing-a-quot-find-first-quot-style-function-td33085.html

#104514

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2016-03-10 13:22 +0000
Message-ID	<mailman.129.1457616312.15725.python-list@python.org>
In reply to	#104501

I'm no numpy expert, but if you're using a loop my guess is that you're
doing it wrong.  I suggest your first port of call is the numpy docs if
you haven't already been there, then the specific numpy mailing list
or stackoverflow, as it seems very likely that this type of question has
been asked before.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

#104799

From	srinivas devaki <mr.eightnoteight@gmail.com>
Date	2016-03-14 10:19 +0530
Message-ID	<mailman.85.1457930987.12893.python-list@python.org>
In reply to	#104501

The problem is in fact not related to numpy at all. The complexity of your
algorithm is O(len(npArray1) * len(npArray2)),

which means the number of computations you are doing is in the range of
10**10 (about 300K * 300K = 9 * 10**10 element comparisons for your arrays).

If the difference between the maximum and minimum elements of npArray2 is
less than 10**6 (and the values are integers), you can improve your code by
precomputing the first occurrence of each number in an array with one slot
per value in that range.

If your npArray2 doesn't have such a pattern, you have to precompute it
using a dict (I don't know if numpy has such a data structure).

An optimised pseudo code would look like

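import numpy as np

offset = npArray2.min()
mmdiff = int(npArray2.max() - offset)
if mmdiff < 10**6:
    # Dense table: precom[x - offset] holds the first index of x in npArray2.
    # It needs mmdiff + 1 slots to cover both the minimum and the maximum.
    precom = np.full(mmdiff + 1, -1)
    for i, x in enumerate(npArray2):
        if precom[x - offset] == -1:  # keep only the first occurrence
            precom[x - offset] = i
    for id in npArray1:
        if 0 <= id - offset <= mmdiff and precom[id - offset] != -1:
            new_id = precom[id - offset]
            # your code
else:
    # Sparse values: fall back to a dict built from npArray2
    precom = {}
    for i, x in enumerate(npArray2):
        if x not in precom:
            precom[x] = i
    for id in npArray1:
        if id in precom:
            new_id = precom[id]
            # your code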


You can just use the else case, which works for all inputs, but if your
npArray2 has such a pattern then the array-based branch will perform better.
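
The array-based idea can also be written without a Python-level loop. The following is only a sketch under stated assumptions: integer arrays, a non-empty npArray2 with a small value range, and hypothetical names (first_indices_dense, missing). It uses np.minimum.at, the unbuffered form of the minimum ufunc, so that repeated values keep their smallest index.

```python
import numpy as np

def first_indices_dense(haystack, needles, missing=-1):
    # Assumes integer dtypes and a non-empty haystack with a modest value range
    offset = int(haystack.min())
    span = int(haystack.max()) - offset + 1        # one slot per possible value
    n = len(haystack)
    table = np.full(span, n, dtype=np.int64)       # n marks "value not seen"
    slots = haystack.astype(np.int64) - offset
    # Unbuffered minimum: each slot ends up with the smallest index of its value
    np.minimum.at(table, slots, np.arange(n, dtype=np.int64))
    table[table == n] = missing
    shifted = needles.astype(np.int64) - offset
    in_range = (shifted >= 0) & (shifted < span)   # needles outside the table are misses
    result = np.full(len(needles), missing, dtype=np.int64)
    result[in_range] = table[shifted[in_range]]
    return result

npArray2 = np.array([7, 3, 7, 5, 3])   # array being searched (toy data)
npArray1 = np.array([3, 5, 9, 7])      # values to look up (toy data)
print(first_indices_dense(npArray2, npArray1).tolist())  # [1, 3, -1, 0]
```

The table costs memory proportional to max - min + 1, which is why the 10**6 threshold mentioned above matters.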

Regards
Srinivas Devaki
Junior (3rd yr) student at Indian School of Mines,(IIT Dhanbad)
Computer Science and Engineering Department
ph: +91 9491 383 249
telegram_id: @eightnoteight

#104823

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2016-03-14 15:22 +0000
Message-ID	<mailman.104.1457968972.12893.python-list@python.org>
In reply to	#104501

On 10 March 2016 at 13:02, Peter Otten <__peter__@web.de> wrote:
> Heli wrote:
>
>> I need to loop over a numpy array and then do the following search. The
>> following is taking almost 60(s) for an array (npArray1 and npArray2 in
>> the example below) with around 300K values.
>>
>>
>> for id in np.nditer(npArray1):
>>        newId=(np.where(npArray2==id))[0][0]

What are the dtypes of the arrays? And what are the typical sizes of
each of them? It can have a big effect on what makes a good solution
to the problem.

>> Is there anyway I can make the above faster? I need to run the script
>> above on much bigger arrays (50M). Please note that my two numpy arrays in
>> the lines above, npArray1 and npArray2  are not necessarily the same size,
>> but they are both 1d.
>
> You mean you are looking for the index of the first occurence in npArray2
> for every value of npArray1?
>
> I don't know how to do this in numpy (I'm not an expert), but even basic
> Python might be acceptable:

I'm not sure that numpy has any particular function that can be of use
here. Your approach below looks good though.

> lookup = {}
> for i, v in enumerate(npArray2):
>     if v not in lookup:
>         lookup[v] = i

Looking at this I wondered if there was a way to avoid the double hash
table lookup and realised it's the first time I've ever considered a
use for setdefault:

lookup = {}
for i, v in enumerate(npArray2):
    lookup.setdefault(v, i)

Another option would be to use this same algorithm in Cython. Then you
can access the ndarray data pointer directly and loop over it in C.
This is the kind of scenario where that sort of thing can be well
worth doing.
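
A small runnable sketch of the setdefault variant, keyed by value with the index as the default. The helper name build_lookup and the sample array are illustrative assumptions. dict.setdefault(key, default) stores the default only when the key is absent, so later duplicates leave the first index in place.

```python
import numpy as np

def build_lookup(haystack):
    lookup = {}
    for i, v in enumerate(haystack.tolist()):
        # Stores i only if v is not already a key, so the first index is kept
        lookup.setdefault(v, i)
    return lookup

print(build_lookup(np.array([7, 3, 7, 5, 3])))  # {7: 0, 3: 1, 5: 3}
```

A Cython version would run the same loop over the array's data buffer in C.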

--
Oscar
