Optimizing if statement check over a numpy value

Started by	Heli Nix <hemla21@gmail.com>
First post	2015-07-23 02:21 -0700
Last post	2015-07-29 07:23 -0700
Articles	5 — 4 participants

  Optimizing if statement  check over a numpy value Heli Nix <hemla21@gmail.com> - 2015-07-23 02:21 -0700
    Re: Optimizing if statement check over a numpy value MRAB <python@mrabarnett.plus.com> - 2015-07-23 10:55 +0100
    Re: Optimizing if statement check over a numpy value Laura Creighton <lac@openend.se> - 2015-07-23 12:13 +0200
    Re: Optimizing if statement  check over a numpy value Jeremy Sanders <jeremy@jeremysanders.net> - 2015-07-23 13:42 +0200
      Re: Optimizing if statement  check over a numpy value Heli Nix <hemla21@gmail.com> - 2015-07-29 07:23 -0700

#94435 — Optimizing if statement check over a numpy value

From	Heli Nix <hemla21@gmail.com>
Date	2015-07-23 02:21 -0700
Subject	Optimizing if statement check over a numpy value
Message-ID	<65c45685-dee1-41f8-a16a-7a062f4e7b02@googlegroups.com>

Dear all, 

I have the following piece of code. I am reading a numpy dataset from an hdf5 file and I am changing values to a new value if they equal 1. 

There is a 90 percent chance that the check (if id not in myList:) is true, and 10 percent of the time it is false.

with h5py.File(inputFile, 'r') as f1:
    with h5py.File(inputFile2, 'w') as f2:
        ds=f1["MyDataset"].value
        myList=[list of Indices that must not be given the new_value]

        new_value=1e-20
        for index,val in     np.ndenumerate(ds):
            if val==1.0 :
                id=index[0]+1
                if id not in myList:
                    ds[index]=new_value
           
        dset1 = f2.create_dataset("Cell Ids", data=cellID_ds)  
        dset2 = f2.create_dataset("Porosity", data=poros_ds) 

My numpy array has 16M elements and it takes 9 hrs to run. If I comment out my if statement (if id not in myList:), it only takes 5 minutes to run.

Is there any way that I can optimize this if statement?

Thank you very much in advance for your help.

Best Regards,

#94436 — Re: Optimizing if statement check over a numpy value

From	MRAB <python@mrabarnett.plus.com>
Date	2015-07-23 10:55 +0100
Subject	Re: Optimizing if statement check over a numpy value
Message-ID	<mailman.906.1437645368.3674.python-list@python.org>
In reply to	#94435

On 2015-07-23 10:21, Heli Nix wrote:
> Dear all,
>
> I have the following piece of code. I am reading a numpy dataset from an hdf5 file and I am changing values to a new value if they equal 1.
>
>   There is 90 percent chance that (if id not in myList:) is true and in 10 percent of time is false.
>
> with h5py.File(inputFile, 'r') as f1:
>      with h5py.File(inputFile2, 'w') as f2:
>          ds=f1["MyDataset"].value
>          myList=[list of Indices that must not be given the new_value]
>
>          new_value=1e-20
>          for index,val in     np.ndenumerate(ds):
>              if val==1.0 :
>                  id=index[0]+1
>                  if id not in myList:
>                      ds[index]=new_value
>
>          dset1 = f2.create_dataset("Cell Ids", data=cellID_ds)
>          dset2 = f2.create_dataset("Porosity", data=poros_ds)
>
> My numpy array has 16M data and it takes 9 hrs to run. If I comment my if statement (if id not in myList:) it only takes 5 minutes to run.
>
> Is there any way that I can optimize this if statement.
>
> Thank you very much in Advance for your help.
>
> Best Regards,
>
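Checking for presence in a list compares against each entry in turn until
a match is found, so the time taken is proportional to the length of the
list (and a miss always scans the whole list).

The time taken to check for presence in a set, however, is constant on
average, regardless of the size of the set.

Replace the list myList with a set, built once before the loop.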
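
A minimal sketch of the difference, assuming made-up ids (the names excluded_list, excluded_set and probe are illustrative and not from the original post; the real ids come from the poster's data):

```python
import timeit

# Hypothetical ids that must keep their value.
excluded_list = list(range(0, 200_000, 2))
# Build the set once, outside any loop, so each lookup is a hash probe.
excluded_set = set(excluded_list)

# An id that is absent: the list has to scan every entry before giving up.
probe = 199_999

list_time = timeit.timeit(lambda: probe in excluded_list, number=200)
set_time = timeit.timeit(lambda: probe in excluded_set, number=200)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Both containers give the same answer for every lookup; only the cost per lookup differs, and the gap grows with the number of ids.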

#94438 — Re: Optimizing if statement check over a numpy value

From	Laura Creighton <lac@openend.se>
Date	2015-07-23 12:13 +0200
Subject	Re: Optimizing if statement check over a numpy value
Message-ID	<mailman.908.1437646424.3674.python-list@python.org>
In reply to	#94435

Take a look at the sorted collection recipe:
http://code.activestate.com/recipes/577197-sortedcollection/

You want myList to be a sorted List.  You want lookups to be fast.

See if that improves things enough for you.  It may be possible to
have better speedups if instead of myList you write myTree and store
the values in a tree, depending on what the values of id are --  it
could be completely useless for you, as well.

Laura
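
A rough sketch of a fast lookup in a sorted list, assuming the ids can be sorted once up front. It uses the standard library's bisect module rather than the linked recipe, and the helper name contains_sorted and the sample ids are hypothetical:

```python
from bisect import bisect_left


def contains_sorted(sorted_ids, target):
    """Return True if target is in sorted_ids, which must be sorted ascending."""
    # Binary search: find the leftmost position where target could be inserted.
    pos = bisect_left(sorted_ids, target)
    # target is present only if that position holds an equal value.
    return pos < len(sorted_ids) and sorted_ids[pos] == target


excluded_sorted = sorted([42, 7, 19, 3, 88])
print(contains_sorted(excluded_sorted, 19))
print(contains_sorted(excluded_sorted, 20))
```

Each lookup needs a number of comparisons proportional to the logarithm of the list length. Simply sorting the list and still writing "id not in myList" does not help, because the "in" operator on a list always scans linearly.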

#94444 — Re: Optimizing if statement check over a numpy value

From	Jeremy Sanders <jeremy@jeremysanders.net>
Date	2015-07-23 13:42 +0200
Message-ID	<mailman.912.1437651747.3674.python-list@python.org>
In reply to	#94435

Heli Nix wrote:

> Is there any way that I can optimize this if statement.

Array processing is much faster in numpy than in a Python loop. Maybe this
is close to what you want:

import numpy as N
# input data
vals = N.array([42, 1, 5, 3.14, 53, 1, 12, 11, 1])
# indices (0-based positions) of items that must not be replaced
exclude = [1]
# convert the index list to a boolean mask
exclbool = N.zeros(vals.shape, dtype=bool)
exclbool[exclude] = True
# do replacement: every 1.0 that is not at an excluded index
ones = vals==1.0
# Note: ~ on a boolean array is numpy.logical_not
vals[ones & (~exclbool)] = 1e-20

I think you'll have to convert your HDF array into a numpy array first, 
using numpy.array().

Jeremy
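
A possible adaptation of the same masking idea to the loop in the original question, where the id is the first index plus one and the array may have more than one dimension. This is only a sketch under those assumptions: the array contents, my_list and the names protected and row_mask are made up for illustration.

```python
import numpy as np

# Stand-in for the dataset read from the HDF5 file (hypothetical values).
ds = np.array([[1.0, 0.3],
               [1.0, 1.0],
               [0.5, 1.0],
               [1.0, 0.2]])
# Hypothetical 1-based ids (first index + 1) that must keep their value.
my_list = [2, 4]
new_value = 1e-20

# Mark protected rows; subtract 1 to turn 1-based ids into 0-based row indices.
protected = np.zeros(ds.shape[0], dtype=bool)
protected[np.asarray(my_list) - 1] = True

# Reshape to (rows, 1, ..., 1) so the row mask broadcasts over the other axes.
row_mask = protected.reshape((-1,) + (1,) * (ds.ndim - 1))

# Replace every 1.0 that does not sit in a protected row.
ds[(ds == 1.0) & ~row_mask] = new_value
print(ds)
```

The comparison and the assignment each run once over the whole array, so no per-element membership test is executed in Python.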

#94733 — Re: Optimizing if statement check over a numpy value

From	Heli Nix <hemla21@gmail.com>
Date	2015-07-29 07:23 -0700
Message-ID	<d0a2be39-9cb6-4dec-92a0-e13e47642b6b@googlegroups.com>
In reply to	#94444

On Thursday, July 23, 2015 at 1:43:00 PM UTC+2, Jeremy Sanders wrote:
> Heli Nix wrote:
> 
> > Is there any way that I can optimize this if statement.
> 
> Array processing is much faster in numpy. Maybe this is close to what you 
> want
> 
> import numpy as N
> # input data
> vals = N.array([42, 1, 5, 3.14, 53, 1, 12, 11, 1])
> # list of items to exclude
> exclude = [1]
> # convert to a boolean array
> exclbool = N.zeros(vals.shape, dtype=bool)
> exclbool[exclude] = True
> # do replacement
> ones = vals==1.0
> # Note: ~ is numpy.logical_not
> vals[ones & (~exclbool)] = 1e-20
> 
> I think you'll have to convert your HDF array into a numpy array first, 
> using numpy.array().
> 
> Jeremy

Dear all, 

I tried the sorted Python list, but this did not really help the runtime.

I haven't had time to check the sorted collections. I solved my runtime problem by using Jeremy's script above.

It was a lifesaver, and it is amazing how powerful numpy is. Thanks a lot, Jeremy, for this. By the way, I did not have to do any array conversion. The array read from the hdf5 file using h5py is already a numpy array.

The runtime over an array of around 16M elements was reduced from around 12 hours (previous script) to 3 seconds using numpy on the same machine.

Thanks a lot for your help,
