Subject: Re: Optimizing if statement check over a numpy value
To: python-list@python.org
References: <65c45685-dee1-41f8-a16a-7a062f4e7b02@googlegroups.com>
From: MRAB <python@mrabarnett.plus.com>
Date: Thu, 23 Jul 2015 10:55:58 +0100
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0
MIME-Version: 1.0
In-Reply-To: <65c45685-dee1-41f8-a16a-7a062f4e7b02@googlegroups.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.906.1437645368.3674.python-list@python.org>
Lines: 37

On 2015-07-23 10:21, Heli Nix wrote:
> Dear all,
>
> I have the following piece of code. I am reading a numpy dataset from an hdf5 file and I am changing values to a new value if they equal 1.
>
>   There is 90 percent chance that (if id not in myList:) is true and in 10 percent of time is false.
>
> with h5py.File(inputFile, 'r') as f1:
>      with h5py.File(inputFile2, 'w') as f2:
>          ds=f1["MyDataset"].value
>          myList=[list of Indices that must not be given the new_value]
>
>          new_value=1e-20
>          for index,val in     np.ndenumerate(ds):
>              if val==1.0 :
>                  id=index[0]+1
>                  if id not in myList:
>                      ds[index]=new_value
>
>          dset1 = f2.create_dataset("Cell Ids", data=cellID_ds)
>          dset2 = f2.create_dataset("Porosity", data=poros_ds)
>
> My numpy array has 16M data and it takes 9 hrs to run. If I comment my if statement (if id not in myList:) it only takes 5 minutes to run.
>
> Is there any way that I can optimize this if statement.
>
> Thank you very much in Advance for your help.
>
> Best Regards,
>
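When checking for presence in a list, Python has to compare against the
entries one by one until it finds a match, and against every entry when
there is no match. The time taken is proportional to the length of the
list. In your case the id is absent 90 percent of the time, so most of
the checks scan the whole list.

The time taken to check for presence in a set, however, is constant on
average, whatever the size of the set.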
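
As a rough illustration of the difference, the sketch below counts how
many equality comparisons a failed "in" test performs on a list and on a
set. The Counted class and count_comparisons function are made up for
this example, and the exact count for the set is a CPython
implementation detail rather than a guarantee.

```python
class Counted:
    """Hypothetical wrapper that counts equality comparisons."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.comparisons += 1
        return isinstance(other, Counted) and self.value == other.value

    def __hash__(self):
        # Hash on the wrapped value so that equal objects hash equally.
        return hash(self.value)


def count_comparisons(container, probe):
    """Return (found, number of __eq__ calls) for 'probe in container'."""
    Counted.comparisons = 0
    found = probe in container
    return found, Counted.comparisons


items = [Counted(i) for i in range(1000)]
missing = Counted(-5)  # not present, so the list must be scanned in full

list_found, list_cost = count_comparisons(items, missing)
set_found, set_cost = count_comparisons(set(items), missing)

print("list:", list_found, list_cost)
print("set: ", set_found, set_cost)
```

The list needs one comparison per entry, while the set goes straight to
the hash bucket and needs far fewer.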

Replace the list myList with a set.
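
A minimal sketch of that change, assuming myList holds plain integer
ids. The small array and the ids below are made-up stand-ins for your
HDF5 dataset, and "excluded" and "id_" are names I have chosen for the
example.

```python
import numpy as np

# Stand-in data (assumption): in the original post ds is read from
# f1["MyDataset"] and myList holds the 1-based ids that must keep
# their value.
ds = np.array([1.0, 0.5, 1.0, 1.0, 2.0, 1.0])
myList = [1, 4]

excluded = set(myList)  # build the set once, outside the loop
new_value = 1e-20

for index, val in np.ndenumerate(ds):
    if val == 1.0:
        id_ = index[0] + 1       # 1-based id, as in the original code
        if id_ not in excluded:  # constant-time lookup on average
            ds[index] = new_value

print(ds)
```

The loop is otherwise unchanged; only the container used for the
membership test differs.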