
looping in array vs looping in a dic

Started by	giuseppe.amatulli@gmail.com
First post	2012-09-20 11:31 -0700
Last post	2012-09-20 16:35 -0700
Articles	7 — 3 participants


  looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 11:31 -0700
    Re: looping in array vs looping in a dic MRAB <python@mrabarnett.plus.com> - 2012-09-20 20:09 +0100
    Re: looping in array vs looping in a dic Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-20 13:28 -0600
    Re: looping in array vs looping in a dic Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-20 13:29 -0600
      Re: looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 16:35 -0700
        Re: looping in array vs looping in a dic MRAB <python@mrabarnett.plus.com> - 2012-09-21 00:58 +0100
      Re: looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 16:35 -0700

#29565 — looping in array vs looping in a dic

From	giuseppe.amatulli@gmail.com
Date	2012-09-20 11:31 -0700
Subject	looping in array vs looping in a dic
Message-ID	<007b2d71-3355-4085-b84f-204834b2c8d0@googlegroups.com>

Hi,
I have a Python script that I need to apply to very large arrays (arrays coming from satellite images).
The script works great, but I would like to speed it up.
Most of the computation time is spent in the for loops.
Is there a way to improve that part?
Would it be better to use a dict instead of an np.ndarray for saving the results?
If so, how can I do the sum in a dict (like the corresponding matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col])?
And if a dict is the solution, why is it faster?

Thanks
Giuseppe

import numpy as np
from time import time

# create the arrays

start = time()
valuesRaster = np.random.random_integers(0, 100, 100).reshape(10, 10)
valuesCategory = np.random.random_integers(1, 10, 100).reshape(10, 10)

elapsed = (time() - start)
print(elapsed , "create the data")

start = time()

categories = np.unique(valuesCategory)
matrix = np.c_[ categories , np.zeros(len(categories))]

elapsed = (time() - start)
print(elapsed, "create the matrix and append a column of zeros")

rows = 10
cols = 10

start = time()

for col in range(0,cols):
    for row in range(0,rows):
        for row_c in range(0,len(matrix)) :
            if valuesCategory[row,col] == matrix[row_c,0] :
                matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col]
                break
elapsed = (time() - start)
print(elapsed , "loop in the  data ")

print (matrix)


#29567

From	MRAB <python@mrabarnett.plus.com>
Date	2012-09-20 20:09 +0100
Message-ID	<mailman.968.1348168201.27098.python-list@python.org>
In reply to	#29565

On 2012-09-20 19:31, giuseppe.amatulli@gmail.com wrote:
> Hi,
> I have a Python script that I need to apply to very large arrays (arrays coming from satellite images).
> The script works great, but I would like to speed it up.
> Most of the computation time is spent in the for loops.
> Is there a way to improve that part?
> Would it be better to use a dict instead of an np.ndarray for saving the results?
> If so, how can I do the sum in a dict (like the corresponding matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col])?
> And if a dict is the solution, why is it faster?
>
> Thanks
> Giuseppe
>
> import numpy  as  np
> import sys
> from time import clock, time
>
> # create the arrays
>
> start = time()
> valuesRaster = np.random.random_integers(0, 100, 100).reshape(10, 10)
> valuesCategory = np.random.random_integers(1, 10, 100).reshape(10, 10)
>
> elapsed = (time() - start)
> print(elapsed , "create the data")
>
> start = time()
>
> categories = np.unique(valuesCategory)
> matrix = np.c_[ categories , np.zeros(len(categories))]
>
> elapsed = (time() - start)
> print(elapsed, "create the matrix and append a column of zeros")
>
> rows = 10
> cols = 10
>
> start = time()
>
> for col in range(0,cols):
>      for row in range(0,rows):
>          for row_c in range(0,len(matrix)) :
>              if valuesCategory[row,col] == matrix[row_c,0] :
>                  matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col]
>                  break
> elapsed = (time() - start)
> print(elapsed , "loop in the  data ")
>
> print (matrix)
>
If I understand the code correctly, 'matrix' contains the categories in
column 0 and the totals in column 1.

What you're doing is performing a linear search through the categories
and then adding to the corresponding total.

Linear searches are slow because on average you have to search through
half of the list. Using a dict would be much faster (although you
should of course measure it!).

Try something like this:

import numpy as np
from time import time

# Create the arrays.

start = time()

valuesRaster = np.random.random_integers(0, 100, 100).reshape(10, 10)
valuesCategory = np.random.random_integers(1, 10, 100).reshape(10, 10)

elapsed = time() - start
print(elapsed, "Create the data.")

start = time()

categories = np.unique(valuesCategory)
totals = dict.fromkeys(categories, 0)

elapsed = time() - start
print(elapsed, "Create the totals dict.")

rows = 10
cols = 10

start = time()

for col in range(cols):
    for row in range(rows):
        cat = valuesCategory[row, col]
        ras = valuesRaster[row, col]
        totals[cat] += ras

elapsed = time() - start
print(elapsed, "Loop in the data.")

print(totals)
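
As a sanity check on the suggestion above, the following sketch runs both the original linear-search loop and the dict-based loop on the same small random arrays and confirms that they produce the same totals. The seeded RandomState, the snake_case names, and the tolist()/int() conversions (used to get plain Python ints as dict keys) are assumptions made for illustration; they are not part of the original script.

```python
import numpy as np

# Assumed test data: a fixed seed so both loops see identical arrays.
rng = np.random.RandomState(0)
values_raster = rng.randint(0, 101, size=(10, 10))    # values 0..100
values_category = rng.randint(1, 11, size=(10, 10))   # categories 1..10

rows, cols = values_category.shape
categories = np.unique(values_category)

# Original approach: linear search through the category column.
matrix = np.c_[categories, np.zeros(len(categories))]
for row in range(rows):
    for col in range(cols):
        for row_c in range(len(matrix)):
            if values_category[row, col] == matrix[row_c, 0]:
                matrix[row_c, 1] += values_raster[row, col]
                break

# Dict approach: one hash lookup per pixel instead of a search.
totals = dict.fromkeys(categories.tolist(), 0)
for row in range(rows):
    for col in range(cols):
        totals[int(values_category[row, col])] += int(values_raster[row, col])

# Both approaches should agree for every category.
for cat, total in matrix:
    assert totals[int(cat)] == total
print("totals match")
```

Timing the two loops on larger arrays (as MRAB says, measure it) is the only way to know how much the dict helps for a given number of categories.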


#29569

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-09-20 13:28 -0600
Message-ID	<mailman.969.1348169324.27098.python-list@python.org>
In reply to	#29565

On Thu, Sep 20, 2012 at 1:09 PM, MRAB <python@mrabarnett.plus.com> wrote:
> for col in range(cols):
>     for row in range(rows):
>         cat = valuesCategory[row, col]
>         ras = valuesRaster[row, col]
>         totals[cat] += ras

Expanding on what MRAB wrote, since you probably have far fewer
categories than pixels, you may be able to take better advantage of
numpy's vectorized operations (which are pretty much the whole point
of using numpy in the first place) by looping over the categories
instead:

for cat in categories:
    totals[cat] += np.sum(valuesCategory * (valuesRaster == cat))


#29571

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-09-20 13:29 -0600
Message-ID	<mailman.971.1348169395.27098.python-list@python.org>
In reply to	#29565

On Thu, Sep 20, 2012 at 1:28 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> Expanding on what MRAB wrote, since you probably have far fewer
> categories than pixels, you may be able to take better advantage of
> numpy's vectorized operations (which are pretty much the whole point
> of using numpy in the first place) by looping over the categories
> instead:
>
> for cat in categories:
>     totals[cat] += np.sum(valuesCategory * (valuesRaster == cat))

Of course, that should have read:

for cat in categories:
    totals[cat] += np.sum(valuesRaster * (valuesCategory == cat))
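
For reference, a self-contained sketch of the corrected per-category loop is given here, using a boolean mask to select the pixels of each category. It also shows np.bincount with the weights argument as an alternative that avoids the Python loop entirely; that alternative is not something proposed in this thread, and it assumes the category values are small non-negative integers. The seeded test data and the snake_case names are likewise assumptions for illustration.

```python
import numpy as np

# Assumed test data with a fixed seed.
rng = np.random.RandomState(0)
values_raster = rng.randint(0, 101, size=(10, 10))    # values 0..100
values_category = rng.randint(1, 11, size=(10, 10))   # categories 1..10

categories = np.unique(values_category)

# One vectorized pass per category: the boolean mask picks out
# the raster values whose pixel belongs to that category.
totals = {}
for cat in categories.tolist():
    totals[cat] = int(values_raster[values_category == cat].sum())

# Alternative (assumes non-negative integer categories):
# bincount sums the weights that fall into each integer bin.
sums = np.bincount(values_category.ravel(), weights=values_raster.ravel())
counts = np.bincount(values_category.ravel())

for cat in totals:
    assert totals[cat] == sums[cat]
print(totals)
```

The counts array comes almost for free with bincount, which is convenient if per-category averages are needed later.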


#29584

From	giuseppe.amatulli@gmail.com
Date	2012-09-20 16:35 -0700
Message-ID	<f375e37c-d700-4ca5-b06b-2d195a5644de@googlegroups.com>
In reply to	#29571

Hi Ian and MRAB,
thanks to your input I have improved the speed of my code. Reading from a dict is definitely faster. I have one more question.
In the dict I calculate the sum of the values, but I also want to count the number of observations, in order to calculate the average at the end.
Should I create a new dict, or is it possible to do this in the same dict?
Here is the final code.
Thanks, Giuseppe
  


rows = dsCategory.RasterYSize
cols = dsCategory.RasterXSize

print("Generating output file %s" %(dst_file))

start = time()

unique=dict()

for irows in xrange(rows):
    valuesRaster=dsRaster.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
    valuesCategory=dsCategory.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
    for icols in xrange(cols):
        if ( valuesRaster[0,icols] != no_data_Raster ) and ( valuesCategory[0,icols] != no_data_Category ) :
            row = valuesCategory[0, icols],valuesRaster[0, icols]
            if row[0] in unique :
                unique[row[0]] += row[1]
            else:
                unique[row[0]] = 0+row[1] # this 0 was added; otherwise the first observation was considered to be 0


#29587

From	MRAB <python@mrabarnett.plus.com>
Date	2012-09-21 00:58 +0100
Message-ID	<mailman.984.1348185486.27098.python-list@python.org>
In reply to	#29584

On 2012-09-21 00:35, giuseppe.amatulli@gmail.com wrote:
> Hi Ian and MRAB,
> thanks to your input I have improved the speed of my code. Reading from a dict is definitely faster. I have one more question.
> In the dict I calculate the sum of the values, but I also want to count the number of observations, in order to calculate the average at the end.
> Should I create a new dict, or is it possible to do this in the same dict?
> Here is the final code.
> Thanks, Giuseppe
>
Keep it simple. Use 2 dicts.

>
>
> rows = dsCategory.RasterYSize
> cols = dsCategory.RasterXSize
>
> print("Generating output file %s" %(dst_file))
>
> start = time()
>
> unique=dict()
>
> for irows in xrange(rows):
>      valuesRaster=dsRaster.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
>      valuesCategory=dsCategory.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
>      for icols in xrange(cols):
>          if ( valuesRaster[0,icols] != no_data_Raster ) and ( valuesCategory[0,icols] != no_data_Category ) :
>              row = valuesCategory[0, icols],valuesRaster[0, icols]
>              if row[0] in unique :
>                  unique[row[0]] += row[1]
>              else:
>                  unique[row[0]] = 0+row[1] # this 0 was add if not the first observation was considered = 0
>
You could use defaultdict instead:

from collections import defaultdict

unique = defaultdict(int)
...
            category, raster = valuesCategory[0, icols], valuesRaster[0, icols]
            unique[category] += raster
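
Putting the two suggestions together (two dicts, both of them defaultdicts), a minimal sketch of the sum, count, and average calculation might look like the following. The short lists stand in for one row read from each raster, and NO_DATA is a hypothetical no-data value; neither comes from the original code, which reads its rows with ReadAsArray.

```python
from collections import defaultdict

# Hypothetical stand-ins for one category row and one raster row.
NO_DATA = -9999
values_category = [1, 2, 1, 3, NO_DATA, 2]
values_raster = [10, 20, 30, NO_DATA, 50, 40]

sums = defaultdict(int)    # category -> running sum of raster values
counts = defaultdict(int)  # category -> number of valid observations

for category, raster in zip(values_category, values_raster):
    # Skip the pixel if either layer has no data there.
    if category == NO_DATA or raster == NO_DATA:
        continue
    sums[category] += raster
    counts[category] += 1

# float() keeps the division exact under Python 2 as well.
averages = {}
for category in sums:
    averages[category] = sums[category] / float(counts[category])

print(averages)
```

Because a category only enters either dict once a valid observation is seen, counts[category] is never zero when the average is computed.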


