

Groups > comp.lang.python > #29565 > unrolled thread

looping in array vs looping in a dic

Started by: giuseppe.amatulli@gmail.com
First post: 2012-09-20 11:31 -0700
Last post: 2012-09-20 16:35 -0700
Articles: 7, 3 participants



Contents

  looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 11:31 -0700
    Re: looping in array vs looping in a dic MRAB <python@mrabarnett.plus.com> - 2012-09-20 20:09 +0100
    Re: looping in array vs looping in a dic Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-20 13:28 -0600
    Re: looping in array vs looping in a dic Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-20 13:29 -0600
      Re: looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 16:35 -0700
        Re: looping in array vs looping in a dic MRAB <python@mrabarnett.plus.com> - 2012-09-21 00:58 +0100
      Re: looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 16:35 -0700

#29565 — looping in array vs looping in a dic

From: giuseppe.amatulli@gmail.com
Date: 2012-09-20 11:31 -0700
Subject: looping in array vs looping in a dic
Message-ID: <007b2d71-3355-4085-b84f-204834b2c8d0@googlegroups.com>
Hi,
I have this Python script that I need to apply to very large arrays (arrays coming from satellite images).
The script works great, but I would like to speed it up.
Most of the computation time is spent in the for loop.
Is there a way to improve that part?
Would it be better to use a dict instead of an np.ndarray for storing the results?
If so, how can I accumulate the sum in a dict (the equivalent of matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col])?
And if a dict is the solution, is it much faster?

Thanks
Giuseppe

import numpy as np
from time import time

# create the arrays

start = time()
valuesRaster = np.random.randint(0, 101, 100).reshape(10, 10)  # upper bound is exclusive
valuesCategory = np.random.randint(1, 11, 100).reshape(10, 10)

elapsed = (time() - start)
print(elapsed , "create the data")

start = time()

categories = np.unique(valuesCategory)
matrix = np.c_[ categories , np.zeros(len(categories))]

elapsed = (time() - start)
print(elapsed , "create the matrix and append a colum zero ")

rows = 10
cols = 10

start = time()

for col in range(0,cols):
    for row in range(0,rows):
        for row_c in range(0,len(matrix)) :
            if valuesCategory[row,col] == matrix[row_c,0] :
                matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col]
                break
elapsed = (time() - start)
print(elapsed , "loop in the  data ")

print (matrix)



#29567

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-09-20 20:09 +0100
Message-ID: <mailman.968.1348168201.27098.python-list@python.org>
In reply to: #29565
On 2012-09-20 19:31, giuseppe.amatulli@gmail.com wrote:
> Hi,
> I have this Python script that I need to apply to very large arrays (arrays coming from satellite images).
> The script works great, but I would like to speed it up.
> Most of the computation time is spent in the for loop.
> Is there a way to improve that part?
> Would it be better to use a dict instead of an np.ndarray for storing the results?
> If so, how can I accumulate the sum in a dict (the equivalent of matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col])?
> And if a dict is the solution, is it much faster?
>
> Thanks
> Giuseppe
>
> import numpy as np
> from time import time
>
> # create the arrays
>
> start = time()
> valuesRaster = np.random.randint(0, 101, 100).reshape(10, 10)  # upper bound is exclusive
> valuesCategory = np.random.randint(1, 11, 100).reshape(10, 10)
>
> elapsed = (time() - start)
> print(elapsed , "create the data")
>
> start = time()
>
> categories = np.unique(valuesCategory)
> matrix = np.c_[ categories , np.zeros(len(categories))]
>
> elapsed = (time() - start)
> print(elapsed , "create the matrix and append a colum zero ")
>
> rows = 10
> cols = 10
>
> start = time()
>
> for col in range(0,cols):
>      for row in range(0,rows):
>          for row_c in range(0,len(matrix)) :
>              if valuesCategory[row,col] == matrix[row_c,0] :
>                  matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col]
>                  break
> elapsed = (time() - start)
> print(elapsed , "loop in the  data ")
>
> print (matrix)
>
If I understand the code correctly, 'matrix' contains the categories in
column 0 and the totals in column 1.

What you're doing is performing a linear search through the categories
and then adding to the corresponding total.

Linear searches are slow because on average you have to search through
half of the list. Using a dict would be much faster (although you
should of course measure it!).

Try something like this:

import numpy as np
from time import time

# Create the arrays.

start = time()

valuesRaster = np.random.randint(0, 101, 100).reshape(10, 10)  # upper bound is exclusive
valuesCategory = np.random.randint(1, 11, 100).reshape(10, 10)

elapsed = time() - start
print(elapsed, "Create the data.")

start = time()

categories = np.unique(valuesCategory)
totals = dict.fromkeys(categories, 0)

elapsed = time() - start
print(elapsed, "Create the totals dict.")

rows = 10  # must match the 10 x 10 arrays created above
cols = 10

start = time()

for col in range(cols):
    for row in range(rows):
        cat = valuesCategory[row, col]
        ras = valuesRaster[row, col]
        totals[cat] += ras

elapsed = time() - start
print(elapsed, "Loop in the data.")

print(totals)
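
One rough way to follow the advice to measure is the standard-library timeit module. The sketch below is not from the thread: the 50 x 50 test arrays, the fixed seed and the function names totals_linear_search and totals_dict are assumptions made for illustration, and the timings it prints will vary by machine.

```python
import timeit

import numpy as np

# Assumed test data: a fixed seed is used only so that runs are repeatable.
rng = np.random.RandomState(0)
values_raster = rng.randint(0, 101, size=(50, 50))
values_category = rng.randint(1, 11, size=(50, 50))
categories = np.unique(values_category)


def totals_linear_search():
    """Scan the category column for every pixel, as in the original loop."""
    matrix = np.c_[categories, np.zeros(len(categories))]
    rows, cols = values_category.shape
    for col in range(cols):
        for row in range(rows):
            for row_c in range(len(matrix)):
                if values_category[row, col] == matrix[row_c, 0]:
                    matrix[row_c, 1] += values_raster[row, col]
                    break
    # Convert to a plain dict so both approaches can be compared directly.
    return {int(cat): int(total) for cat, total in matrix}


def totals_dict():
    """Use one dict lookup per pixel instead of a search."""
    totals = dict.fromkeys(categories.tolist(), 0)
    rows, cols = values_category.shape
    for col in range(cols):
        for row in range(rows):
            totals[int(values_category[row, col])] += int(values_raster[row, col])
    return totals


# Run each version a few times and report the total elapsed seconds.
linear_time = timeit.timeit(totals_linear_search, number=3)
dict_time = timeit.timeit(totals_dict, number=3)
print("linear search:", linear_time)
print("dict lookup:  ", dict_time)
```

Both functions return the same totals, so any difference in the printed times comes from the lookup strategy alone.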



#29569

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-09-20 13:28 -0600
Message-ID: <mailman.969.1348169324.27098.python-list@python.org>
In reply to: #29565
On Thu, Sep 20, 2012 at 1:09 PM, MRAB <python@mrabarnett.plus.com> wrote:
> for col in range(cols):
>     for row in range(rows):
>         cat = valuesCategory[row, col]
>         ras = valuesRaster[row, col]
>         totals[cat] += ras

Expanding on what MRAB wrote, since you probably have far fewer
categories than pixels, you may be able to take better advantage of
numpy's vectorized operations (which are pretty much the whole point
of using numpy in the first place) by looping over the categories
instead:

for cat in categories:
    totals[cat] += np.sum(valuesCategory * (valuesRaster == cat))



#29571

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-09-20 13:29 -0600
Message-ID: <mailman.971.1348169395.27098.python-list@python.org>
In reply to: #29565
On Thu, Sep 20, 2012 at 1:28 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> Expanding on what MRAB wrote, since you probably have far fewer
> categories than pixels, you may be able to take better advantage of
> numpy's vectorized operations (which are pretty much the whole point
> of using numpy in the first place) by looping over the categories
> instead:
>
> for cat in categories:
>     totals[cat] += np.sum(valuesCategory * (valuesRaster == cat))

Of course, that should have read:

for cat in categories:
    totals[cat] += np.sum(valuesRaster * (valuesCategory == cat))
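
For reference, a self-contained version of the corrected per-category loop might look like the sketch below. The 10 x 10 test arrays and the fixed seed are assumptions for illustration. The np.bincount variant at the end is an alternative that was not suggested in the thread; it assumes the category values are small non-negative integers.

```python
import numpy as np

# Assumed test data, the same shape and value ranges as in the original post.
rng = np.random.RandomState(0)
values_raster = rng.randint(0, 101, size=(10, 10))
values_category = rng.randint(1, 11, size=(10, 10))

categories = np.unique(values_category)

# One vectorized pass per category: the boolean mask is 1 where the pixel
# belongs to the category and 0 elsewhere, so the product keeps only the
# raster values of that category before summing.
totals = {}
for cat in categories:
    totals[int(cat)] = int(np.sum(values_raster * (values_category == cat)))

# Alternative (not from the thread): np.bincount sums the weights per integer
# label in a single call. Index i of the result holds the total for label i.
sums = np.bincount(values_category.ravel(), weights=values_raster.ravel())
bincount_totals = {int(cat): int(sums[cat]) for cat in categories}

print(totals)
print(bincount_totals)
```

The loop makes one pass over the whole array per category, so it pays off when there are far fewer categories than pixels.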



#29584

From: giuseppe.amatulli@gmail.com
Date: 2012-09-20 16:35 -0700
Message-ID: <f375e37c-d700-4ca5-b06b-2d195a5644de@googlegroups.com>
In reply to: #29571
Hi Ian and MRAB,
Thanks to your input I have improved the speed of my code. Reading into a dict is definitely faster. I have one more question.
In the dict I calculate the sum of the values, but I also want to count the number of observations, in order to calculate the average at the end.
Should I create a new dict, or is it possible to do this in the same dict?
Here is the final code.
Thanks, Giuseppe
  


rows = dsCategory.RasterYSize
cols = dsCategory.RasterXSize

print("Generating output file %s" %(dst_file))

start = time()

unique=dict()

for irows in xrange(rows):
    valuesRaster=dsRaster.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
    valuesCategory=dsCategory.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
    for icols in xrange(cols):
        if ( valuesRaster[0,icols] != no_data_Raster ) and ( valuesCategory[0,icols] != no_data_Category ) :
            row = valuesCategory[0, icols],valuesRaster[0, icols]
            if row[0] in unique :
                unique[row[0]] += row[1]
            else:
                unique[row[0]] = 0+row[1] # the 0 was added; without it the first observation was counted as 0



#29587

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-09-21 00:58 +0100
Message-ID: <mailman.984.1348185486.27098.python-list@python.org>
In reply to: #29584
On 2012-09-21 00:35, giuseppe.amatulli@gmail.com wrote:
> Hi Ian and MRAB,
> Thanks to your input I have improved the speed of my code. Reading into a dict is definitely faster. I have one more question.
> In the dict I calculate the sum of the values, but I also want to count the number of observations, in order to calculate the average at the end.
> Should I create a new dict, or is it possible to do this in the same dict?
> Here is the final code.
> Thanks, Giuseppe
>
Keep it simple. Use 2 dicts.

>
>
> rows = dsCategory.RasterYSize
> cols = dsCategory.RasterXSize
>
> print("Generating output file %s" %(dst_file))
>
> start = time()
>
> unique=dict()
>
> for irows in xrange(rows):
>      valuesRaster=dsRaster.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
>      valuesCategory=dsCategory.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
>      for icols in xrange(cols):
>          if ( valuesRaster[0,icols] != no_data_Raster ) and ( valuesCategory[0,icols] != no_data_Category ) :
>              row = valuesCategory[0, icols],valuesRaster[0, icols]
>              if row[0] in unique :
>                  unique[row[0]] += row[1]
>              else:
>                  unique[row[0]] = 0+row[1] # this 0 was add if not the first observation was considered = 0
>
You could use defaultdict instead:

from collections import defaultdict

unique = defaultdict(int)
...
            category, raster = valuesCategory[0, icols], valuesRaster[0, icols]
            unique[category] += raster
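
Putting the two suggestions together (two dicts, both of them defaultdicts), the sum, count and average per category could be computed roughly as follows. This is a sketch only: the small lists stand in for one raster row read with ReadAsArray, the no-data values are invented, and Python 3 is assumed so that / is true division.

```python
from collections import defaultdict

# Hypothetical stand-ins for the no-data markers and for one row of each raster.
no_data_raster = -9999
no_data_category = 0
values_category = [[1, 2, 1, 0, 2, 3]]
values_raster = [[10, 20, 30, 40, -9999, 5]]

sums = defaultdict(int)    # running total of raster values per category
counts = defaultdict(int)  # number of valid observations per category

for category, raster in zip(values_category[0], values_raster[0]):
    # Skip pixels where either layer holds its no-data value.
    if raster != no_data_raster and category != no_data_category:
        sums[category] += raster
        counts[category] += 1

# Every key in sums has a count of at least 1, so the division is safe.
averages = {category: sums[category] / counts[category] for category in sums}
print(averages)  # {1: 20.0, 2: 20.0, 3: 5.0}
```

Keeping the sums and counts in separate dicts means each update stays a single += statement, which is the simplicity the "use 2 dicts" advice is after.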




