| Started by | giuseppe.amatulli@gmail.com |
|---|---|
| First post | 2012-09-20 11:31 -0700 |
| Last post | 2012-09-20 16:35 -0700 |
| Articles | 7 — 3 participants |
looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 11:31 -0700
Re: looping in array vs looping in a dic MRAB <python@mrabarnett.plus.com> - 2012-09-20 20:09 +0100
Re: looping in array vs looping in a dic Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-20 13:28 -0600
Re: looping in array vs looping in a dic Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-20 13:29 -0600
Re: looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 16:35 -0700
Re: looping in array vs looping in a dic MRAB <python@mrabarnett.plus.com> - 2012-09-21 00:58 +0100
Re: looping in array vs looping in a dic giuseppe.amatulli@gmail.com - 2012-09-20 16:35 -0700
| From | giuseppe.amatulli@gmail.com |
|---|---|
| Date | 2012-09-20 11:31 -0700 |
| Subject | looping in array vs looping in a dic |
| Message-ID | <007b2d71-3355-4085-b84f-204834b2c8d0@googlegroups.com> |
Hi,
I have this Python script that I need to apply to very large arrays (arrays coming from satellite images).
The script works great, but I would like to speed it up.
Most of the computation time is spent in the for loop.
Is there a way to improve that part?
Would it be better to use a dict instead of an np.ndarray for storing the results?
If so, how can I do the sum in a dict (the equivalent of matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col])?
And if a dict is the solution, why is it faster?
Thanks
Giuseppe
import numpy as np
import sys
from time import clock, time
# create the arrays
start = time()
valuesRaster = np.random.random_integers(0, 100, 100).reshape(10, 10)
valuesCategory = np.random.random_integers(1, 10, 100).reshape(10, 10)
elapsed = (time() - start)
print(elapsed , "create the data")
start = time()
categories = np.unique(valuesCategory)
matrix = np.c_[ categories , np.zeros(len(categories))]
elapsed = (time() - start)
print(elapsed , "create the matrix and append a colum zero ")
rows = 10
cols = 10
start = time()
for col in range(0,cols):
    for row in range(0,rows):
        for row_c in range(0,len(matrix)) :
            if valuesCategory[row,col] == matrix[row_c,0] :
                matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col]
                break
elapsed = (time() - start)
print(elapsed , "loop in the data ")
print (matrix)
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-09-20 20:09 +0100 |
| Message-ID | <mailman.968.1348168201.27098.python-list@python.org> |
| In reply to | #29565 |
On 2012-09-20 19:31, giuseppe.amatulli@gmail.com wrote:
> Hi,
> I have this Python script that I need to apply to very large arrays (arrays coming from satellite images).
> The script works great, but I would like to speed it up.
> Most of the computation time is spent in the for loop.
> Is there a way to improve that part?
> Would it be better to use a dict instead of an np.ndarray for storing the results?
> If so, how can I do the sum in a dict (the equivalent of matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col])?
> And if a dict is the solution, why is it faster?
>
> Thanks
> Giuseppe
>
> import numpy as np
> import sys
> from time import clock, time
>
> # create the arrays
>
> start = time()
> valuesRaster = np.random.random_integers(0, 100, 100).reshape(10, 10)
> valuesCategory = np.random.random_integers(1, 10, 100).reshape(10, 10)
>
> elapsed = (time() - start)
> print(elapsed , "create the data")
>
> start = time()
>
> categories = np.unique(valuesCategory)
> matrix = np.c_[ categories , np.zeros(len(categories))]
>
> elapsed = (time() - start)
> print(elapsed , "create the matrix and append a colum zero ")
>
> rows = 10
> cols = 10
>
> start = time()
>
> for col in range(0,cols):
>     for row in range(0,rows):
>         for row_c in range(0,len(matrix)) :
>             if valuesCategory[row,col] == matrix[row_c,0] :
>                 matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col]
>                 break
> elapsed = (time() - start)
> print(elapsed , "loop in the data ")
>
> print (matrix)
>
If I understand the code correctly, 'matrix' contains the categories in
column 0 and the totals in column 1.
What you're doing is performing a linear search through the categories
and then adding to the corresponding total.
Linear searches are slow because on average you have to search through
half of the list. Using a dict would be much faster (although you
should of course measure it!).
Try something like this:
import numpy as np
from time import time
# Create the arrays.
start = time()
valuesRaster = np.random.random_integers(0, 100, 100).reshape(10, 10)
valuesCategory = np.random.random_integers(1, 10, 100).reshape(10, 10)
elapsed = time() - start
print(elapsed, "Create the data.")
start = time()
categories = np.unique(valuesCategory)
totals = dict.fromkeys(categories, 0)
elapsed = time() - start
print(elapsed, "Create the totals dict.")
rows = 10
cols = 10
start = time()
for col in range(cols):
    for row in range(rows):
        cat = valuesCategory[row, col]
        ras = valuesRaster[row, col]
        totals[cat] += ras
elapsed = time() - start
print(elapsed, "Loop in the data.")
print(totals)
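As a rough way to "measure it", the two approaches could be timed side by side. The sketch below is not from the thread: the function names, the 50x50 array size, and the seeded np.random.default_rng generator are assumptions chosen for illustration, and the timings will vary by machine.

```python
import timeit

import numpy as np

# Assumed test data: a seeded generator and 50x50 arrays (not from the thread).
rng = np.random.default_rng(0)
values_raster = rng.integers(0, 101, size=(50, 50))
values_category = rng.integers(1, 11, size=(50, 50))
categories = np.unique(values_category)


def linear_search_totals():
    """Original approach: scan the category column for every pixel."""
    matrix = np.c_[categories, np.zeros(len(categories))]
    rows, cols = values_category.shape
    for col in range(cols):
        for row in range(rows):
            for row_c in range(len(matrix)):
                if values_category[row, col] == matrix[row_c, 0]:
                    matrix[row_c, 1] += values_raster[row, col]
                    break
    return {int(cat): int(total) for cat, total in matrix}


def dict_totals():
    """Dict approach: one hash lookup per pixel instead of a scan."""
    totals = dict.fromkeys(categories.tolist(), 0)
    rows, cols = values_category.shape
    for col in range(cols):
        for row in range(rows):
            totals[int(values_category[row, col])] += int(values_raster[row, col])
    return totals


# Both approaches must agree before their timings are worth comparing.
assert linear_search_totals() == dict_totals()
print("linear search:", timeit.timeit(linear_search_totals, number=3))
print("dict lookup:  ", timeit.timeit(dict_totals, number=3))
```

With only ten categories the gap may be modest; the linear search gets relatively worse as the number of categories grows.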
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-09-20 13:28 -0600 |
| Message-ID | <mailman.969.1348169324.27098.python-list@python.org> |
| In reply to | #29565 |
On Thu, Sep 20, 2012 at 1:09 PM, MRAB <python@mrabarnett.plus.com> wrote:
> for col in range(cols):
>     for row in range(rows):
>         cat = valuesCategory[row, col]
>         ras = valuesRaster[row, col]
>         totals[cat] += ras
Expanding on what MRAB wrote, since you probably have far fewer
categories than pixels, you may be able to take better advantage of
numpy's vectorized operations (which are pretty much the whole point
of using numpy in the first place) by looping over the categories
instead:
for cat in categories:
    totals[cat] += np.sum(valuesCategory * (valuesRaster == cat))
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-09-20 13:29 -0600 |
| Message-ID | <mailman.971.1348169395.27098.python-list@python.org> |
| In reply to | #29565 |
On Thu, Sep 20, 2012 at 1:28 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> Expanding on what MRAB wrote, since you probably have far fewer
> categories than pixels, you may be able to take better advantage of
> numpy's vectorized operations (which are pretty much the whole point
> of using numpy in the first place) by looping over the categories
> instead:
>
> for cat in categories:
>     totals[cat] += np.sum(valuesCategory * (valuesRaster == cat))
Of course, that should have read:
for cat in categories:
    totals[cat] += np.sum(valuesRaster * (valuesCategory == cat))
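If the categories are small non-negative integers, as in the test data here, the per-category loop could possibly be dropped altogether with np.bincount. This is a swapped-in technique, not something proposed in the thread, and the seeded generator and snake_case variable names below are assumptions for illustration.

```python
import numpy as np

# Assumed test data, shaped like the arrays in the original post.
rng = np.random.default_rng(0)
values_raster = rng.integers(0, 101, size=(10, 10))
values_category = rng.integers(1, 11, size=(10, 10))
categories = np.unique(values_category)

# Per-category loop from the corrected post, collected into a fresh dict.
totals = {
    int(cat): int(np.sum(values_raster * (values_category == cat)))
    for cat in categories
}

# Loop-free alternative: bincount adds up the weights that share a label.
# Index i of the result holds the sum (or count) for category i.
sums = np.bincount(values_category.ravel(), weights=values_raster.ravel())
counts = np.bincount(values_category.ravel())

# The two approaches should give the same per-category sums.
for cat in categories:
    assert totals[int(cat)] == sums[cat]
```

np.bincount allocates one slot per integer up to the largest label, so it fits dense category codes better than sparse or very large ones.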
| From | giuseppe.amatulli@gmail.com |
|---|---|
| Date | 2012-09-20 16:35 -0700 |
| Message-ID | <f375e37c-d700-4ca5-b06b-2d195a5644de@googlegroups.com> |
| In reply to | #29571 |
Hi Ian and MRAB,
thanks to your input I have improved the speed of my code. Reading into a dict is definitely faster. I have one more question.
In the dict I calculate the sum of the values, but I also want to count the number of observations, in order to calculate the average at the end.
Should I create a new dict, or is it possible to do it in the same dict?
Here is the final code.
Thanks, Giuseppe
rows = dsCategory.RasterYSize
cols = dsCategory.RasterXSize
print("Generating output file %s" %(dst_file))
start = time()
unique=dict()
for irows in xrange(rows):
    valuesRaster=dsRaster.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
    valuesCategory=dsCategory.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
    for icols in xrange(cols):
        if ( valuesRaster[0,icols] != no_data_Raster ) and ( valuesCategory[0,icols] != no_data_Category ) :
            row = valuesCategory[0, icols],valuesRaster[0, icols]
            if row[0] in unique :
                unique[row[0]] += row[1]
            else:
                unique[row[0]] = 0+row[1] # this 0 was added; without it the first observation was treated as 0
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-09-21 00:58 +0100 |
| Message-ID | <mailman.984.1348185486.27098.python-list@python.org> |
| In reply to | #29584 |
On 2012-09-21 00:35, giuseppe.amatulli@gmail.com wrote:
> Hi Ian and MRAB,
> thanks to your input I have improved the speed of my code. Reading into a dict is definitely faster. I have one more question.
> In the dict I calculate the sum of the values, but I also want to count the number of observations, in order to calculate the average at the end.
> Should I create a new dict, or is it possible to do it in the same dict?
> Here is the final code.
> Thanks, Giuseppe
>
Keep it simple. Use 2 dicts.
>
>
> rows = dsCategory.RasterYSize
> cols = dsCategory.RasterXSize
>
> print("Generating output file %s" %(dst_file))
>
> start = time()
>
> unique=dict()
>
> for irows in xrange(rows):
>     valuesRaster=dsRaster.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
>     valuesCategory=dsCategory.GetRasterBand(1).ReadAsArray(0,irows,cols,1)
>     for icols in xrange(cols):
>         if ( valuesRaster[0,icols] != no_data_Raster ) and ( valuesCategory[0,icols] != no_data_Category ) :
>             row = valuesCategory[0, icols],valuesRaster[0, icols]
>             if row[0] in unique :
>                 unique[row[0]] += row[1]
>             else:
>                 unique[row[0]] = 0+row[1] # this 0 was added; without it the first observation was treated as 0
>
You could use defaultdict instead:
from collections import defaultdict
unique = defaultdict(int)
...
category, raster = valuesCategory[0, icols], valuesRaster[0, icols]
unique[category] += raster
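Putting the two suggestions together (two dicts, each a defaultdict), the average per category might be computed roughly as follows. The pixels list is hypothetical data standing in for the (category, raster) pairs read from one raster row, and the names sums, counts, and averages are illustrative only.

```python
from collections import defaultdict

# Hypothetical (category, raster value) pairs standing in for one raster row.
pixels = [(1, 10), (2, 5), (1, 30), (3, 7), (2, 15)]

sums = defaultdict(int)    # category -> running sum of raster values
counts = defaultdict(int)  # category -> number of observations

for category, raster in pixels:
    sums[category] += raster
    counts[category] += 1

# Every category in sums has at least one observation, so no division by zero.
averages = {category: sums[category] / counts[category] for category in sums}
print(averages)  # {1: 20.0, 2: 10.0, 3: 7.0}
```

Keeping the sum and the count in separate dicts avoids packing both into one value, which matches the "keep it simple" advice above.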