
counting unique numpy subarrays

Started by	duncan smith <duncan@invalid.invalid>
First post	2015-12-04 19:43 +0000
Last post	2015-12-05 00:18 +0000
Articles	5 — 3 participants


  counting unique numpy subarrays duncan smith <duncan@invalid.invalid> - 2015-12-04 19:43 +0000
    RE: counting unique numpy subarrays Albert-Jan Roskam <sjeik_appie@hotmail.com> - 2015-12-04 22:36 +0000
      Re: counting unique numpy subarrays duncan smith <duncan@invalid.invalid> - 2015-12-05 00:13 +0000
    Re: counting unique numpy subarrays Peter Otten <__peter__@web.de> - 2015-12-05 00:06 +0100
      Re: counting unique numpy subarrays duncan smith <duncan@invalid.invalid> - 2015-12-05 00:18 +0000

#100011 — counting unique numpy subarrays

From	duncan smith <duncan@invalid.invalid>
Date	2015-12-04 19:43 +0000
Subject	counting unique numpy subarrays
Message-ID	<Q1m8y.334924$rR1.113623@fx19.iad>

Hello,
      I'm trying to find a computationally efficient way of identifying
unique subarrays, counting them and returning an array containing only
the unique subarrays and a corresponding 1D array of counts. The
following code works, but is a bit slow.

###############

from collections import Counter
import numpy

def bag_data(data):
    # data (a numpy array) is bagged along axis 0
    # returns the array of unique subarrays and a corresponding array of counts
    vec_shape = data.shape[1:]
    # key each subarray by a hashable tuple of its elements
    counts = Counter(tuple(arr.flatten()) for arr in data)
    data_out = numpy.zeros((len(counts),) + vec_shape)
    cnts = numpy.zeros(len(counts))
    # items() works on Python 2 and 3; iteritems() is Python 2 only
    for i, (tup, cnt) in enumerate(counts.items()):
        data_out[i] = numpy.array(tup).reshape(vec_shape)
        cnts[i] = cnt
    return data_out, cnts

###############

I've been looking through the numpy docs, but don't seem to be able to
come up with a clean solution that avoids Python loops. TIA for any
useful pointers. Cheers.

Duncan


#100016

From	Albert-Jan Roskam <sjeik_appie@hotmail.com>
Date	2015-12-04 22:36 +0000
Message-ID	<mailman.208.1449268679.14615.python-list@python.org>
In reply to	#100011

Hi

(Sorry for top-posting)

numpy.ravel is faster than numpy.flatten (it returns a view where possible instead of always copying)
numpy.empty is faster than numpy.zeros (it skips initialising the memory)
numpy.fromiter might be useful to avoid the loop (just a hunch)
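
A minimal sketch of these three suggestions, assuming a small made-up float array `a` (not from the thread). It checks that `ravel` shares memory with a contiguous input while `flatten` does not, fills a `numpy.empty` buffer before reading it, and builds a 1D array with `numpy.fromiter`. Note that `fromiter` only produces 1D output, so it could plausibly fill the counts but not the subarrays themselves.

```python
import numpy

# Hypothetical example data: 4 subarrays of shape (2, 3).
a = numpy.arange(24, dtype=float).reshape(4, 2, 3)

# ravel() returns a view when the memory layout allows it;
# flatten() always returns a copy.
sub = a[0]
ravel_is_view = numpy.shares_memory(sub.ravel(), a)
flatten_is_view = numpy.shares_memory(sub.flatten(), a)

# empty() allocates without initialising, so every element must be
# assigned before it is read; zeros() also fills the buffer with 0.
out = numpy.empty((2,) + a.shape[1:])
out[0] = a[0]
out[1] = a[3]

# fromiter() builds a 1D array from any iterable of scalars.
cnts = numpy.fromiter((c for c in [3, 1]), dtype=float, count=2)
```

Whether any of this makes a measurable difference here is untested; the per-subarray Python loop is likely to dominate the running time.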

Albert-Jan



#100024

From	duncan smith <duncan@invalid.invalid>
Date	2015-12-05 00:13 +0000
Message-ID	<e%p8y.287627$dc2.166641@fx24.iad>
In reply to	#100016

On 04/12/15 22:36, Albert-Jan Roskam wrote:
> Hi
> 
> (Sorry for topposting)
> 
> numpy.ravel is faster than numpy.flatten (no copy)
> numpy.empty is faster than numpy.zeros
> numpy.fromiter might be useful to avoid the loop (just a hunch)
> 
> Albert-Jan
> 

Thanks, I'd forgotten the difference between numpy.flatten and
numpy.ravel. I wasn't even aware of numpy.empty.

Duncan


#100021

From	Peter Otten <__peter__@web.de>
Date	2015-12-05 00:06 +0100
Message-ID	<mailman.213.1449270605.14615.python-list@python.org>
In reply to	#100011

duncan smith wrote:

> I've been looking through the numpy docs, but don't seem to be able to
> come up with a clean solution that avoids Python loops. 

Me neither :(

> TIA for any
> useful pointers. Cheers.

Here's what I have so far:

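def bag_data(data):
    # counts[i] holds the count for the subarray first seen at index i
    counts = numpy.zeros(data.shape[0])
    seen = {}  # raw bytes of a subarray -> index of its first occurrence
    for i, arr in enumerate(data):
        sarr = arr.tobytes()  # tostring() in older NumPy; since deprecated
        if sarr in seen:
            counts[seen[sarr]] += 1
        else:
            seen[sarr] = i
            counts[i] = 1
    # keep only the first occurrence of each subarray and its count
    nz = counts != 0
    return numpy.compress(nz, data, axis=0), numpy.compress(nz, counts)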
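
For readers on a newer NumPy, here is a hedged sketch of a loop-free alternative. It assumes `numpy.unique` with the `axis` and `return_counts` arguments (`axis` needs NumPy 1.13 or later, which postdates this thread). The name `bag_data_unique` and the sample data are made up for illustration; this is not what either poster used.

```python
import numpy

def bag_data_unique(data):
    # Hypothetical helper: lets numpy.unique find the distinct
    # subarrays along axis 0 and count them in one call.
    uniq, counts = numpy.unique(data, axis=0, return_counts=True)
    return uniq, counts

# Made-up sample: three 2x2 subarrays, two of them identical.
data = numpy.array([[[1, 2], [3, 4]],
                    [[0, 0], [0, 0]],
                    [[1, 2], [3, 4]]])
uniq, counts = bag_data_unique(data)
```

Two differences from the loop above: the unique subarrays come back in sorted order rather than first-seen order, and the counts are integers rather than floats.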


#100025

From	duncan smith <duncan@invalid.invalid>
Date	2015-12-05 00:18 +0000
Message-ID	<93q8y.177413$ij2.5605@fx08.iad>
In reply to	#100021

Three times as fast as what I had, and a bit cleaner. Excellent. Cheers.
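
A rough way to check a comparison like this is to time both versions on the same data with `timeit`. The sketch below is an assumption-laden harness, not the measurement made in the thread: the names `bag_counter` and `bag_bytes`, the random test data and the repeat count are all made up, and both functions are adapted to run on Python 3 (`items()`, `tobytes()`). The ratio will vary with the machine, the array shape and the number of duplicates.

```python
import timeit
from collections import Counter
import numpy

def bag_counter(data):
    # The Counter-of-tuples approach, using items() so it runs on Python 3.
    vec_shape = data.shape[1:]
    counts = Counter(tuple(arr.flatten()) for arr in data)
    data_out = numpy.zeros((len(counts),) + vec_shape)
    cnts = numpy.zeros(len(counts))
    for i, (tup, cnt) in enumerate(counts.items()):
        data_out[i] = numpy.array(tup).reshape(vec_shape)
        cnts[i] = cnt
    return data_out, cnts

def bag_bytes(data):
    # The dict-of-bytes approach, with tobytes() in place of tostring().
    counts = numpy.zeros(data.shape[0])
    seen = {}
    for i, arr in enumerate(data):
        key = arr.tobytes()
        if key in seen:
            counts[seen[key]] += 1
        else:
            seen[key] = i
            counts[i] = 1
    nz = counts != 0
    return numpy.compress(nz, data, axis=0), numpy.compress(nz, counts)

def as_dict(out, cnt):
    # Order-independent view of a result, for comparing the two versions.
    return {row.tobytes(): c for row, c in zip(out, cnt)}

# Hypothetical test data: 2000 subarrays of shape (3, 3) with many repeats.
rng = numpy.random.RandomState(0)
data = rng.randint(0, 2, size=(2000, 3, 3)).astype(float)

out_a, cnt_a = bag_counter(data)
out_b, cnt_b = bag_bytes(data)
same = as_dict(out_a, cnt_a) == as_dict(out_b, cnt_b)

t_a = timeit.timeit(lambda: bag_counter(data), number=5)
t_b = timeit.timeit(lambda: bag_bytes(data), number=5)
print("Counter: %.3fs  bytes: %.3fs  same result: %s" % (t_a, t_b, same))
```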

Duncan

