From: Peter Otten <__peter__@web.de>
Newsgroups: comp.lang.python
Subject: Re: counting unique numpy subarrays
Date: Sat, 05 Dec 2015 00:06:38 +0100
Message-ID: <mailman.213.1449270605.14615.python-list@python.org>
References: <Q1m8y.334924$rR1.113623@fx19.iad>

duncan smith wrote:

> Hello,
>       I'm trying to find a computationally efficient way of identifying
> unique subarrays, counting them and returning an array containing only
> the unique subarrays and a corresponding 1D array of counts. The
> following code works, but is a bit slow.
> 
> ###############
> 
> from collections import Counter
> import numpy
> 
> def bag_data(data):
>     # data (a numpy array) is bagged along axis 0
>     # returns concatenated array and corresponding array of counts
>     vec_shape = data.shape[1:]
>     counts = Counter(tuple(arr.flatten()) for arr in data)
>     data_out = numpy.zeros((len(counts),) + vec_shape)
>     cnts = numpy.zeros((len(counts,)))
>     for i, (tup, cnt) in enumerate(counts.iteritems()):
>         data_out[i] = numpy.array(tup).reshape(vec_shape)
>         cnts[i] =  cnt
>     return data_out, cnts
> 
> ###############
> 
> I've been looking through the numpy docs, but don't seem to be able to
> come up with a clean solution that avoids Python loops. 

Me neither :(

> TIA for any
> useful pointers. Cheers.

Here's what I have so far:

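import numpy

def bag_data(data):
    # counts[i] holds the number of occurrences of data[i], stored at the
    # index of its first appearance; repeats leave their own slot at 0
    counts = numpy.zeros(data.shape[0])
    seen = {}  # raw bytes of a subarray -> index of its first occurrence
    for i, arr in enumerate(data):
        # tobytes() replaces the deprecated tostring() alias
        sarr = arr.tobytes()
        if sarr in seen:
            counts[seen[sarr]] += 1
        else:
            seen[sarr] = i
            counts[i] = 1
    # keep only the first occurrence of each distinct subarray
    nz = counts != 0
    return numpy.compress(nz, data, axis=0), numpy.compress(nz, counts)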
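
If a newer NumPy is available (1.13 or later, as far as I know), numpy.unique() accepts an axis argument and may remove the Python loop entirely. The following is only a sketch under that assumption. The name bag_data_unique is made up for illustration, and unlike the loop above it returns the unique subarrays in sorted order rather than in order of first appearance.

```python
import numpy

def bag_data_unique(data):
    # Hypothetical helper, not from the original thread.
    # Assumes numpy.unique supports the axis keyword (NumPy >= 1.13).
    data = numpy.asarray(data)
    # With axis=0 each data[i] is treated as one element; the result is
    # sorted lexicographically over the flattened subarrays.
    uniques, counts = numpy.unique(data, axis=0, return_counts=True)
    return uniques, counts

# Three 2x2 subarrays, the first and last being identical
data = numpy.array([[[1, 2], [3, 4]],
                    [[5, 6], [7, 8]],
                    [[1, 2], [3, 4]]])
uniques, counts = bag_data_unique(data)
```

Note that counts comes back as an integer array here, whereas the loop version returns floats because it starts from numpy.zeros().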