From: Peter Otten <__peter__@web.de>
Newsgroups: comp.lang.python
Subject: Re: counting unique numpy subarrays
Date: Sat, 05 Dec 2015 00:06:38 +0100

duncan smith wrote:

> Hello,
>       I'm trying to find a computationally efficient way of identifying
> unique subarrays, counting them and returning an array containing only
> the unique subarrays and a corresponding 1D array of counts. The
> following code works, but is a bit slow.
>
> ###############
>
> from collections import Counter
> import numpy
>
> def bag_data(data):
>     # data (a numpy array) is bagged along axis 0
>     # returns concatenated array and corresponding array of counts
>     vec_shape = data.shape[1:]
>     counts = Counter(tuple(arr.flatten()) for arr in data)
>     data_out = numpy.zeros((len(counts),) + vec_shape)
>     cnts = numpy.zeros((len(counts,)))
>     for i, (tup, cnt) in enumerate(counts.iteritems()):
>         data_out[i] = numpy.array(tup).reshape(vec_shape)
>         cnts[i] = cnt
>     return data_out, cnts
>
> ###############
>
> I've been looking through the numpy docs, but don't seem to be able to
> come up with a clean solution that avoids Python loops.

Me neither :(

> TIA for any
> useful pointers. Cheers.

Here's what I have so far:

def bag_data(data):
    counts = numpy.zeros(data.shape[0])
    seen = {}
    for i, arr in enumerate(data):
        sarr = arr.tostring()
        if sarr in seen:
            counts[seen[sarr]] += 1
        else:
            seen[sarr] = i
            counts[i] = 1
    nz = counts != 0
    return numpy.compress(nz, data, axis=0), numpy.compress(nz, counts)
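
For comparison, here is a minimal sketch of a loop-free variant. It is not the approach posted above: it assumes a NumPy release in which numpy.unique() accepts the axis argument (added in 1.13, after this thread), and the names bag_data_unique and sample are made up for illustration. Each subarray is flattened to a row, the unique rows are counted, and the rows are reshaped back to the original subarray shape.

```python
import numpy

def bag_data_unique(data):
    # Hypothetical helper, not from the thread; assumes numpy.unique supports axis=.
    data = numpy.asarray(data)
    # Flatten every subarray along axis 0 into a single row.
    flat = data.reshape(data.shape[0], -1)
    # With axis=0 each row is treated as one element; counts has one entry per unique row.
    uniq, counts = numpy.unique(flat, axis=0, return_counts=True)
    # Restore the original subarray shape.
    return uniq.reshape((-1,) + data.shape[1:]), counts

sample = numpy.array([[[1, 2], [3, 4]],
                      [[5, 6], [7, 8]],
                      [[1, 2], [3, 4]]])
bags, counts = bag_data_unique(sample)
```

Two differences from the dictionary-based version are worth keeping in mind: numpy.unique() returns the unique rows in sorted order rather than in order of first appearance, and the counts come back as integers rather than floats. Separately, newer NumPy releases deprecate tostring() in favour of tobytes(), so the loop version would need that one-word change there.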