From: Algis Kabaila <akabaila@pcug.org.au>
Organization: PCUG - Users Helping Users
To: python-list@python.org
Subject: Re: memory usage multi value hash
Date: Fri, 15 Apr 2011 18:01:45 +1000
User-Agent: KMail/1.13.5 (Linux/2.6.35-25-generic-pae; KDE/4.5.1; i686; ; )
References: <9e79c6fe-ea6c-4849-bf7a-1b596ff37ecc@r35g2000prj.googlegroups.com>
In-Reply-To: <9e79c6fe-ea6c-4849-bf7a-1b596ff37ecc@r35g2000prj.googlegroups.com>
MIME-Version: 1.0
Content-Type: Text/Plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Cc: christian <ozric@web.de>
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.384.1302854925.9059.python-list@python.org>

On Friday 15 April 2011 02:13:51 christian wrote:
> Hello,
> 
> I'm not very experienced in Python. Is there a way to do the
> below more memory-efficiently, and maybe faster?
> I import a 2-column file and then, for every unique value in
> the first column (the key), concatenate the values from the
> second column.
> 
> So the output is something like this:
> A,1,2,3
> B,3,4
> C,9,10,11,12,90,34,322,21
> 
> 
> Thanks in advance & regards,
> Christian
> 
> 
> import csv
> import random
> import sys
> from itertools import groupby
> from operator import itemgetter
> 
> f=csv.reader(open(sys.argv[1]),delimiter=';')
> z=[[i[0],i[1]] for i in f]
> z.sort(key=itemgetter(0))
> mydict = dict((k,','.join(map(itemgetter(1), it)))
>            for k, it in groupby(z, itemgetter(0)))
> del(z)
> 
> f = open(sys.argv[2], 'w')
> for k,v in mydict.iteritems():
>     f.write(v + "\n")
> 
> f.close()

Here are two alternative solutions. The second one, with generators,
is probably the most economical as far as RAM usage is concerned.
Both re-read the file once for each unique key, so they trade
running time for memory.

For your example, data1.txt is taken to be as follows:
A, 1
B, 3
C, 9
A, 2
B, 4
C, 10
A, 3
C, 11
C, 12
C, 90
C, 34
C, 322
C, 21

The "two in one" program is:
#!/usr/bin/env python3
'''generate.py - Example of reading long two column csv list and
sorting. Thread "memory usage multi value hash"
'''

# Pass 1: determine the set of unique column 1 values
unique_set = set()
with open('data1.txt') as f:
    for line in f:
        unique_set.add(line.split(',')[0])
    print(unique_set)

# Pass 2: for each key, rescan the file and collect its column 2 values
with open('data1.txt') as f:
    for x in unique_set:
        ls = [line.split(',')[1].rstrip() for line in f
              if line.split(',')[0].rstrip() == x]
        print(x.rstrip(), ','.join(ls))
        f.seek(0)  # rewind so the next key sees the whole file again

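print('\n Alternative solution with generators')
with open('data1.txt') as f:
    for x in unique_set:
        # Generator expression: values are produced one at a time
        # instead of being collected into a list first
        gs = (line.split(',')[1].rstrip() for line in f
              if line.split(',')[0].rstrip() == x)
        s = ''.join(gs)
        print(x.rstrip(), s)
        f.seek(0)  # rewind so the next key sees the whole file again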

The output is:
{'A', 'C', 'B'}
A  1, 2, 3
C  9, 10, 11, 12, 90, 34, 322, 21
B  3, 4

 Alternative solution with generators
A  1 2 3
C  9 10 11 12 90 34 322 21
B  3 4
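
If the comma-separated form from the original question (A,1,2,3) is
wanted, the generator approach could be wrapped up roughly as below.
This is only a sketch: the names values_for, grouped_lines and SAMPLE
are made up for illustration, the keys are sorted (my assumption, to
get a stable order), and io.StringIO stands in for the real file.

```python
import io

# Shortened stand-in for data1.txt (hypothetical sample)
SAMPLE = "A, 1\nB, 3\nC, 9\nA, 2\nB, 4\nC, 10\nA, 3\n"


def values_for(f, key):
    """Yield the column 2 values of the rows whose column 1 equals key."""
    f.seek(0)  # rescan from the start for every key
    for line in f:
        parts = line.split(',')
        if len(parts) >= 2 and parts[0].strip() == key:
            yield parts[1].strip()


def grouped_lines(f):
    """Yield one 'key,v1,v2,...' string per unique key, keys sorted."""
    f.seek(0)
    keys = sorted({line.split(',')[0].strip() for line in f if line.strip()})
    for key in keys:
        yield ','.join([key, *values_for(f, key)])


if __name__ == '__main__':
    for row in grouped_lines(io.StringIO(SAMPLE)):
        print(row)
    # prints:
    # A,1,2,3
    # B,3,4
    # C,9,10
```

Any seekable text file opened with open() should work in place of the
StringIO object. Only the set of keys and one output row are held in
memory at a time.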

Note that the input rows need not be sorted or grouped by key: each
key still collects all of its values, in the order in which they
appear in the file.
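
For comparison, and not part of the program above, here is a sketch of
the opposite trade-off: read the file only once and keep every value
in a dictionary of lists. It is faster when there are many keys, but
all values stay in memory until the end. The function name group_rows
and the sample list are hypothetical; collections.defaultdict is from
the standard library.

```python
from collections import defaultdict


def group_rows(lines, delimiter=','):
    """Group column 2 values by column 1 key in a single pass."""
    groups = defaultdict(list)
    for line in lines:
        key, sep, value = line.partition(delimiter)
        if not sep:
            continue  # skip blank lines and lines without a delimiter
        groups[key.strip()].append(value.strip())
    return groups


# Shortened stand-in for the lines of data1.txt (hypothetical sample)
sample = ["A, 1", "B, 3", "C, 9", "A, 2", "B, 4", "C, 10", "A, 3"]

for key, values in sorted(group_rows(sample).items()):
    print(','.join([key] + values))
# prints:
# A,1,2,3
# B,3,4
# C,9,10
```

With a real file, something like group_rows(open('data1.txt')) should
do, since an open file iterates over its lines.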

OldAl.

-- 
Algis
http://akabaila.pcug.org.au/StructuralAnalysis.pdf