memory usage multi value hash

Started by	christian <ozric@web.de>
First post	2011-04-14 09:13 -0700
Last post	2011-04-15 18:01 +1000
Articles	5 — 4 participants

  memory usage multi value hash christian <ozric@web.de> - 2011-04-14 09:13 -0700
    Re: memory usage multi value hash Peter Otten <__peter__@web.de> - 2011-04-14 18:55 +0200
    Re: memory usage multi value hash Terry Reedy <tjreedy@udel.edu> - 2011-04-14 13:28 -0400
      Re: memory usage multi value hash Peter Otten <__peter__@web.de> - 2011-04-15 10:15 +0200
    Re: memory usage multi value hash Algis Kabaila <akabaila@pcug.org.au> - 2011-04-15 18:01 +1000

#3202 — memory usage multi value hash

From	christian <ozric@web.de>
Date	2011-04-14 09:13 -0700
Subject	memory usage multi value hash
Message-ID	<9e79c6fe-ea6c-4849-bf7a-1b596ff37ecc@r35g2000prj.googlegroups.com>

Hello,

I'm not very experienced in Python. Is there a way to do the following
in a more memory-efficient and maybe faster way?
I import a two-column file and then, for every unique value in the
first column (the key), concatenate the values from the second
column.

The output is something like this:
A,1,2,3
B,3,4
C,9,10,11,12,90,34,322,21


Thanks in advance & regards,
Christian


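import csv
import sys
from itertools import groupby
from operator import itemgetter

# Read every row into memory, keeping only the first two columns.
f = csv.reader(open(sys.argv[1]), delimiter=';')
z = [[i[0], i[1]] for i in f]
# groupby() only merges adjacent rows, so sort by key first.
z.sort(key=itemgetter(0))
mydict = dict((k, ','.join(map(itemgetter(1), it)))
              for k, it in groupby(z, itemgetter(0)))
del z

# Write one line per key: the key followed by its joined values.
f = open(sys.argv[2], 'w')
for k, v in mydict.iteritems():
    f.write(k + ',' + v + "\n")

f.close()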

#3207

From	Peter Otten <__peter__@web.de>
Date	2011-04-14 18:55 +0200
Message-ID	<mailman.365.1302800084.9059.python-list@python.org>
In reply to	#3202

christian wrote:

> Hello,
> 
> i'm not very experienced in python. Is there a way doing below more
> memory efficient and maybe faster.
> I import a  2-column file and  then concat for every unique value in
> the first column ( key) the value from the second
> columns.
> 
> So The ouptut is something like that.
> A,1,2,3
> B,3,4
> C,9,10,11,12,90,34,322,21
> 
> 
> Thanks for advance & regards,
> Christian
> 
> 
> import csv
> import random
> import sys
> from itertools import groupby
> from operator import itemgetter
> 
> f=csv.reader(open(sys.argv[1]),delimiter=';')
> z=[[i[0],i[1]] for i in f]
> z.sort(key=itemgetter(0))
> mydict = dict((k,','.join(map(itemgetter(1), it)))
>            for k, it in groupby(z, itemgetter(0)))
> del(z)
> 
> f = open(sys.argv[2], 'w')
> for k,v in mydict.iteritems():
>     f.write(v + "\n")
> 
> f.close()

I don't expect that it matters much, but you don't need to sort your data if 
you use a dictionary anyway:

import csv
import sys

infile, outfile = sys.argv[1:]

d = {}
with open(infile, "rb") as instream:
    for key, value in csv.reader(instream, delimiter=';'):
        d.setdefault(key, [key]).append(value)

with open(outfile, "wb") as outstream:
    csv.writer(outstream).writerows(d.itervalues())
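
For readers on Python 3, here is a minimal sketch of the same idea. It is an adaptation, not the code posted above: the function name merge_values and the use of collections.defaultdict are my own choices, files are opened in text mode with newline="" as the Python 3 csv module expects, and it assumes every row has exactly two columns (anything else raises ValueError on unpacking).

```python
import csv
from collections import defaultdict


def merge_values(infile, outfile, delimiter=";"):
    """Group second-column values by first-column key in one pass.

    merge_values is a hypothetical helper name, not part of the thread.
    """
    groups = defaultdict(list)
    # Build the mapping row by row; the whole file is never held as a list.
    with open(infile, newline="") as instream:
        for key, value in csv.reader(instream, delimiter=delimiter):
            groups[key].append(value)

    # Write one row per key: the key first, then all of its values.
    with open(outfile, "w", newline="") as outstream:
        writer = csv.writer(outstream)
        for key, values in groups.items():
            writer.writerow([key] + values)
```

Called as merge_values(sys.argv[1], sys.argv[2]), it should write one comma-separated line per key, in the order the keys were first seen (dicts keep insertion order on Python 3.7 and later).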

#3208

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-04-14 13:28 -0400
Message-ID	<mailman.366.1302802138.9059.python-list@python.org>
In reply to	#3202

On 4/14/2011 12:55 PM, Peter Otten wrote:

> I don't expect that it matters much, but you don't need to sort your data if
> you use a dictionary anyway:

Which means that one can build the dict line by line, as each is read, 
instead of reading the entire file into memory. So it does matter for 
intermediate memory use.
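
One way to check this on your own data is to compare peak allocations of the two approaches with tracemalloc. The following is only a rough sketch for Python 3: the rows() generator is a made-up stand-in for csv.reader, the function names are hypothetical, and the numbers will depend on the interpreter and the data, so treat them as indicative rather than as a benchmark.

```python
import tracemalloc
from itertools import groupby
from operator import itemgetter


def rows(n_keys=100, per_key=200):
    """Yield (key, value) pairs lazily; a stand-in for csv.reader."""
    for i in range(n_keys * per_key):
        yield "k%d" % (i % n_keys), str(i)


def group_sorted(pairs):
    # Sort-then-group: every row is materialized in a list before grouping.
    z = sorted(pairs, key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(z, key=itemgetter(0))}


def group_streaming(pairs):
    # Dict built row by row: no intermediate list of all rows.
    d = {}
    for key, value in pairs:
        d.setdefault(key, []).append(value)
    return d


def peak_bytes(func):
    """Run func on fresh rows and return (result, peak traced bytes)."""
    tracemalloc.start()
    result = func(rows())
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


sorted_result, sorted_peak = peak_bytes(group_sorted)
stream_result, stream_peak = peak_bytes(group_streaming)
print("sort + groupby peak bytes:", sorted_peak)
print("streaming dict peak bytes:", stream_peak)
```

Both functions produce the same mapping; only the intermediate storage differs. Note that the final dict still holds every value, so the saving is in the temporary list, not in the end result.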

> import csv
> import sys
>
> infile, outfile = sys.argv[1:]
>
> d = {}
> with open(infile, "rb") as instream:
>      for key, value in csv.reader(instream, delimiter=';'):
>          d.setdefault(key, [key]).append(value)
>
> with open(outfile, "wb") as outstream:
>      csv.writer(outstream).writerows(d.itervalues())

-- 
Terry Jan Reedy

#3251

From	Peter Otten <__peter__@web.de>
Date	2011-04-15 10:15 +0200
Message-ID	<io8uq5$5id$1@solani.org>
In reply to	#3208

Terry Reedy wrote:

> On 4/14/2011 12:55 PM, Peter Otten wrote:
> 
>> I don't expect that it matters much, but you don't need to sort your data
>> if you use a dictionary anyway:
> 
> Which means that one can build the dict line by line, as each is read,
> instead of reading the entire file into memory. So it does matter for
> intermediate memory use.

Yes, sorry, that was a bit too much handwaving.

#3249

From	Algis Kabaila <akabaila@pcug.org.au>
Date	2011-04-15 18:01 +1000
Message-ID	<mailman.384.1302854925.9059.python-list@python.org>
In reply to	#3202

On Friday 15 April 2011 02:13:51 christian wrote:
> Hello,
> 
> i'm not very experienced in python. Is there a way doing
> below more memory efficient and maybe faster.
> I import a  2-column file and  then concat for every unique
> value in the first column ( key) the value from the second
> columns.
> 
> So The ouptut is something like that.
> A,1,2,3
> B,3,4
> C,9,10,11,12,90,34,322,21
> 
> 
> Thanks for advance & regards,
> Christian
> 
> 
> import csv
> import random
> import sys
> from itertools import groupby
> from operator import itemgetter
> 
> f=csv.reader(open(sys.argv[1]),delimiter=';')
> z=[[i[0],i[1]] for i in f]
> z.sort(key=itemgetter(0))
> mydict = dict((k,','.join(map(itemgetter(1), it)))
>            for k, it in groupby(z, itemgetter(0)))
> del(z)
> 
> f = open(sys.argv[2], 'w')
> for k,v in mydict.iteritems():
>     f.write(v + "\n")
> 
> f.close()

Here are two alternative solutions. The second one, with generators, is
probably the most economical as far as RAM usage is concerned.

For your example, data1.txt is taken to be as follows:
A, 1
B, 3
C, 9
A, 2
B, 4
C, 10
A, 3
C, 11
C, 12
C, 90
C, 34
C, 322
C, 21

The "two in one" program is:
#!/usr/bin/env python
'''generate.py - Example of reading long two column csv list and
sorting. Thread "memory usage multi value hash"
'''

# Determine a set of unique column 1 values
unique_set = set()
with open('data1.txt') as f:
    for line in f:
        unique_set.add(line.split(',')[0])
    print(unique_set)

# First solution: rescan the file once per key, collecting a list
with open('data1.txt') as f:
    for x in unique_set:
        ls = [line.split(',')[1].rstrip() for line in f
              if line.split(',')[0].rstrip() == x]
        print(x.rstrip(), ','.join(ls))
        f.seek(0)  # rewind for the next key

print ('\n Alternative solution with generators')
with open('data1.txt') as f:
    for x in unique_set:
        # Generator expression: values are produced one at a time
        gs = (line.split(',')[1].rstrip() for line in f
              if line.split(',')[0].rstrip() == x)
        s = ''
        for ds in gs:
            s = s + ds
        print(x.rstrip(), s)
        f.seek(0)  # rewind for the next key

The output is:
{'A', 'C', 'B'}
A  1, 2, 3
C  9, 10, 11, 12, 90, 34, 322, 21
B  3, 4

 Alternative solution with generators
A  1 2 3
C  9 10 11 12 90 34 322 21
B  3 4

Notice that the rows of the data file may come in any order; they do
not need to be sorted or grouped by key beforehand.
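
The same multi-pass idea can be packaged as functions. This is only a sketch under my own assumptions: the names unique_keys, values_for_key and merged_lines are hypothetical, whitespace around fields is stripped, blank lines are skipped, and keys are emitted in sorted order. It keeps at most one key's values in memory at a time, but it pays for that by reading the file once to collect the keys and once more per key.

```python
def unique_keys(path, sep=","):
    """Collect the distinct first-column values in one pass."""
    seen = set()
    with open(path) as f:
        for line in f:
            if line.strip():
                seen.add(line.split(sep, 1)[0].strip())
    return seen


def values_for_key(path, key, sep=","):
    """Lazily yield the second-column values whose first column is key."""
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            k, v = line.split(sep, 1)
            if k.strip() == key:
                yield v.strip()


def merged_lines(path, sep=","):
    """Yield one 'key,v1,v2,...' line per key, rescanning the file per key."""
    for key in sorted(unique_keys(path, sep)):
        yield key + sep + sep.join(values_for_key(path, key, sep))
```

For example, "for line in merged_lines('data1.txt'): print(line)" prints one merged line per key. With many distinct keys the repeated scans get slow, so this trades run time for memory.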

OldAl.

-- 
Algis
http://akabaila.pcug.org.au/StructuralAnalysis.pdf
