Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!xlned.com!feeder3.xlned.com!newsfeed.xs4all.nl!newsfeed1a.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Peter Otten <__peter__@web.de>
Subject: Re: Finding size of Variable
Date: Wed, 05 Feb 2014 09:27:15 +0100
Organization: None
References: <8e4c1ab1-e65d-483f-ad9d-6933ae2052c3@googlegroups.com> <mailman.6402.1391541507.18130.python-list@python.org> <723729ee-8e74-4d65-aa6f-742051a94101@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7Bit
User-Agent: KNode/4.7.3
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.6417.1391588841.18130.python-list@python.org>
Lines: 137
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:65473

Ayushi Dalmia wrote:

> On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
>> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>> >
>> > Where am I going wrong? What are the alternatives I can try?
>>
>> You've rejected all the alternatives so far without showing your
>> code, or even properly specifying your problem.
>>
>> To get the "total" size of a list of strings, try (untested):
>>
>> a = sys.getsizeof(mylist)
>> for item in mylist:
>>     a += sys.getsizeof(item)
>>
>> This can be high if some of the strings are interned and get
>> counted twice. But you're not likely to get closer without some
>> knowledge of the data objects and where they come from.
>>
>> --
>> DaveA
> 
> Hello Dave,
> 
> I just thought that saving others time is better and hence I explained
> only the subset of my problem. Here is what I am trying to do:
> 
> I am trying to index the current wikipedia dump without using databases
> and create a search engine for Wikipedia documents. Note, I CANNOT USE
> DATABASES. My approach:
> 
> I am parsing the wikipedia pages using SAX Parser, and then, I am dumping
> the words along with the posting list (a list of doc ids in which the word
> is present) into different files after reading 'X' number of pages. Now
> these files may have the same word and hence I need to merge them and
> write the final index again. Now these final indexes must be of limited
> size. This is where I am stuck. I need to
> know how to determine the size of content in a variable before I write
> into the file.
> 
> Here is the code for my merging:
> 
> def mergeFiles(pathOfFolder, countFile):
>     listOfWords={}
>     indexFile={}
>     topOfFile={}
>     flag=[0]*countFile
>     data=defaultdict(list)
>     heap=[]
>     countFinalFile=0
>     for i in xrange(countFile):
>         fileName = pathOfFolder+'\index'+str(i)+'.txt.bz2'
>         indexFile[i]= bz2.BZ2File(fileName, 'rb')
>         flag[i]=1
>         topOfFile[i]=indexFile[i].readline().strip()
>         listOfWords[i] = topOfFile[i].split(' ')
>         if listOfWords[i][0] not in heap:
>             heapq.heappush(heap, listOfWords[i][0])

At this point you have already gone wrong: your heap contains the complete
data, and every "not in heap" test is an O(N) scan of the heap.
This is both slow and consumes a lot of memory. See

http://code.activestate.com/recipes/491285-iterator-merge/

for a sane way to merge sorted data from multiple files. Your code then
becomes (untested):

import bz2
import os

# imerge() is the function from the recipe linked above
with open("outfile.txt", "wb") as outfile:
    infiles = []
    for i in xrange(countFile):
        filename = os.path.join(pathOfFolder, 'index'+str(i)+'.txt.bz2')
        infiles.append(bz2.BZ2File(filename, "rb"))

    # lazily merge the sorted inputs, one line at a time
    outfile.writelines(imerge(*infiles))

    for infile in infiles:
        infile.close()
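
If you'd rather not copy the recipe, the standard library's heapq.merge()
(available since Python 2.6) does the same kind of lazy merge. A minimal
sketch, assuming the input files are already sorted and every line ends with
a newline; merge_sorted_files is a name made up for this example:

```python
import bz2
import heapq
import os


def merge_sorted_files(in_paths, out_path):
    """Merge already-sorted bz2 text files line by line into out_path.

    heapq.merge() pulls lines lazily, so only one pending line per
    input file is held in memory at any time.
    """
    infiles = [bz2.BZ2File(path, "rb") for path in in_paths]
    try:
        with open(out_path, "wb") as outfile:
            outfile.writelines(heapq.merge(*infiles))
    finally:
        # close the inputs even if the merge fails half-way
        for infile in infiles:
            infile.close()
```

Lines are compared as whole byte strings, so entries for the same word from
different files end up next to each other in the output.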

Once you have your data in a single file you can read from that file and do 
the postprocessing you mention below.
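
For that postprocessing step, measuring what you are about to write is
usually simpler than asking how big a variable is: the length of the encoded
line is exactly the number of bytes that end up in the file. A rough sketch
(Python 3, untested against your data), assuming each merged line looks like
'word id id ...' and lines for the same word are adjacent; combined_entries,
write_limited and the final%d.txt file names are made up for this example:

```python
import itertools
import os


def combined_entries(lines):
    """Yield one 'word id id ...' line per word from sorted input lines."""
    rows = (line.split() for line in lines if line.strip())
    for word, group in itertools.groupby(rows, key=lambda row: row[0]):
        doc_ids = [doc_id for row in group for doc_id in row[1:]]
        yield " ".join([word] + doc_ids) + "\n"


def write_limited(entries, folder, max_bytes):
    """Write entries to final0.txt, final1.txt, ... of about max_bytes each.

    An entry is never split; one that is larger than max_bytes gets a
    file of its own. Returns the list of paths written.
    """
    paths = []
    outfile = None
    written = 0
    for entry in entries:
        data = entry.encode("utf-8")
        # Start a new file when the next entry would exceed the limit.
        if outfile is None or (written and written + len(data) > max_bytes):
            if outfile is not None:
                outfile.close()
            path = os.path.join(folder, "final%d.txt" % len(paths))
            paths.append(path)
            outfile = open(path, "wb")
            written = 0
        outfile.write(data)
        written += len(data)
    if outfile is not None:
        outfile.close()
    return paths
```

Feeding it the merged file, e.g. write_limited(combined_entries(merged),
pathOfFolder, max_bytes) with merged being the opened outfile.txt, should
need to hold no more than one word's posting list in memory at a time.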

             
>     while any(flag)==1:
>         temp = heapq.heappop(heap)
>         for i in xrange(countFile):
>             if flag[i]==1:
>                 if listOfWords[i][0]==temp:
> 
>                     //This is where I am stuck. I cannot wait until memory
>                     //error, as I need to do some postprocessing too.
>                     try:
>                         data[temp].extend(listOfWords[i][1:])
>                     except MemoryError:
>                         writeFinalIndex(data, countFinalFile,
>                                         pathOfFolder)
>                         data=defaultdict(list)
>                         countFinalFile+=1
> 
>                     topOfFile[i]=indexFile[i].readline().strip()
>                     if topOfFile[i]=='':
>                             flag[i]=0
>                             indexFile[i].close()
>                             os.remove(pathOfFolder+'\index'+str(i)+'.txt.bz2')
>                     else:
>                         listOfWords[i] = topOfFile[i].split(' ')
>                         if listOfWords[i][0] not in heap:
>                             heapq.heappush(heap, listOfWords[i][0])
>     writeFinalIndex(data, countFinalFile, pathOfFolder)
> 
> countFile is the number of files, and the writeFinalIndex method writes into
> the file.