Re: Fast way of extracting files from various folders

Newsgroups	comp.lang.python
Date	2015-05-02 03:44 -0700
References	<6ba8934e-2f1a-4bcf-b72a-0dd276182ca2@googlegroups.com> <mailman.3.1430558542.12865.python-list@python.org>
Message-ID	<19811322-c2af-44e0-b84b-fd97127a9f0c@googlegroups.com>
Subject	Re: Fast way of extracting files from various folders
From	subhabrata.banerji@gmail.com

On Saturday, May 2, 2015 at 2:52:32 PM UTC+5:30, Peter Otten wrote:
>  wrote:
> 
> > I have several million documents in various folders and subfolders on
> > my machine. I wrote the script below to extract all the .doc files and
> > convert them to text, but it is taking too much time.
> > 
> > import os
> > from fnmatch import fnmatch
> > import win32com.client
> > def listallfiles2(n):
> >     # n is unused; kept so existing calls keep working.
> >     root = r'C:\Cand_Res'  # raw string, so backslashes are taken literally
> >     pattern = "*.doc"  # matches names ending in .doc, so .docx is excluded
> >     list1 = []
> >     for path, subdirs, files in os.walk(root):
> >         for name in files:
> >             if fnmatch(name, pattern):
> >                 file_name1 = os.path.join(path, name)
> >                 try:
> >                     # Open the document through Word's COM interface.
> >                     doc = win32com.client.GetObject(file_name1)
> >                     text = doc.Range().Text
> >                     # Drop non-ASCII characters, then split into words.
> >                     text1 = text.encode('ascii', 'ignore')
> >                     list1.append(text1.split())
> >                     print "It is a Doc file"
> >                 except Exception:
> >                     print "DOC ISSUE"
> >     return list1
> > 
> > Converting to text and appending to the list is taking too much time. Is
> > there any way to do it faster? I am using Python 2.7 on Windows 7
> > Professional Edition. Apologies for any indentation errors.
> > 
> > I would be grateful if anyone could suggest a solution.
> 
> It will not help the first time through your documents, but if you write the
> words from each Word document to one .txt file per .doc, and the original
> files rarely change, you can read from the .txt files when you run your
> script a second time. Just make sure that the .txt is newer than the
> corresponding .doc by checking the file modification times.
> 
> In short: use a caching strategy.

Thanks Peter. I'll surely check on that. Regards, Subhabrata Banerjee.
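
For reference, here is a minimal sketch of the caching strategy Peter describes. It is only one possible shape, not a tested solution for this data set. The names cached_text and extract_text are hypothetical: extract_text stands for any function that takes a .doc path and returns its text as unicode (for example, a wrapper around the win32com calls in the script quoted above). The sketch also assumes it is acceptable to write a .txt file next to each .doc, and that no unrelated .txt with the same base name already exists there.

```python
import io
import os


def cached_text(doc_path, extract_text):
    """Return the text of doc_path, reusing a sibling .txt cache when fresh.

    extract_text is a hypothetical callable: path -> unicode text.
    """
    txt_path = os.path.splitext(doc_path)[0] + ".txt"

    # Treat the cache as valid only if it exists and its modification
    # time is not older than that of the .doc file.
    if (os.path.exists(txt_path)
            and os.path.getmtime(txt_path) >= os.path.getmtime(doc_path)):
        # newline="" disables newline translation, so the text is read
        # back exactly as it was written.
        with io.open(txt_path, "r", encoding="utf-8", newline="") as fh:
            return fh.read()

    # Cache miss or stale cache: do the slow conversion once and store it.
    text = extract_text(doc_path)
    with io.open(txt_path, "w", encoding="utf-8", newline="") as fh:
        fh.write(text)
    return text
```

Inside the os.walk loop, the call would look something like cached_text(file_name1, my_extract).split(), where my_extract is whatever function does the actual Word conversion. As Peter notes, the first run is no faster; only later runs skip the conversion for files that have not changed.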
