Fast way of extracting files from various folders

Started by	subhabrata.banerji@gmail.com
First post	2015-05-01 05:28 -0700
Last post	2015-05-02 03:44 -0700
Articles	5 — 3 participants

  Fast way of extracting files from various folders subhabrata.banerji@gmail.com - 2015-05-01 05:28 -0700
    Re: Fast way of extracting files from various folders Irmen de Jong <irmen.NOSPAM@xs4all.nl> - 2015-05-01 18:36 +0200
    Re: Fast way of extracting files from various folders subhabrata.banerji@gmail.com - 2015-05-02 02:00 -0700
    Re: Fast way of extracting files from various folders Peter Otten <__peter__@web.de> - 2015-05-02 11:22 +0200
      Re: Fast way of extracting files from various folders subhabrata.banerji@gmail.com - 2015-05-02 03:44 -0700

#89728 — Fast way of extracting files from various folders

From	subhabrata.banerji@gmail.com
Date	2015-05-01 05:28 -0700
Subject	Fast way of extracting files from various folders
Message-ID	<6ba8934e-2f1a-4bcf-b72a-0dd276182ca2@googlegroups.com>

Dear Group,

I have several million documents in various folders and subfolders on my machine.
I wrote the script below to find all the .doc files and convert them to text, but it is taking too much time.

import os
from fnmatch import fnmatch
import win32com.client

def listallfiles2(n):
    # n is unused in the body.
    root = r'C:\Cand_Res'  # raw string, so the backslash stays literal
    pattern = "*.doc"      # matches the whole name, so .docx is excluded
    list1 = []
    for path, subdirs, files in os.walk(root):
        for name in files:
            if fnmatch(name, pattern):
                file_name1 = os.path.join(path, name)
                try:
                    # Open the file through Word's COM interface.
                    doc = win32com.client.GetObject(file_name1)
                    text = doc.Range().Text
                    text1 = text.encode('ascii', 'ignore')
                    text_word = text1.split()
                    list1.append(text_word)
                    print "It is a Doc file"
                except Exception:
                    print "DOC ISSUE"
    return list1

Converting the documents to text and appending the result to the list takes too much time. Is there any way to do it faster? I am using Python 2.7 on Windows 7 Professional Edition. Apologies for any indentation errors.

I would be grateful if anyone could suggest a solution.

Regards,
Subhabrata Banerjee.

#89742

From	Irmen de Jong <irmen.NOSPAM@xs4all.nl>
Date	2015-05-01 18:36 +0200
Message-ID	<5543ab93$0$2901$e4fe514c@news.xs4all.nl>
In reply to	#89728

On 1-5-2015 14:28, subhabrata.banerji@gmail.com wrote:
> Dear Group,
> 
> I have several million documents in various folders and subfolders on my machine.
> I wrote the script below to find all the .doc files and convert them to text, but it is taking too much time.
> 

[snip]

> Converting the documents to text and appending the result to the list takes too much time. Is there any way to do it faster? I am using Python 2.7 on Windows 7 Professional Edition. Apologies for any indentation errors.
> 
> I would be grateful if anyone could suggest a solution.

Have you profiled and identified the part of your script that is slow?
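
A minimal sketch of how that profiling could be done with the standard-library cProfile module. It is written for Python 3 rather than the Python 2.7 of the original post, and convert(), collect() and profile_collect() are hypothetical names; convert() is only a stand-in for the Word/COM step so that the sketch runs without Word installed.

```python
import cProfile
import io
import os
import pstats


def convert(path):
    # Hypothetical stand-in for the slow Word/COM conversion step.
    with open(path, "rb") as handle:
        return handle.read().split()


def collect(root):
    # Same shape as the posted loop: walk, keep only .doc names, convert.
    words = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".doc"):
                words.append(convert(os.path.join(folder, name)))
    return words


def profile_collect(root, limit=10):
    # Run collect() under cProfile; return its result and a text report
    # listing the `limit` entries with the largest cumulative time.
    profiler = cProfile.Profile()
    result = profiler.runcall(collect, root)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, buffer.getvalue()
```

Running profile_collect() on a small sample folder and printing the report should be enough: if nearly all of the cumulative time sits inside the conversion function, the directory walk and the list handling are not worth optimizing.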

At first sight, though, your Python code, while not optimal, contains no obvious
performance problems. The slow part is most likely the COM interop call to Winword
and fetching the text through that interface. Imagine opening Word for "several
million documents"; no wonder it doesn't perform.

Investigate tools like antiword, wv, docx2txt. I suspect they're quite a bit faster than
relying on Word itself.
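
A rough sketch of what calling such a tool from Python could look like, written for Python 3. It assumes antiword is installed and on the PATH, that it is given the document path as its last argument and prints the plain text to standard output, and that decoding that output as UTF-8 is acceptable; extract_text() and extract_words() are hypothetical names, and the tool parameter exists only so that another converter can be swapped in.

```python
import subprocess


def extract_text(path, tool=("antiword",)):
    # Run the converter as a child process and return what it prints.
    # check=True raises CalledProcessError if the tool exits with an error.
    completed = subprocess.run(
        list(tool) + [path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # Assumption: the output can be decoded as UTF-8; bad bytes are replaced.
    return completed.stdout.decode("utf-8", "replace")


def extract_words(path, tool=("antiword",)):
    # Mirrors the posted script: split the extracted text into words.
    return extract_text(path, tool).split()
```

Each call still starts one process per document, so it is worth timing this on a sample of files before committing to it.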

Irmen

#89748

From	subhabrata.banerji@gmail.com
Date	2015-05-02 02:00 -0700
Message-ID	<fc6342cd-1bcd-4f66-903b-1e99f8fbd4a1@googlegroups.com>
In reply to	#89728

On Friday, May 1, 2015 at 5:58:50 PM UTC+5:30, subhabrat...@gmail.com wrote:
> Dear Group,
> 
> I have several million documents in various folders and subfolders on my machine.
> I wrote the script below to find all the .doc files and convert them to text, but it is taking too much time.
> 
> [snip]
> 
> Converting the documents to text and appending the result to the list takes too much time. Is there any way to do it faster? I am using Python 2.7 on Windows 7 Professional Edition. Apologies for any indentation errors.
> 
> I would be grateful if anyone could suggest a solution.
> 
> Regards,
> Subhabrata Banerjee.

Thanks. You are right, the conversions are what take the time. I will surely check. The rest is okay. Regards, Subhabrata Banerjee.

#89751

From	Peter Otten <__peter__@web.de>
Date	2015-05-02 11:22 +0200
Message-ID	<mailman.3.1430558542.12865.python-list@python.org>
In reply to	#89728

subhabrata.banerji@gmail.com wrote:

> I have several million documents in various folders and subfolders on my
> machine. I wrote the script below to find all the .doc files and convert
> them to text, but it is taking too much time.
> 
> [snip]
> 
> Converting the documents to text and appending the result to the list
> takes too much time. Is there any way to do it faster? I am using Python
> 2.7 on Windows 7 Professional Edition. Apologies for any indentation
> errors.
> 
> I would be grateful if anyone could suggest a solution.

It will not help on the first pass through your documents, but if you write
the words of each Word document to one .txt file per .doc, and the original
files rarely change, you can read from the .txt files when you run your
script again. Just make sure that the .txt is newer than the corresponding
.doc by comparing the file modification times.

In short: use a caching strategy.
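
A minimal sketch of such a cache, written for Python 3. The name cached_words(), the convert callback and the choice of storing the cache next to the document as "<name>.doc.txt" are all assumptions made for illustration; convert stands for whatever slow function turns one .doc into a list of words.

```python
import os


def cached_words(doc_path, convert):
    # Cache file sits next to the document, e.g. "cv.doc" -> "cv.doc.txt",
    # so an unrelated "cv.txt" is never mistaken for a cache entry.
    txt_path = doc_path + ".txt"

    # Fast path: the cache exists and is strictly newer than the document.
    if (os.path.exists(txt_path)
            and os.path.getmtime(txt_path) > os.path.getmtime(doc_path)):
        with open(txt_path, encoding="utf-8") as handle:
            return handle.read().split()

    # Slow path: run the real conversion, then store the words for next time.
    words = convert(doc_path)
    with open(txt_path, "w", encoding="utf-8") as handle:
        handle.write(" ".join(words))
    return words
```

The strict comparison means a document saved in the same timestamp tick as its cache file is converted again, which costs a little time but never returns stale text.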

#89760

From	subhabrata.banerji@gmail.com
Date	2015-05-02 03:44 -0700
Message-ID	<19811322-c2af-44e0-b84b-fd97127a9f0c@googlegroups.com>
In reply to	#89751

On Saturday, May 2, 2015 at 2:52:32 PM UTC+5:30, Peter Otten wrote:
> subhabrata.banerji@gmail.com wrote:
> 
> > I have several million documents in various folders and subfolders on my
> > machine. I wrote the script below to find all the .doc files and convert
> > them to text, but it is taking too much time.
> > 
> > [snip]
> > 
> > Converting the documents to text and appending the result to the list
> > takes too much time. Is there any way to do it faster? I am using Python
> > 2.7 on Windows 7 Professional Edition. Apologies for any indentation
> > errors.
> > 
> > I would be grateful if anyone could suggest a solution.
> 
> It will not help on the first pass through your documents, but if you write
> the words of each Word document to one .txt file per .doc, and the original
> files rarely change, you can read from the .txt files when you run your
> script again. Just make sure that the .txt is newer than the corresponding
> .doc by comparing the file modification times.
> 
> In short: use a caching strategy.

Thanks Peter. I'll surely check on that. Regards, Subhabrata Banerjee.
