Re: Fast way of extracting files from various folders

From Peter Otten <__peter__@web.de>
Subject Re: Fast way of extracting files from various folders
Date Sat, 02 May 2015 11:22:08 +0200
Newsgroups comp.lang.python
Message-ID <mailman.3.1430558542.12865.python-list@python.org>

subhabrata.banerji@gmail.com wrote:

> I have several millions of documents in several folders and subfolders in
> my machine. I tried to write a script as follows, to extract all the .doc
> files and to convert them in text, but it seems it is taking too much of
> time.
> 
> import os
> from fnmatch import fnmatch
> import win32com.client
> import zipfile, re
> def listallfiles2(n):
>     # Raw string so the backslash is never read as an escape sequence.
>     root = r'C:\Cand_Res'
>     # fnmatch must match the whole name, so "*.doc" already excludes
>     # .docx files; no extra substring checks are needed.
>     pattern = "*.doc"
>     list1=[]
>     for path, subdirs, files in os.walk(root):
>         for name in files:
>             if fnmatch(name, pattern):
>                 file_name1=os.path.join(path, name)
>                 try:
>                     doc = win32com.client.GetObject(file_name1)
>                     text = doc.Range().Text
>                     text1=text.encode('ascii','ignore')
>                     text_word=text1.split()
>                     list1.append(text_word)
>                     print "It is a Doc file"
>                 except Exception:
>                     print "DOC ISSUE"
>     # Return the collected word lists so the caller can use them.
>     return list1
> 
> But it seems it is taking too much of time, to convert to text and to
> append to list. Is there any way I may do it fast? I am using Python2.7 on
> Windows 7 Professional Edition. Apology for any indentation error.
> 
> If any one may kindly suggest a solution.

It will not help the first time through your documents, but if you write the
words of each Word document to one .txt file per .doc, and the original
files rarely change, you can read from the .txt files when you run your
script a second time. Just make sure that the .txt is younger than the
corresponding .doc by checking the files' modification times.

In short: use a caching strategy.
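
A minimal sketch of that idea, under a few assumptions of mine: the cache
file sits next to the document and is named by appending ".txt" to the full
.doc name, and the slow extraction step is passed in as a callable. The names
cached_text and extract_text are hypothetical, not from the original script.
A cache whose timestamp equals the document's is treated as fresh here; use a
strict comparison if your file system has coarse timestamps and the documents
change often.

```python
import io
import os


def cached_text(doc_path, extract_text):
    """Return the text of doc_path, reusing a .txt cache file when it is fresh.

    extract_text is a hypothetical callable that takes a path and returns
    the document text as a unicode string.
    """
    # Assumption: one cache file per document, stored beside it.
    txt_path = doc_path + ".txt"

    # The cache is usable only if it exists and is at least as new as the .doc.
    if (os.path.exists(txt_path)
            and os.path.getmtime(txt_path) >= os.path.getmtime(doc_path)):
        # newline="" disables newline translation, so the text comes back
        # exactly as it was written.
        with io.open(txt_path, encoding="utf-8", newline="") as f:
            return f.read()

    # Cache miss or stale cache: do the slow extraction once and store it.
    text = extract_text(doc_path)
    with io.open(txt_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text
```

In the loop from the question, the direct COM call could then be wrapped,
for example as
cached_text(file_name1, lambda p: win32com.client.GetObject(p).Range().Text),
so that Word is only involved when a .doc is new or has changed.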
