Path: csiph.com!usenet.pasdenom.info!news.albasani.net!newsfeed.freenet.ag!news2.euro.net!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
To: python-list@python.org
From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Subject: Re: Opening multiple Files in Different Encoding
Date: Wed, 11 Jul 2012 18:24:01 -0400
Organization: > Bestiaria Support Staff <
References: <40633830-78ae-4cc6-8795-de5a352e0fb1@m2g2000pbv.googlegroups.com> <f3fa937e-94c8-4810-8600-82818ab7f10d@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2017.1342045454.4697.python-list@python.org>
Lines: 94
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:25181

On Wed, 11 Jul 2012 11:15:02 -0700 (PDT), subhabangalore@gmail.com
declaimed the following in gmane.comp.python.general:

> No generally I know the glob.glob or the encodings as I work lot on non-ASCII stuff, but I recently found an interesting issue, suppose there are .doc,.docx,.txt,.xls,.pdf files with different encodings. 
> 1) First I have to determine on the fly the file type.
> 2) I can not assign encoding="..." whatever be the encoding I have to read it.
>

	Many of those are (semi) proprietary formats (M$ Office <G>).

	DOCX (and XLSX) are, as I recall, ZIP-compressed XML formats -- and I
think that also implies UTF-8 (once you manage to decompress them)...
Note that, for a test, I renamed a .docx to .zip and opened it in
PowerArchiver... The archive holds 19 files in a multi-level tree -- one
of which is named
		[Content_Types].xml
and contains
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types
xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Override PartName="/word/footnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
<Default Extension="rels"
ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
<Override PartName="/word/numbering.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
<Override PartName="/word/styles.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
<Override PartName="/word/endnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
<Override PartName="/docProps/app.xml"
ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="/word/settings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
<Override PartName="/word/footer2.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/docProps/custom.xml"
ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
<Override PartName="/word/footer1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/word/theme/theme1.xml"
ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
<Override PartName="/word/fontTable.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
<Override PartName="/word/webSettings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
<Override PartName="/word/header1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"/>
<Override PartName="/docProps/core.xml"
ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>

	That should also apply to the rest of the new Office document
formats.
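
	The rename-to-.zip test can also be done from Python with the
standard zipfile module. What follows is only a minimal sketch: the
function names are made up for illustration, the UTF-8 default is an
assumption (check the XML declaration of each part), and the demo
package is a tiny stand-in built in memory, not a valid Word document.

```python
import io
import zipfile


def list_parts(package_file):
    """Return the names of all parts stored in a DOCX/XLSX package."""
    with zipfile.ZipFile(package_file) as package:
        return package.namelist()


def read_part(package_file, part_name, encoding="utf-8"):
    """Decompress one part and decode it to text.

    UTF-8 is an assumption; the XML declaration at the top of the part
    names the encoding that was really used.
    """
    with zipfile.ZipFile(package_file) as package:
        return package.read(part_name).decode(encoding)


def make_demo_package():
    """Build a tiny in-memory stand-in for a .docx (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr(
            "[Content_Types].xml",
            '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"/>',
        )
        package.writestr(
            "word/document.xml",
            '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            "<document><t>caf\u00e9</t></document>",
        )
    return buffer.getvalue()
```

	Both functions accept a file name as well as a file-like object, so
with a real document you would pass the path of the .docx directly and
ask for "word/document.xml". What comes back is still raw XML markup;
pulling the visible text out of it needs an XML parser on top.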

	Plain DOC format could be a mishmash of three or four binary formats
(Word6 being the last compatible with 16-bit Windows 3.x Word). I
believe one Office version assigned DOC to what were really RTF format
files rather than the binary (yes, binary -- there is no guarantee that
you can find meaningful text without being able to parse a binary file
format).
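
	Since the extension cannot be trusted (a .doc may really be RTF),
one way to determine the container type on the fly is to look at the
first few bytes of the file. A rough sketch, with hypothetical function
names and short labels of my own choosing; it only identifies the
container, not the text encoding inside it, and very old Word formats
that predate the OLE2 container will come back as "unknown".

```python
# Leading byte signatures of the container formats discussed here.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"{\\rtf", "rtf"),                          # RTF, even if named .doc
    (b"PK\x03\x04", "zip"),                      # docx/xlsx and other ZIP files
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # binary doc/xls container
]


def sniff_format(head):
    """Classify a file from its first bytes; 'unknown' may be plain text."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def sniff_file(path):
    """Read just enough of the file to compare against the signatures."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))
```

	A "zip" result still has to be opened to tell a .docx from a .xlsx
(or from an ordinary ZIP archive), and "ole2" likewise covers both old
Word and old Excel files.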

	PDF contents can be binary compressed; again, there is no guarantee
you can find meaningful text without being able to parse the contents.
http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf
(an older version than current standard, I suspect)... Heck, many of the
cheaper PDF conversions basically embed each page as a graphical
(bitmap) image, not as text.

	For the Office documents, if you are running on a Windows system (or
can open them in something like OpenOffice), your best chance is
likely to open them programmatically in the application and then do a
"save as..." TXT (for Word) and CSV (for Excel) -- then process the
TXT/CSV files (or save as RTF if that is an option -- that's usually in
whatever the locale-specific Windows code page is, if not plain
ASCII).
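
	Processing the exported files then comes down to decoding them
with the right code page. A small sketch using only the standard
library; the function names are hypothetical, and cp1252 is merely an
assumed default (the Western European Windows code page), so substitute
whatever code page the exporting machine actually used.

```python
import csv
import io


def parse_exported_csv(raw_bytes, encoding="cp1252"):
    """Decode the bytes of an exported CSV file and split them into rows."""
    text = raw_bytes.decode(encoding)
    # newline="" lets the csv module deal with the \r\n line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))


def read_exported_csv(path, encoding="cp1252"):
    """Read a CSV file saved from Excel and return its rows as lists."""
    with open(path, "rb") as handle:
        return parse_exported_csv(handle.read(), encoding)
```

	If the wrong code page is chosen, the decode either raises
UnicodeDecodeError or quietly produces the wrong accented characters,
so it is worth checking a few known non-ASCII values after reading.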

	I believe there is a library to read Excel files directly:
http://pypi.python.org/pypi/xlrd/

	For PDF, I don't know if Acrobat Reader supports automation to
programmatically load and "save as text".
http://p2p.wrox.com/vb-net-2002-2003-basics/39037-acrobat-reader-automation.html
implies an ability to automate on Windows, so using the win32 extension
library or ctypes may give you access to work with the files.


-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/