Opening multiple Files in Different Encoding

Started by	Subhabrata <subhabangalore@gmail.com>
First post	2012-07-10 10:46 -0700
Last post	2012-07-11 23:22 +0000
Articles	6 — 5 participants

  Opening multiple Files in Different Encoding Subhabrata <subhabangalore@gmail.com> - 2012-07-10 10:46 -0700
    Re: Opening multiple Files in Different Encoding MRAB <python@mrabarnett.plus.com> - 2012-07-10 20:26 +0100
    Re: Opening multiple Files in Different Encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-11 06:22 +0000
    Re: Opening multiple Files in Different Encoding subhabangalore@gmail.com - 2012-07-11 11:15 -0700
      Re: Opening multiple Files in Different Encoding Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-11 18:24 -0400
      Re: Opening multiple Files in Different Encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-11 23:22 +0000

#25149 — Opening multiple Files in Different Encoding

From	Subhabrata <subhabangalore@gmail.com>
Date	2012-07-10 10:46 -0700
Subject	Opening multiple Files in Different Encoding
Message-ID	<40633830-78ae-4cc6-8795-de5a352e0fb1@m2g2000pbv.googlegroups.com>

Dear Group,

I kept a good number of files in a folder. Now I want to read all of
them. They are in different formats and different encoding. Using
listdir/glob.glob I am able to find the list but how to open/read or
process them for different encodings?

Can anyone help me out? I am using Python 3.2 on Windows.

Regards,
Subhabrata Banerjee.

#25154

From	MRAB <python@mrabarnett.plus.com>
Date	2012-07-10 20:26 +0100
Message-ID	<mailman.2000.1341948384.4697.python-list@python.org>
In reply to	#25149

On 10/07/2012 18:46, Subhabrata wrote:
> Dear Group,
>
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?
>
> If any one can help me out.I am using Python3.2 on Windows.
>
You could try different encodings. If decoding raises a UnicodeDecodeError,
then it's the wrong encoding. Otherwise just look at the decoding
result and see whether it "looks" OK.

I believe that one method is to look at the frequency distribution of
characters compared with sample texts.
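
A minimal sketch of the trial-and-error approach, not a definitive
implementation. The helper names and the candidate list are assumptions
chosen for illustration. Note that latin-1 maps every possible byte to a
character, so it never raises and belongs last in the list; a clean decode
only means the bytes were valid in that encoding, not that the guess is right.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # Wrong guess for these bytes; try the next candidate.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def read_text_file(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Read a file as bytes, then decode it with the first encoding that fits."""
    with open(path, "rb") as f:
        return decode_with_candidates(f.read(), candidates)
```

The returned encoding name can be logged so that a human can check the
result for files where the guess matters.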

#25165

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-07-11 06:22 +0000
Message-ID	<4ffd1b8a$0$1781$c3e8da3$76491128@news.astraweb.com>
In reply to	#25149

On Tue, 10 Jul 2012 10:46:08 -0700, Subhabrata wrote:

> Dear Group,
> 
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?

open('first file', encoding='utf-8')
open('second file', encoding='latin1')

How you decide which encoding to use is up to you. Perhaps you can keep a 
mapping of {filename: encoding} somewhere.
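
One way the {filename: encoding} mapping might look, as a sketch only. The
function name read_known and the entries in ENCODINGS are placeholders, not
anything from a real project.

```python
import os

# Hypothetical mapping; replace the names and encodings with your own.
ENCODINGS = {
    "first file": "utf-8",
    "second file": "latin1",
}


def read_known(folder, encodings=ENCODINGS):
    """Read each listed file with its recorded encoding; return {name: text}."""
    texts = {}
    for name, encoding in encodings.items():
        path = os.path.join(folder, name)
        with open(path, encoding=encoding) as f:
            texts[name] = f.read()
    return texts
```

The mapping could just as well live in a small CSV or JSON file next to the
data, so that it can be corrected without touching the code.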

Or perhaps you can try auto-detecting the encodings. The chardet module 
should help you there.

-- 
Steven

#25177

From	subhabangalore@gmail.com
Date	2012-07-11 11:15 -0700
Message-ID	<f3fa937e-94c8-4810-8600-82818ab7f10d@googlegroups.com>
In reply to	#25149

On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
> Dear Group,
> 
> I kept a good number of files in a folder. Now I want to read all of
> them. They are in different formats and different encoding. Using
> listdir/glob.glob I am able to find the list but how to open/read or
> process them for different encodings?
> 
> If any one can help me out.I am using Python3.2 on Windows.
> 
> Regards,
> Subhabrata Banerjee.
Dear Group,

No, I generally know glob.glob and encodings, as I work a lot on non-ASCII material. But I recently found an interesting issue: suppose there are .doc, .docx, .txt, .xls and .pdf files with different encodings.
1) First I have to determine the file type on the fly.
2) I cannot assign encoding="..." in advance; whatever the encoding is, I have to read the file.

Any ideas? I am still thinking about it.

Thanks in Advance,
Regards,
Subhabrata Banerjee.

#25181

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2012-07-11 18:24 -0400
Message-ID	<mailman.2017.1342045454.4697.python-list@python.org>
In reply to	#25177

On Wed, 11 Jul 2012 11:15:02 -0700 (PDT), subhabangalore@gmail.com
declaimed the following in gmane.comp.python.general:

> No generally I know the glob.glob or the encodings as I work lot on non-ASCII stuff, but I recently found an interesting issue, suppose there are .doc,.docx,.txt,.xls,.pdf files with different encodings. 
> 1) First I have to determine on the fly the file type.
> 2) I can not assign encoding="..." whatever be the encoding I have to read it.
>

	Many of those are (semi) proprietary formats (M$ Office <G>).

	DOCX (and XLSX) are, as I recall, ZIP-compressed XML formats -- and I
think that also implies UTF-8 (once you manage to decompress them)...
Note that, for a test, I renamed a .docx to .zip and opened it in
PowerArchiver... It generates 19 files in a multi-level tree -- one of
which is named
		[Content_Types].xml
and contains
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types
xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Override PartName="/word/footnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
<Default Extension="rels"
ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
<Override PartName="/word/numbering.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
<Override PartName="/word/styles.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
<Override PartName="/word/endnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
<Override PartName="/docProps/app.xml"
ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="/word/settings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
<Override PartName="/word/footer2.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/docProps/custom.xml"
ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
<Override PartName="/word/footer1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/word/theme/theme1.xml"
ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
<Override PartName="/word/fontTable.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
<Override PartName="/word/webSettings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
<Override PartName="/word/header1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"/>
<Override PartName="/docProps/core.xml"
ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>

	That should also apply to the rest of the new Office document
formats.
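
	A rough sketch of reading the text out of a .docx with only the
standard library, assuming the main part lives at word/document.xml as in
the listing above (typical for Word output, but not checked here). The
function name docx_paragraphs is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in the main document part."""
    with zipfile.ZipFile(path_or_file) as archive:
        # Pass bytes to the parser: it reads the encoding from the XML
        # declaration, so no encoding argument is needed here.
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

	This ignores headers, footers, footnotes and tables-as-structure; it
only shows that no guessing of encodings is involved for this format.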

	Plain DOC format could be a mishmash of three or four binary formats
(Word6 being the last compatible with 16-bit Windows 3.x Word). I
believe one Office version assigned DOC to what were really RTF format
files rather than the binary (yes, binary -- there is no guarantee that
you can find meaningful text without being able to parse a binary file
format).

	PDF contents can be binary compressed; again there is no guarantee
you can find meaningful text without being able to parse the contents.
http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf
(an older version than current standard, I suspect)... Heck, many of the
cheaper PDF conversions basically embed each page as a graphical
(bitmap) image, not as text.

	For the Office documents, if you are running on a Windows system (or
can open them in something like OpenOffice), your best chance is
likely to be to open them programmatically in the application and then
do a "save as..." TXT (for Word) and CSV (for Excel) -- then process the
TXT/CSV files (or save as RTF if that is an option -- that's usually in
whatever the locale specific Windows code page contains, if not plain
ASCII).

	I believe there is a library to read Excel files directly:
http://pypi.python.org/pypi/xlrd/

	For PDF; I don't know if Acrobat Reader supports automation, to
programmatically load and "save as text".
http://p2p.wrox.com/vb-net-2002-2003-basics/39037-acrobat-reader-automation.html
implies an ability to automate on Windows, so using the win32 extension
library or ctypes may give you access to work with the files.


-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

#25182

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-07-11 23:22 +0000
Message-ID	<4ffe0ad0$0$29965$c3e8da3$5496439d@news.astraweb.com>
In reply to	#25177

On Wed, 11 Jul 2012 11:15:02 -0700, subhabangalore wrote:

> On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
>> Dear Group,
>> 
>> I kept a good number of files in a folder. Now I want to read all of
>> them. They are in different formats and different encoding. Using
>> listdir/glob.glob I am able to find the list but how to open/read or
>> process them for different encodings?
>> 
>> If any one can help me out.I am using Python3.2 on Windows.
>> 
>> Regards,
>> Subhabrata Banerjee.
> Dear Group,
> 
> No generally I know the glob.glob or the encodings as I work lot on
> non-ASCII stuff, but I recently found an interesting issue, suppose
> there are .doc,.docx,.txt,.xls,.pdf files with different encodings. 

You can have text files with different encodings, but not the others.

.doc .docx .xls and .pdf are all binary files. You don't specify an 
encoding when you read them, because they aren't text -- encodings are 
for mapping bytes to text, not bytes to binary formats.

In particular, .docx is compressed XML, so once you have uncompressed it, 
the contents are XML, which is *always* UTF-8.

> 1) First I have to determine on the fly the file type. 

Which is a different problem from your first post.

On Windows, you determine the file type using the file extension.

import os
name, ext = os.path.splitext("my_file_name.bmp")

will give you ext = ".bmp".

Then what do you expect to do? You can open the file as a binary blob, 
but what do you expect then?

f = open("my_file_name.bmp", "rb")

Now what do you want to do with it?
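
One thing you can do with the binary blob is look at its first few bytes,
as a cross-check on the extension. This is only a sketch: the function name
sniff and the short labels are invented here, and the table covers just the
container formats mentioned in this thread.

```python
import os

# Leading byte signatures of some common file formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # .docx, .xlsx and other ZIP containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),   # legacy binary .doc, .xls
    (b"{\\rtf", "rtf"),
]


def sniff(path):
    """Return (lowercased extension, label guessed from the leading bytes)."""
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, kind in SIGNATURES:
        if head.startswith(magic):
            return ext, kind
    return ext, "unknown"
```

A result of "unknown" is what you would expect for plain text files, which
have no signature; those are the ones where the encoding question applies.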

> 2) I can not assign
> encoding="..." whatever be the encoding I have to read it.

You can't set the encoding when you open files in binary mode, but binary 
files don't have an encoding.
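
A small sketch that shows the point: open() rejects an encoding argument in
binary mode with a ValueError. The helper try_open is made up for this
demonstration.

```python
def try_open(path, mode, encoding=None):
    """Return 'ok' if open() accepts the arguments, else the error text."""
    try:
        with open(path, mode, encoding=encoding):
            return "ok"
    except ValueError as exc:
        # Raised when an encoding is passed together with a binary mode.
        return "ValueError: {}".format(exc)
```

Calling try_open(path, "rb", encoding="utf-8") on an existing file reports
the ValueError, while the same call with mode "r" opens the file as text.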

-- 
Steven
