


io module and pdf question

Date Tue, 25 Jun 2013 4:18:44 +0000
From <jyoung79@kc.rr.com>
To python-list@python.org
Subject io module and pdf question
Newsgroups comp.lang.python
Message-ID <mailman.3795.1372133932.3114.python-list@python.org>



I would like to get your opinion on this.  Currently, to get the metadata out of a PDF file, I loop through the raw contents of the file.  I know it's not the greatest idea, but I'm trying to avoid extra modules.

Adobe JavaScript was used to insert the metadata, so the added data looks something like this:

XYZ:colorList="DarkBlue,Yellow"

With Python 2.7, my script successfully loops through the file contents and I'm able to find the line that contains "XYZ:colorList".

However, when I run it with Python 3, it fails with this error:

  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
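
The position in the message may be a clue.  Assuming the file starts with a header of the common form (a "%PDF-1.x" line followed by a comment line of high-bit bytes), byte 10 would be the first byte of that binary marker.  A minimal sketch that reproduces the same failure, using hypothetical header bytes rather than anything taken from the actual file:

```python
# Hypothetical header bytes; an assumption about how many PDF files begin,
# not data read from the file discussed in this post.
header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

error_position = None
try:
    header.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xe2 announces a multi-byte UTF-8 sequence, but the following
    # byte (0xe3) is not a valid continuation byte.
    error_position = exc.start

print(error_position)
```

If that assumption holds, the error says nothing about the metadata itself: the decoder gives up on the second line of the file, long before it reaches "XYZ:colorList".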

I've done some research on this, and it looks like decoding the file as latin-1 works.  I also found that if I use io.open from the io module, the same code runs on both Python 2.7 and 3.3.  For example:

--------------
import io
import os

pdfPath = '~/Desktop/test.pdf'

colorlistData = ''

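# latin-1 maps every byte to a character, so decoding never raises
with io.open(os.path.expanduser(pdfPath), 'r', encoding='latin-1') as f:
    for line in f:
        if 'XYZ:colorList' in line:
            # keep everything after the first occurrence of the key
            colorlistData = line.split('XYZ:colorList', 1)[1]
            break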

print(colorlistData)
--------------
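
A variation on the same idea, as a sketch only: opening the file in binary mode and searching for a bytes pattern avoids choosing an encoding at all.  The function name `find_color_list` and the sample bytes are hypothetical, made up for illustration.  The approach assumes the "XYZ:colorList" text is stored uncompressed on a line of its own, which may not hold for every PDF.

```python
import io
import os
import tempfile


def find_color_list(pdf_path, key=b'XYZ:colorList'):
    """Return the raw bytes after `key` on the first line containing it, else None."""
    # 'rb' yields bytes, so no decoding (and no UnicodeDecodeError) is involved.
    with io.open(os.path.expanduser(pdf_path), 'rb') as f:
        for line in f:
            if key in line:
                return line.split(key, 1)[1].strip()
    return None


# Hypothetical stand-in for a real PDF: a header, a binary comment, one metadata line.
sample = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\nXYZ:colorList="DarkBlue,Yellow"\n'

tmp = tempfile.NamedTemporaryFile(suffix='.pdf', delete=False)
try:
    tmp.write(sample)
    tmp.close()
    result = find_color_list(tmp.name)
finally:
    os.remove(tmp.name)

print(result)
```

The returned value is bytes; it can be decoded afterwards (for example as ASCII) once the interesting part has been isolated.  If the metadata sits inside a compressed stream, a line scan like this will not find it.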

As you can tell, I'm clueless about how exactly this works, and I'm hoping someone can give me some insight on:
1. Is there another way to get metadata out of a PDF without having to install another module?
2. Is it safe to assume PDF files can always be decoded as latin-1 (when reading them this way)?  Is there a chance they could be something else?
3. Is the io module a good way to pursue this?

Thanks for your help!

Jay



