Subject: Re: Reading Adobe PDF File
From: Adam Tauno Williams <awilliam@whitemice.org>
To: python-list@python.org
Date: Mon, 30 Jan 2012 08:22:13 -0500
In-Reply-To: <CAMZYqRQPzvuViHAaDxQTEQXgv3eZ-zrznyX=q41URXNbpNmWLQ@mail.gmail.com>
References: <a54dcb32-1ecd-4186-81a7-3a55c275c9b0@4g2000pbz.googlegroups.com> <CAMZYqRQPzvuViHAaDxQTEQXgv3eZ-zrznyX=q41URXNbpNmWLQ@mail.gmail.com>
Reply-To: awilliam@whitemice.org
Newsgroups: comp.lang.python
Message-ID: <mailman.5213.1327929838.27778.python-list@python.org>

On Sat, 2012-01-28 at 21:59 -0800, Chris Rebert wrote:
> On Sat, Jan 28, 2012 at 9:52 PM, Shrewd Investor <cltung@gmail.com> wrote:
> > I have a very large Adobe PDF file.  I was hoping to use a script to
> > extract the information for it.  Is there a way to loop through a PDF
> > file using Python?
> Haven't used it myself, but:
> http://www.unixuser.org/~euske/python/pdfminer/

pdfminer is very prone to hanging and/or crashing.  I haven't yet found a
really reliable way to read text from a PDF.

PyPDF provides a PdfFileReader class with an extractText method.  The
output is indeed the text although it can be a bit thorny to look at.
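
A rough sketch of looping through a PDF page by page with that class, in
case it helps.  PdfFileReader and extractText are the names mentioned
above; getNumPages and getPage are how I recall the pyPdf API, and the
"pyPdf" module name is an assumption about what is installed (newer
forks have renamed these).  The helper names iter_page_text and
dump_pdf_text are made up for illustration, and collapsing whitespace is
just one guess at taming the thorny output.

```python
def iter_page_text(reader):
    """Yield (page_number, text) for each page of a PyPDF-style reader.

    `reader` is assumed to offer getNumPages() and getPage(index), and
    each page is assumed to offer extractText().
    """
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        # The extracted text often has odd line breaks and spacing, so
        # collapse every run of whitespace into a single space.
        text = " ".join(page.extractText().split())
        yield index + 1, text


def dump_pdf_text(path):
    """Print the text of every page in the PDF at `path` (hypothetical helper)."""
    # Imported here so the loop above can be tried without pyPdf installed;
    # the module name is an assumption and may differ on your system.
    from pyPdf import PdfFileReader

    with open(path, "rb") as stream:
        for number, text in iter_page_text(PdfFileReader(stream)):
            print("--- page %d ---" % number)
            print(text)
```

Calling dump_pdf_text("big.pdf") (the file name is only a placeholder)
would print one block per page.  For a very large file, the generator
keeps only one page of text in memory at a time.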

> > Or do I need to find a way to convert a PDF file into a text file?  If
> > so how?
> The pdf2txt.py script from the same package happens to do exactly this.


-- 
System & Network Administrator [ LPI & NCLA ]
<http://www.whitemiceconsulting.com>
OpenGroupware Developer <http://www.opengroupware.us>
Adam Tauno Williams