Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Dan Strohl <D.Strohl@F5.com>
Newsgroups: comp.lang.python
Subject: RE: Script to extract text from PDF files
Date: Fri, 6 Nov 2015 22:46:01 +0000
Lines: 79
Message-ID: <mailman.131.1446996921.16136.python-list@python.org>
References: <fdbh95$smc$1@solaris.cc.vt.edu> <ebeea6ba-f26b-452d-8a75-1338f3a4a9f6@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Thread-Topic: Script to extract text from PDF files
Thread-Index: AQHRGOLfevuYC/4nPk+Je3OtezMyoJ6Pl0eg
In-Reply-To: <ebeea6ba-f26b-452d-8a75-1338f3a4a9f6@googlegroups.com>
Accept-Language: en-US
Content-Language: en-US
Precedence: list
Xref: csiph.com comp.lang.python:98456

It's possible (likely) that I came into this in the middle, so sorry if
this was already thrown out... but have you looked at any of the following
suggestions?

https://pypi.python.org/pypi?%3Aaction=search&term=pdf+convert&submit=search
http://stackoverflow.com/questions/6413441/python-pdf-library
https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167



-----Original Message-----
From: Python-list [mailto:python-list-bounces+d.strohl=f5.com@python.org] On Behalf Of Scott Werner
Sent: Friday, November 06, 2015 2:30 PM
To: python-list@python.org
Subject: Re: Script to extract text from PDF files

On Tuesday, September 25, 2007 at 1:41:56 PM UTC-4, brad wrote:
> I have a very crude Python script that extracts text from some (and I
> emphasize some) PDF documents. On many PDF docs, I cannot extract
> text, but this is because I'm doing something wrong. The PDF spec is
> large and complex and there are various ways in which to store and
> encode text. I wanted to post here and ask if anyone is interested in
> helping make the script better which means it should accurately
> extract text from most any pdf file... not just some.
>
> I know the topic of reading/extracting the text from a PDF document
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an
> option for me. If someone knows of a free native Python app that does
> this now, let me know and I'll use that instead!
>
> So, if other more experienced programmers are interested in helping
> make the script better, please let me know. I can host a website and
> the latest revision and do all of the grunt work.
>
> Thanks,
>
> Brad

As mentioned before, extracting plain text from a PDF document can be hit
or miss. I have tried all of the following applications (free/open source)
on Arch Linux. Note that I ran each command with subprocess and either
captured stdout or read the plain text file created by the application.

* textract (uses pdftotext)
- https://github.com/deanmalmgren/textract

* pdftotext
- http://poppler.freedesktop.org/
- cmd: pdftotext -layout "/path/to/document.pdf" -
- cmd: pdftotext "/path/to/document.pdf" -

* Calibre
- http://calibre-ebook.com/
- cmd: ebook-convert "/path/to/document.pdf" "/path/to/plain.txt" --no-chapters-in-toc

* AbiWord
- http://www.abiword.org/
- cmd: abiword --to-name=fd://1 --to-TXT "/path/to/document.pdf"

* Apache Tika
- https://tika.apache.org/
- cmd: "/usr/bin/java" -jar "/path/to/standalone/tika-app-1.10.jar" --text-main "/path/to/document.pdf"
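
A minimal sketch of the subprocess approach, using the pdftotext command
from the list above and assuming pdftotext is installed and on the PATH.
The function names run_and_capture and pdftotext_cmd are made up for
illustration, and the UTF-8 decoding and 60 second timeout are assumptions
rather than anything the tools require:

```python
import subprocess
import sys


def run_and_capture(cmd, timeout=60):
    """Run a command (a list of arguments) and return its stdout as text.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # assumption: 60 seconds is enough per document
        check=True,
    )
    # Assumption: the tool writes UTF-8; undecodable bytes are replaced
    # instead of raising, since extraction output is often messy.
    return result.stdout.decode("utf-8", errors="replace")


def pdftotext_cmd(pdf_path, layout=False):
    """Build the pdftotext command line; the final "-" means stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract.py /path/to/document.pdf
    print(run_and_capture(pdftotext_cmd(sys.argv[1], layout=True)))
```

Passing the arguments as a list avoids shell quoting problems with paths
that contain spaces. The same run_and_capture helper should work for the
other commands that write to stdout.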

For my application, I saw the best results using Apache Tika. However, I
do still encounter strange encoding or extraction issues, e.g.
"S P A C E D  O U T  H E A D E R S" and "\nBroken \nHeader\n". I ended up
writing a lot of repairing/cleaning methods.
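
To give an idea of what such a cleaning method can look like, here is a
rough sketch for the two examples above. It is not the code I actually
use; the function names and both heuristics are assumptions made for
illustration (three or more single characters separated by single spaces
are treated as one spaced-out word, and a line ending in whitespace is
treated as broken mid-header). Both heuristics will misfire on some
documents, e.g. on a genuine sequence like "x y z".

```python
import re

# Three or more single word characters separated by single spaces,
# e.g. "S P A C E D". The lookarounds keep the match from starting or
# ending in the middle of a longer word.
_SPACED_WORD = re.compile(r"(?<!\w)(?:\w ){2,}\w(?!\w)")


def collapse_spaced_words(line):
    """Collapse letter-spaced words and the wider gaps between them."""
    collapsed = _SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), line)
    if collapsed == line:
        return line  # nothing matched, so leave the spacing untouched
    # The words were separated by runs of spaces; reduce those to one.
    return re.sub(r" {2,}", " ", collapsed)


def clean_text(text):
    """Apply both repairs to a block of extracted text."""
    # Heuristic: trailing whitespace before a newline means the line was
    # split mid-phrase, so join it with the following non-empty line.
    text = re.sub(r"[ \t]+\n(?=\S)", " ", text)
    return "\n".join(collapse_spaced_words(line) for line in text.split("\n"))


print(repr(clean_text("S P A C E D  O U T  H E A D E R S")))
# 'SPACED OUT HEADERS'
print(repr(clean_text("\nBroken \nHeader\n")))
# '\nBroken Header\n'
```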

I would welcome an improved solution that has some intelligence, such as
comparing the order of the extracted plain text against a snapshot of the
PDF page using OCR.
--
https://mail.python.org/mailman/listinfo/python-list