RE: Script to extract text from PDF files

Path	csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From	Dan Strohl <D.Strohl@F5.com>
Newsgroups	comp.lang.python
Subject	RE: Script to extract text from PDF files
Date	Fri, 6 Nov 2015 22:46:01 +0000
Lines	79
Message-ID	<mailman.131.1446996921.16136.python-list@python.org> (permalink)
References	<fdbh95$smc$1@solaris.cc.vt.edu> <ebeea6ba-f26b-452d-8a75-1338f3a4a9f6@googlegroups.com>
Mime-Version	1.0
Content-Type	text/plain; charset="us-ascii"
Content-Transfer-Encoding	quoted-printable
X-Trace	news.uni-berlin.de GOvJuHuZtiyMAP7jvUC0vAwuNbzPhEOHXWGlq/SnuFdw==
Return-Path	<D.Strohl@f5.com>
X-Original-To	python-list@python.org
Delivered-To	python-list@mail.python.org
Thread-Topic	Script to extract text from PDF files
Thread-Index	AQHRGOLfevuYC/4nPk+Je3OtezMyoJ6Pl0eg
In-Reply-To	<ebeea6ba-f26b-452d-8a75-1338f3a4a9f6@googlegroups.com>
Accept-Language	en-US
Content-Language	en-US
X-MS-Has-Attach
X-MS-TNEF-Correlator
x-ms-exchange-transport-fromentityheader	Hosted
x-originating-ip	[192.168.15.239]
X-Mailman-Approved-At	Sun, 08 Nov 2015 10:35:19 -0500
X-BeenThere	python-list@python.org
X-Mailman-Version	2.1.20+
Precedence	list
List-Id	General discussion list for the Python programming language <python-list.python.org>
List-Unsubscribe	<https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive	<http://mail.python.org/pipermail/python-list/>
List-Post	<mailto:python-list@python.org>
List-Help	<mailto:python-list-request@python.org?subject=help>
List-Subscribe	<https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe>
Xref	csiph.com comp.lang.python:98456


It's possible (likely) that I came into this in the middle, so sorry if this was already suggested... but have you looked at any of the following?

https://pypi.python.org/pypi?%3Aaction=search&term=pdf+convert&submit=search
http://stackoverflow.com/questions/6413441/python-pdf-library
https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167



-----Original Message-----
From: Python-list [mailto:python-list-bounces+d.strohl=f5.com@python.org] On Behalf Of Scott Werner
Sent: Friday, November 06, 2015 2:30 PM
To: python-list@python.org
Subject: Re: Script to extract text from PDF files

On Tuesday, September 25, 2007 at 1:41:56 PM UTC-4, brad wrote:
> I have a very crude Python script that extracts text from some (and I 
> emphasize some) PDF documents. On many PDF docs, I cannot extract 
> text, but this is because I'm doing something wrong. The PDF spec is 
> large and complex and there are various ways in which to store and 
> encode text. I wanted to post here and ask if anyone is interested in 
> helping make the script better which means it should accurately 
> extract text from most any pdf file... not just some.
> 
> I know the topic of reading/extracting the text from a PDF document 
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other 
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an 
> option for me. If someone knows of a free native Python app that does 
> this now, let me know and I'll use that instead!
> 
> So, if other more experienced programmers are interested in helping 
> make the script better, please let me know. I can host a website and 
> the latest revision and do all of the grunt work.
> 
> Thanks,
> 
> Brad

As mentioned before, extracting plain text from a PDF document can be hit or miss. I have tried all of the following free/open-source applications on Arch Linux. Note that I executed each command with subprocess and either captured stdout or read the plain-text file the application created.

* textract (uses pdftotext)
- https://github.com/deanmalmgren/textract

* pdftotext
- http://poppler.freedesktop.org/
- cmd: pdftotext -layout "/path/to/document.pdf" -
- cmd: pdftotext "/path/to/document.pdf" -

* Calibre
- http://calibre-ebook.com/
- cmd: ebook-convert "/path/to/document.pdf" "/path/to/plain.txt" --no-chapters-in-toc

* AbiWord
- http://www.abiword.org/
- cmd: abiword --to-name=fd://1 --to-TXT "/path/to/document.pdf"

* Apache Tika
- https://tika.apache.org/
- cmd: "/usr/bin/java" -jar "/path/to/standalone/tika-app-1.10.jar" --text-main "/path/to/document.pdf"
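
The "execute with subprocess and capture stdout" approach can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the tests above: the helper names pdftotext_command and run_and_capture are hypothetical, the UTF-8 decoding is an assumption, and pdftotext must be installed and on the PATH for a real run.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the pdftotext argument list; the trailing "-" sends text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def run_and_capture(cmd, encoding="utf-8"):
    """Run a command and return its stdout as text.

    Raises subprocess.CalledProcessError if the tool exits non-zero.
    The encoding is an assumption; undecodable bytes are replaced.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    return result.stdout.decode(encoding, errors="replace")


# Example use (needs pdftotext and a real file):
#   text = run_and_capture(pdftotext_command("/path/to/document.pdf"))
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces. The same run_and_capture helper should work for the other tools that write to stdout, given their own argument lists.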

For my application, I saw the best results using Apache Tika. However, I do still encounter strange encoding or extraction issues, e.g. "S P A C E D  O U T  H E A D E R S" and "\nBroken \nHeader\n". I ended up writing a lot of repairing/cleaning methods.
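
For illustration, a cleaning pass for the two issues quoted above might look like the sketch below. It is not the set of methods actually written for that application. The function names are hypothetical and both heuristics are assumptions: a run of three or more single characters separated by single spaces is treated as one spaced-out word, and a space before a newline is treated as a wrapped line. Both can misfire on legitimate text.

```python
import re

# Three or more single word characters separated by single spaces, e.g. "O U T".
_SPACED_WORD = re.compile(r"(?<!\S)(?:\w ){2,}\w(?!\S)")

# Trailing space(s) before a newline that is followed by more text
# (assumed here to mark a line that was wrapped mid-header).
_SOFT_BREAK = re.compile(r" +\n(?=\S)")


def collapse_spaced_letters(line):
    """Turn "S P A C E D  O U T" into "SPACED OUT"; leave other lines untouched."""
    fixed, count = _SPACED_WORD.subn(lambda m: m.group(0).replace(" ", ""), line)
    if count == 0:
        return line
    # The wider gaps between spaced-out words are now the word separators.
    return re.sub(r" {2,}", " ", fixed)


def clean_extracted_text(text):
    """Rejoin wrapped lines, then repair spaced-out words line by line."""
    text = _SOFT_BREAK.sub(" ", text)
    return "\n".join(collapse_spaced_letters(line) for line in text.split("\n"))
```

Lines without a spaced-out run are returned unchanged, so column spacing produced by a layout-preserving extraction is not collapsed.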

I would welcome an improved solution with some intelligence, such as comparing the order of the extracted plain text against a snapshot of the PDF page using OCR.
--
https://mail.python.org/mailman/listinfo/python-list


Thread

Re: Script to extract text from PDF files Scott Werner <scott.werner.vt@gmail.com> - 2015-11-06 14:29 -0800
  RE: Script to extract text from PDF files Dan Strohl <D.Strohl@F5.com> - 2015-11-06 22:46 +0000
