From: Dan Strohl
Newsgroups: comp.lang.python
Subject: RE: Script to extract text from PDF files
Date: Fri, 6 Nov 2015 22:46:01 +0000

It's possible (likely) that I came into this in the middle, so sorry if this
was already thrown out... but have you looked at any of the following
suggestions?

https://pypi.python.org/pypi?%3Aaction=search&term=pdf+convert&submit=search
http://stackoverflow.com/questions/6413441/python-pdf-library
https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

-----Original Message-----
From: Python-list [mailto:python-list-bounces+d.strohl=f5.com@python.org] On Behalf Of Scott Werner
Sent: Friday, November 06, 2015 2:30 PM
To: python-list@python.org
Subject: Re: Script to extract text from PDF files

On Tuesday, September 25, 2007 at 1:41:56 PM UTC-4, brad wrote:
> I have a very crude Python script that extracts text from some (and I
> emphasize some) PDF documents. On many PDF docs, I cannot extract
> text, but this is because I'm doing something wrong. The PDF spec is
> large and complex and there are various ways in which to store and
> encode text. I wanted to post here and ask if anyone is interested in
> helping make the script better which means it should accurately
> extract text from most any pdf file... not just some.
>
> I know the topic of reading/extracting the text from a PDF document
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an
> option for me. If someone knows of a free native Python app that does
> this now, let me know and I'll use that instead!
>
> So, if other more experienced programmer are interested in helping
> make the script better, please let me know.
> I can host a website and the latest revision and do all of the grunt work.
>
> Thanks,
>
> Brad

As mentioned before, extracting plain text from a PDF document can be hit or
miss. I have tried all of the following applications (free/open source) on
Arch Linux. Note that I ran the commands with subprocess and either captured
stdout or read the plain text file created by the application.

* textract (uses pdftotext) - https://github.com/deanmalmgren/textract
* pdftotext - http://poppler.freedesktop.org/
  - cmd: pdftotext -layout "/path/to/document.pdf" -
  - cmd: pdftotext "/path/to/document.pdf" -
* Calibre - http://calibre-ebook.com/
  - cmd: ebook-convert "/path/to/document.pdf" "/path/to/plain.txt" --no-chapters-in-toc
* AbiWord - http://www.abiword.org/
  - cmd: abiword --to-name=fd://1 --to-TXT "/path/to/document.pdf"
* Apache Tika - https://tika.apache.org/
  - cmd: "/usr/bin/java" -jar "/path/to/standalone/tika-app-1.10.jar" --text-main "/path/to/document.pdf"

For my application, I saw the best results using Apache Tika. However, I do
still encounter strange encoding or extraction issues, e.g.
"S P A C E D O U T H E A D E R S" and "\nBroken \nHeader\n". I ended up
writing a lot of repairing/cleaning methods.

I welcome an improved solution that has some intelligence, like comparing the
order of the extracted plain text to a snapshot of the PDF page using OCR.
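
For anyone who wants to try the subprocess approach and the kind of cleanup
described above, here is a minimal sketch. It is not the code from the post:
the function names (pdftotext_cmd, extract_text, collapse_spaced_letters,
join_broken_lines) are made up for illustration, pdftotext is assumed to be
installed and on PATH, and the two cleanup rules are guessed heuristics for
the two symptoms quoted above, so expect false positives on real documents.

```python
import re
import subprocess


def pdftotext_cmd(pdf_path, layout=True):
    """Build the pdftotext command line; the trailing "-" means write to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=True):
    """Run pdftotext and return what it printed (assumes poppler is installed)."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout.decode("utf-8", errors="replace")


# Heuristic (assumption): three or more single letters separated by single
# spaces are one letter-spaced word, e.g. "H E A D E R S".
_SPACED_RUN = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def collapse_spaced_letters(text):
    """Remove the spaces inside runs of single, space-separated letters."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


def join_broken_lines(text):
    """Heuristic (assumption): a line ending in a space continues on the next line."""
    return re.sub(r" +\n(?=\S)", " ", text)


def clean(text):
    """Apply both repairs to extracted text."""
    return collapse_spaced_letters(join_broken_lines(text))
```

Usage would be along the lines of clean(extract_text("/path/to/document.pdf")).
Note that collapse_spaced_letters cannot recover word boundaries when the
extractor put only a single space between letter-spaced words, so such a
heading comes back as one long word; if the extractor left wider gaps between
words, those gaps are preserved.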