| Path | csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail |
|---|---|
| From | Dan Strohl <D.Strohl@F5.com> |
| Newsgroups | comp.lang.python |
| Subject | RE: Script to extract text from PDF files |
| Date | Fri, 6 Nov 2015 22:46:01 +0000 |
| Lines | 79 |
| Message-ID | <mailman.131.1446996921.16136.python-list@python.org> (permalink) |
| References | <fdbh95$smc$1@solaris.cc.vt.edu> <ebeea6ba-f26b-452d-8a75-1338f3a4a9f6@googlegroups.com> |
It's possible (likely) that I came into this in the middle, so sorry if this was already thrown out... but have you looked at any of the following suggestions?

https://pypi.python.org/pypi?%3Aaction=search&term=pdf+convert&submit=search
http://stackoverflow.com/questions/6413441/python-pdf-library
https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

-----Original Message-----
From: Python-list [mailto:python-list-bounces+d.strohl=f5.com@python.org] On Behalf Of Scott Werner
Sent: Friday, November 06, 2015 2:30 PM
To: python-list@python.org
Subject: Re: Script to extract text from PDF files

On Tuesday, September 25, 2007 at 1:41:56 PM UTC-4, brad wrote:
> I have a very crude Python script that extracts text from some (and I
> emphasize some) PDF documents. On many PDF docs, I cannot extract
> text, but this is because I'm doing something wrong. The PDF spec is
> large and complex and there are various ways in which to store and
> encode text. I wanted to post here and ask if anyone is interested in
> helping make the script better, which means it should accurately
> extract text from most any PDF file... not just some.
>
> I know the topic of reading/extracting the text from a PDF document
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an
> option for me. If someone knows of a free native Python app that does
> this now, let me know and I'll use that instead!
>
> So, if other more experienced programmers are interested in helping
> make the script better, please let me know. I can host a website and
> the latest revision and do all of the grunt work.
>
> Thanks,
>
> Brad

As mentioned before, extracting plain text from a PDF document can be hit or miss.
I have tried all the following applications (free/open source) on Arch Linux. Note, I would execute the commands with subprocess and capture stdout or read the plain text file created by the application.

* textract (uses pdftotext) - https://github.com/deanmalmgren/textract
* pdftotext - http://poppler.freedesktop.org/
    - cmd: pdftotext -layout "/path/to/document.pdf" -
    - cmd: pdftotext "/path/to/document.pdf" -
* Calibre - http://calibre-ebook.com/
    - cmd: ebook-convert "/path/to/document.pdf" "/path/to/plain.txt" --no-chapters-in-toc
* AbiWord - http://www.abiword.org/
    - cmd: abiword --to-name=fd://1 --to-TXT "/path/to/document.pdf"
* Apache Tika - https://tika.apache.org/
    - cmd: "/usr/bin/java" -jar "/path/to/standalone/tika-app-1.10.jar" --text-main "/path/to/document.pdf"

For my application, I saw the best results using Apache Tika. However, I do still encounter strange encoding or extraction issues, e.g. "S P A C E D O U T H E A D E R S" and "\nBroken \nHeader\n". I ended up writing a lot of repairing/cleaning methods.

I welcome an improved solution that has some intelligence, like comparing the extracted plain text order to a snapshot of the PDF page using OCR.
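
For anyone wanting to try the subprocess approach described above, here is a minimal sketch, not the code actually used in the post. The command lines are taken from the list above; the function names (pdftotext_command, tika_command, run_and_capture, collapse_spaced_letters, join_broken_lines) are made up for illustration. The two cleanup helpers are only guesses at what a "repairing/cleaning method" might look like: one collapses runs of single letters, the other assumes that a trailing space before a newline marks a heading that was split across lines.

```python
import re
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext command from the post; the trailing "-" writes to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def tika_command(jar_path, pdf_path, java="/usr/bin/java"):
    """Build the Apache Tika command from the post; jar_path is wherever the jar lives."""
    return [java, "-jar", jar_path, "--text-main", pdf_path]


def run_and_capture(cmd, encoding="utf-8"):
    """Run an extractor and return its stdout as text.

    Raises subprocess.CalledProcessError if the tool exits with a non-zero status.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    # errors="replace" keeps going when the tool emits bytes that are not valid text.
    return result.stdout.decode(encoding, errors="replace")


# Three or more single letters separated by spaces, e.g. "H E A D E R S".
_SPACED_RUN = re.compile(r"(?<!\S)(?:[^\W\d_] +){2,}[^\W\d_](?!\S)")


def collapse_spaced_letters(text):
    """Heuristic (an assumption, not from the post): rejoin letter-spaced words.

    Gaps of two or more spaces are treated as word boundaries. If the extractor
    emits only single spaces, the word boundaries cannot be recovered and the
    whole run collapses into one word.
    """
    def _join(match):
        words = re.split(r" {2,}", match.group(0))
        return " ".join(word.replace(" ", "") for word in words)

    return _SPACED_RUN.sub(_join, text)


def join_broken_lines(text):
    """Heuristic (an assumption): a space right before a newline means the line
    was broken mid-heading, so the two pieces are joined with a single space."""
    return re.sub(r" +\n(?=\S)", " ", text)
```

With pdftotext installed, something like join_broken_lines(collapse_spaced_letters(run_and_capture(pdftotext_command("/path/to/document.pdf", layout=True)))) would chain the pieces. Both heuristics can misfire on ordinary text (for example, a line that legitimately ends in a space), so they are best applied only to lines already suspected to be headers.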
Re: Script to extract text from PDF files - Scott Werner <scott.werner.vt@gmail.com> - 2015-11-06 14:29 -0800
RE: Script to extract text from PDF files - Dan Strohl <D.Strohl@F5.com> - 2015-11-06 22:46 +0000