Re: pdf index builder

From Arved Sandstrom <asandstrom3minus1@eastlink.ca>
Newsgroups comp.lang.java.programmer
Subject Re: pdf index builder
References <563461764344789594.415272bravegag-hotmail.com@news.individual.net>
Message-ID <yNqDq.11598$c27.562@newsfe22.iad> (permalink)
Organization Public Usenet Newsgroup Access
Date 2011-12-06 11:46 -0400


On 11-12-05 11:03 AM, Giovanni Azua wrote:
> Hello!
> 
> I have the strong need to do the following. Given a set of PDF files
> scattered across multiple directories, build a global index that includes
> for every index term the file names and corresponding pages where such
> index occurs. A really nice to have would be to "parse" formulas but I
> guess these are stored as images ...
> 
> Before I go ahead and build a solution using Apache's PDFBox and/or iText,
> can anyone advise whether such a solution already exists, even a commercial
> one? I googled for this already ...
> 
> My use-case for this is a very critical open book exam but there are no
> books instead a bunch of dense PDF papers and lectures (a lot) if I get
> such index I might get an edge here :)
> 
> TIA,
> Best regards,
> Giovanni
> 
> -- Giovanni

Presumably you don't want to get as high-powered (and costly and
complicated) as something like CBR (content-based retrieval) in IBM
FileNet P8. :-)

AFAIK Alfresco uses PDFBox with Lucene for PDF text extraction and
indexing. If you're in control of the entire Alfresco system you'd have
access to the indexing data in its raw form. But I don't see the point:
if all you want is a global index, I'd simply run PDFBox and Lucene
standalone. Granted, Alfresco is not a complicated install.
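
To make the "global index" idea concrete, here is a rough sketch of the
data structure involved: a map from term to file name to page numbers. It
is plain Java only and assumes the text of each page has already been
extracted (by PDFBox or anything else). The class name PdfTermIndex, its
methods and the sample file names are hypothetical; Lucene would replace
this map with a real analyzer and on-disk index.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory index: term -> file name -> 1-based page numbers. */
final class PdfTermIndex {
    private final Map<String, Map<String, SortedSet<Integer>>> postings = new TreeMap<>();

    /** Records every word found in the already-extracted text of one page. */
    void addPage(String fileName, int pageNumber, String pageText) {
        // Lower-case for case-insensitive lookup, then split on anything
        // that is not a letter or a digit.
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split yields an empty first token if the text starts with punctuation
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(fileName, f -> new TreeSet<>())
                    .add(pageNumber);
        }
    }

    /** Returns file name -> pages for a term, or an empty map if the term is unknown. */
    Map<String, SortedSet<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        PdfTermIndex index = new PdfTermIndex();
        // Sample data only; real input would be one call per extracted page.
        index.addPage("lecture01.pdf", 1, "Gradient descent converges slowly.");
        index.addPage("lecture01.pdf", 3, "Stochastic gradient methods.");
        index.addPage("paper.pdf", 2, "We analyse gradient noise.");
        System.out.println(index.lookup("gradient"));
    }
}
```

Sorted maps and sets keep the printed index in alphabetical and page
order, which is what you'd want on paper in an exam.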

One note: a number of commentators report that PDFBox is slow in the
Alfresco environment. For all I know it's slow, period. You might want
to consider pdftotext. There are some decent articles on using it
instead of PDFBox with Alfresco.
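
If you go the pdftotext route from Java, page numbers are still
recoverable, because pdftotext separates pages with form feed characters
unless told otherwise. A minimal sketch, assuming pdftotext is installed
and on the PATH; the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: runs pdftotext and splits its output into pages. */
final class PdfToTextPages {

    /** Splits pdftotext output on form feeds; list index 0 is page 1. */
    static List<String> splitPages(String pdftotextOutput) {
        List<String> pages = new ArrayList<>();
        // String.split drops trailing empty strings, so the form feed after
        // the last page does not create a phantom page. Blank pages in the
        // middle are kept, so later page numbers stay correct.
        for (String page : pdftotextOutput.split("\f")) {
            pages.add(page);
        }
        return pages;
    }

    /** Runs "pdftotext -enc UTF-8 <pdf> -" and returns what it writes to stdout. */
    static String runPdfToText(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdftotext", "-enc", "UTF-8", pdfPath, "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException("pdftotext failed for " + pdfPath);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PdfToTextPages some.pdf
        List<String> pages = splitPages(runPdfToText(args[0]));
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i).length() + " chars");
        }
    }
}
```

Each element of the returned list can then be fed, with its page number,
into whatever index you build.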

AHS


