| From | Arved Sandstrom <asandstrom3minus1@eastlink.ca> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: pdf index builder |
| References | <563461764344789594.415272bravegag-hotmail.com@news.individual.net> |
| Message-ID | <yNqDq.11598$c27.562@newsfe22.iad> |
| Organization | Public Usenet Newsgroup Access |
| Date | 2011-12-06 11:46 -0400 |
On 11-12-05 11:03 AM, Giovanni Azua wrote:
> Hello!
>
> I have the strong need to do the following. Given a set of PDF files
> scattered across multiple directories, build a global index that includes
> for every index term the file names and corresponding pages where such
> index occurs. A really nice to have would be to "parse" formulas but I
> guess these are stored as images ...
>
> Before I go ahead and build a solution using Apache's PDFBox and/or iText
> can anyone advice if such solution exists? even if commercial? I googled
> for this already ...
>
> My use-case for this is a very critical open book exam but there are no
> books instead a bunch of dense PDF papers and lectures (a lot) if I get
> such index I might get an edge here :)
>
> TIA,
> Best regards,
> Giovanni
>
> --
> Giovanni

Presumably you don't want to get as high-powered (and costly and complicated) as something like CBR (content based retrieval) in IBM FileNet P8. :-)

AFAIK Alfresco uses PDFBox with Lucene for PDF text extraction and indexing. If you're in control of the entire Alfresco system you'd have access to the indexing data in its raw form. But I don't see the point, I'd myself simply run PDFBox and Lucene standalone, if all you want is a global index. Granted, Alfresco is not a complicated install.

One note: PDFBox is noted by a number of commentators to be slow in the Alfresco environment. For all I know it's slow, period. You might want to consider pdftotext. There are some decent articles on using it vice PDFBox with Alfresco.

AHS
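For what it's worth, the pdftotext route can be sketched with nothing but the JDK. This is only a rough illustration, not a replacement for Lucene: it assumes each PDF has already been converted with something like `pdftotext paper.pdf paper.txt`, and that the resulting text keeps pdftotext's default form feed between pages. The class name `PageIndexSketch`, the `addDocument` helper, the three-character minimum and the crude tokenizer are all made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: builds term -> file -> page numbers from pdftotext output. */
final class PageIndexSketch {

    /** Adds one document's text to the index. Assumes pages are separated by form feeds. */
    static void addDocument(Map<String, Map<String, SortedSet<Integer>>> index,
                            String fileName, String text) {
        String[] pages = text.split("\f");
        for (int i = 0; i < pages.length; i++) {
            int pageNumber = i + 1; // report pages 1-based
            // Split on anything that is not a letter or digit: crude, but dependency-free.
            String[] tokens = pages[i].toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
            for (String token : tokens) {
                if (token.length() < 3) {
                    continue; // skip empty strings and very short tokens (arbitrary cutoff)
                }
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .computeIfAbsent(fileName, f -> new TreeSet<>())
                     .add(pageNumber);
            }
        }
    }

    /** Walks a directory tree of .txt files (pdftotext output) and prints the index. */
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Map<String, Map<String, SortedSet<Integer>>> index = new TreeMap<>();
        List<Path> textFiles;
        try (Stream<Path> walk = Files.walk(root)) {
            textFiles = walk.filter(p -> p.toString().endsWith(".txt"))
                            .collect(Collectors.toList());
        }
        for (Path p : textFiles) {
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            addDocument(index, root.relativize(p).toString(), text);
        }
        // One line per term: term -> {file=[pages], ...}
        index.forEach((term, hits) -> System.out.println(term + " -> " + hits));
    }
}
```

Run it with the directory holding the converted text files as the only argument. A real solution would want stop words, stemming and phrase queries, which is what Lucene brings to the table; formulas stored as images won't show up in the extracted text at all.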