| Started by | Giovanni Azua <bravegag@hotmail.com> |
|---|---|
| First post | 2011-12-05 15:03 +0000 |
| Last post | 2011-12-09 10:58 -0800 |
| Articles | 3 — 3 participants |
pdf index builder Giovanni Azua <bravegag@hotmail.com> - 2011-12-05 15:03 +0000
Re: pdf index builder Arved Sandstrom <asandstrom3minus1@eastlink.ca> - 2011-12-06 11:46 -0400
Re: pdf index builder Roedy Green <see_website@mindprod.com.invalid> - 2011-12-09 10:58 -0800
| From | Giovanni Azua <bravegag@hotmail.com> |
|---|---|
| Date | 2011-12-05 15:03 +0000 |
| Subject | pdf index builder |
| Message-ID | <563461764344789594.415272bravegag-hotmail.com@news.individual.net> |
Hello!

I have a strong need to do the following. Given a set of PDF files scattered across multiple directories, build a global index that lists, for every index term, the file names and the pages where that term occurs. A really nice-to-have would be to "parse" formulas, but I guess these are stored as images ...

Before I go ahead and build a solution using Apache's PDFBox and/or iText, can anyone advise whether such a solution already exists, even a commercial one? I have googled for this already ...

My use case is a very critical open-book exam, but there are no books; instead there is a bunch of dense PDF papers and lectures (a lot). If I get such an index I might get an edge here :)

TIA,
Best regards,
Giovanni

--
Giovanni
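As a rough illustration of the index described in this post (term, then file name, then page numbers), here is a minimal Java sketch. It assumes the per-page text has already been extracted by some other tool; the class name `GlobalPdfIndex` and its methods are hypothetical and not part of PDFBox, iText, or any other library, and the tokenization is deliberately naive.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: maps each term to the files and 1-based pages where it occurs. */
final class GlobalPdfIndex {
    // term -> (file name -> sorted page numbers)
    private final SortedMap<String, SortedMap<String, SortedSet<Integer>>> index = new TreeMap<>();

    /** Adds one document; pages.get(0) is assumed to hold the text of page 1. */
    void addDocument(String fileName, List<String> pages) {
        for (int i = 0; i < pages.size(); i++) {
            // Naive tokenizer: lower-case, then split on anything that is not a letter or digit.
            String[] terms = pages.get(i).toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue; // split() yields an empty first token when the page starts with a separator
                }
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .computeIfAbsent(fileName, f -> new TreeSet<>())
                     .add(i + 1);
            }
        }
    }

    /** Returns file name -> pages for a term, or an empty map if the term was never seen. */
    SortedMap<String, SortedSet<Integer>> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedMap());
    }

    public static void main(String[] args) {
        GlobalPdfIndex idx = new GlobalPdfIndex();
        // Made-up page texts standing in for real extracted PDF content.
        idx.addDocument("lecture1.pdf", List.of("Markov chains", "Hidden Markov models"));
        idx.addDocument("paper.pdf", List.of("Bayesian models"));
        System.out.println(idx.lookup("markov")); // {lecture1.pdf=[1, 2]}
        System.out.println(idx.lookup("models")); // {lecture1.pdf=[2], paper.pdf=[1]}
    }
}
```

Sorted maps and sets keep the output in alphabetical and page order, which suits a printed index. Formulas stored as images would not show up in the extracted text, so this approach cannot index them.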
| From | Arved Sandstrom <asandstrom3minus1@eastlink.ca> |
|---|---|
| Date | 2011-12-06 11:46 -0400 |
| Message-ID | <yNqDq.11598$c27.562@newsfe22.iad> |
| In reply to | #10527 |
On 11-12-05 11:03 AM, Giovanni Azua wrote:
> Hello!
>
> I have the strong need to do the following. Given a set of PDF files
> scattered across multiple directories, build a global index that includes
> for every index term the file names and corresponding pages where such
> index occurs. A really nice to have would be to "parse" formulas but I
> guess these are stored as images ...
>
> Before I go ahead and build a solution using Apache's PDFBox and/or iText
> can anyone advice if such solution exists? even if commercial? I googled
> for this already ...
>
> My use-case for this is a very critical open book exam but there are no
> books instead a bunch of dense PDF papers and lectures (a lot) if I get
> such index I might get an edge here :)
>
> TIA,
> Best regards,
> Giovanni
>
> -- Giovanni

Presumably you don't want to get as high-powered (and costly and complicated) as something like CBR (content based retrieval) in IBM FileNet P8. :-)

AFAIK Alfresco uses PDFBox with Lucene for PDF text extraction and indexing. If you're in control of the entire Alfresco system you'd have access to the indexing data in its raw form. But I don't see the point; if all you want is a global index, I'd simply run PDFBox and Lucene standalone myself. Granted, Alfresco is not a complicated install.

One note: a number of commentators report that PDFBox is slow in the Alfresco environment. For all I know it's slow, period. You might want to consider pdftotext. There are some decent articles on using it vice PDFBox with Alfresco.

AHS
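If pdftotext is used for extraction, page numbers can be recovered from its output. The sketch below assumes pdftotext's default behavior of emitting a form feed character (U+000C) at each page break; the class `PageSplitter` is a hypothetical helper, and the sample string stands in for real pdftotext output.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits extracted text into pages at form feed characters. */
final class PageSplitter {
    /**
     * Returns the text of each page in order; element 0 is page 1.
     * Assumes pages are terminated or separated by '\f' (form feed).
     */
    static List<String> splitPages(String text) {
        List<String> pages = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\f') {
                pages.add(text.substring(start, i));
                start = i + 1;
            }
        }
        // Keep trailing text that is not followed by a final form feed.
        if (start < text.length()) {
            pages.add(text.substring(start));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in for text read from a file written by pdftotext.
        String extracted = "first page text\fsecond page text\f";
        List<String> pages = splitPages(extracted);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

The resulting list of page texts can then be fed to whatever indexer is chosen, whether Lucene or a hand-rolled map of terms to pages.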
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-12-09 10:58 -0800 |
| Message-ID | <ggm4e7926ptdfl5jdfi5jfnqi5jsapkd91@4ax.com> |
| In reply to | #10527 |
On 5 Dec 2011 15:03:46 GMT, Giovanni Azua <bravegag@hotmail.com> wrote, quoted or indirectly quoted someone who said :

>Before I go ahead and build a solution using Apache's PDFBox and/or iText
>can anyone advice if such solution exists? even if commercial? I googled
>for this already ...

There are a ton of PDF utilities. Have a browse at http://mindprod.com/jgloss/pdf.html

I would be quite surprised if what you want does not exist.

--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that even though I am quite a klutz, I can still produce something, in a sense perfect, because the computer gives me as many chances as I please to get it right.