pdf index builder

Started by	Giovanni Azua <bravegag@hotmail.com>
First post	2011-12-05 15:03 +0000
Last post	2011-12-09 10:58 -0800
Articles	3 — 3 participants

  pdf index builder Giovanni Azua <bravegag@hotmail.com> - 2011-12-05 15:03 +0000
    Re: pdf index builder Arved Sandstrom <asandstrom3minus1@eastlink.ca> - 2011-12-06 11:46 -0400
    Re: pdf index builder Roedy Green <see_website@mindprod.com.invalid> - 2011-12-09 10:58 -0800

#10527 — pdf index builder

From	Giovanni Azua <bravegag@hotmail.com>
Date	2011-12-05 15:03 +0000
Subject	pdf index builder
Message-ID	<563461764344789594.415272bravegag-hotmail.com@news.individual.net>

Hello!

I have a strong need to do the following: given a set of PDF files
scattered across multiple directories, build a global index that lists,
for every index term, the file names and corresponding pages where the
term occurs. A really nice-to-have would be to "parse" formulas, but I
guess these are stored as images ...

Before I go ahead and build a solution using Apache's PDFBox and/or iText,
can anyone advise whether such a solution already exists, even a
commercial one? I have googled for this already ...

My use case is a very critical open-book exam. There are no books, only a
large number of dense PDF papers and lectures, so if I get such an index
I might get an edge here :)

TIA,
Best regards,
Giovanni

-- Giovanni

#10558

From	Arved Sandstrom <asandstrom3minus1@eastlink.ca>
Date	2011-12-06 11:46 -0400
Message-ID	<yNqDq.11598$c27.562@newsfe22.iad>
In reply to	#10527

On 11-12-05 11:03 AM, Giovanni Azua wrote:
> Hello!
> 
> I have the strong need to do the following. Given a set of PDF files
> scattered across multiple directories, build a global index that includes
> for every index term the file names and corresponding pages where such
> index occurs. A really nice to have would be to "parse" formulas but I
> guess these are stored as images ...
> 
> Before I go ahead and build a solution using Apache's PDFBox and/or iText
> can anyone advice if such solution exists? even if commercial? I googled
> for this already ...
> 
> My use-case for this is a very critical open book exam but there are no
> books instead a bunch of dense PDF papers and lectures (a lot) if I get
> such index I might get an edge here :)
> 
> TIA,
> Best regards,
> Giovanni
> 
> -- Giovanni

Presumably you don't want to get as high-powered (and costly and
complicated) as something like CBR (content based retrieval) in IBM
FileNet P8. :-)

AFAIK Alfresco uses PDFBox with Lucene for PDF text extraction and
indexing. If you're in control of the entire Alfresco system you'd have
access to the indexing data in its raw form. But I don't see the point;
I'd simply run PDFBox and Lucene standalone if all you want is a
global index. Granted, Alfresco is not a complicated install.
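
To make the "global index" idea concrete, here is a minimal sketch of the
data structure involved: term -> file name -> pages. It uses only the JDK;
the class name PageIndex and its methods are hypothetical, and a plain
in-memory map stands in for Lucene here. The per-page text is assumed to
come from whatever extractor you choose, for example PDFBox's
PDFTextStripper restricted to one page at a time with setStartPage and
setEndPage.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps term -> file name -> sorted page numbers.
final class PageIndex {
    // Assumption: a term is any run of letters or digits; adjust to taste.
    private static final Pattern TERM = Pattern.compile("[\\p{L}\\p{N}]+");

    private final Map<String, Map<String, SortedSet<Integer>>> index = new TreeMap<>();

    /** Records every term found in the text of one page of one file. */
    void addPage(String fileName, int pageNumber, String pageText) {
        Matcher m = TERM.matcher(pageText);
        while (m.find()) {
            // Lower-case so that "Lucene" and "lucene" share one entry.
            String term = m.group().toLowerCase(Locale.ROOT);
            index.computeIfAbsent(term, t -> new TreeMap<>())
                 .computeIfAbsent(fileName, f -> new TreeSet<>())
                 .add(pageNumber);
        }
    }

    /** Returns file name -> pages for a term, or an empty map if unknown. */
    Map<String, SortedSet<Integer>> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeMap<>());
    }
}
```

You would call addPage once per extracted page and then lookup("term") to
see where a term occurs. A real index would probably also want stop words
and stemming, which is where Lucene's analyzers earn their keep.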

One note: PDFBox is noted by a number of commentators to be slow in the
Alfresco environment. For all I know it's slow, period. You might want
to consider pdftotext. There are some decent articles on using it vice
PDFBox with Alfresco.
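
If you go the pdftotext route from Java, a rough sketch could look like
the following. It assumes pdftotext is on the PATH and Java 9 or later;
the class and method names are made up for illustration. pdftotext writes
to standard output when the output file is given as "-" and separates
pages with form feed characters, which is what lets you recover page
numbers.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the external pdftotext command.
final class PdfToTextPages {

    /** Runs pdftotext on one file and returns its complete text output. */
    static String extract(Path pdf) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("pdftotext", "-enc", "UTF-8", pdf.toString(), "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show tool errors on our stderr
                .start();
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (p.waitFor() != 0) {
            throw new IOException("pdftotext failed for " + pdf);
        }
        return out;
    }

    /** Splits pdftotext output at form feeds; list index + 1 is the page number. */
    static List<String> splitPages(String output) {
        List<String> pages = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < output.length(); i++) {
            if (output.charAt(i) == '\f') {
                pages.add(output.substring(start, i));
                start = i + 1;
            }
        }
        // Keep trailing text that is not followed by a form feed.
        if (start < output.length()) {
            pages.add(output.substring(start));
        }
        return pages;
    }
}
```

Something like splitPages(extract(path)) then gives you one string per
page to feed into whatever index you build.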

AHS

#10631

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-12-09 10:58 -0800
Message-ID	<ggm4e7926ptdfl5jdfi5jfnqi5jsapkd91@4ax.com>
In reply to	#10527

On 5 Dec 2011 15:03:46 GMT, Giovanni Azua <bravegag@hotmail.com>
wrote, quoted or indirectly quoted someone who said :

>Before I go ahead and build a solution using Apache's PDFBox and/or iText
>can anyone advice if such solution exists? even if commercial? I googled
>for this already ...

There are a ton of PDF utilities.  Have a browse at
http://mindprod.com/jgloss/pdf.html

I would be quite surprised if what you want does not exist.
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
