Groups > comp.lang.ruby > #4693 > unrolled thread
| Started by | Felipe Espinoza <fespinozacast@gmail.com> |
|---|---|
| First post | 2011-05-17 16:04 -0500 |
| Last post | 2011-05-19 10:21 +0200 |
| Articles | 8 — 5 participants |
Pdf Parsing Challenge Felipe Espinoza <fespinozacast@gmail.com> - 2011-05-17 16:04 -0500
Re: Pdf Parsing Challenge Phillip Gawlowski <cmdjackryan@googlemail.com> - 2011-05-17 16:31 -0500
Re: Pdf Parsing Challenge Felipe Espinoza <fespinozacast@gmail.com> - 2011-05-17 16:38 -0500
Re: Pdf Parsing Challenge Phillip Gawlowski <cmdjackryan@googlemail.com> - 2011-05-17 16:45 -0500
Re: Pdf Parsing Challenge Mark T <paradisaeidae@gmail.com> - 2011-05-17 19:42 -0500
Re: Pdf Parsing Challenge Mark T <paradisaeidae@gmail.com> - 2011-05-17 19:37 -0500
Re: Pdf Parsing Challenge Kouhei Sutou <kou@cozmixng.org> - 2011-05-18 08:23 -0500
Re: Pdf Parsing Challenge Johannes Held <johannes.held@informatik.uni-erlangen.de> - 2011-05-19 10:21 +0200
| From | Felipe Espinoza <fespinozacast@gmail.com> |
|---|---|
| Date | 2011-05-17 16:04 -0500 |
| Subject | Pdf Parsing Challenge |
| Message-ID | <b3e54e146d346d393b16b935800076bb@ruby-forum.com> |
Hi everyone,

I'm trying to use the pdf-reader gem, but I'm having trouble understanding how the gem works. If someone can help me with this, I'll be really grateful.

The Problem:

I have to extract some data from a paper in PDF format. I just need some data from page 1, like the title of the paper, the authors list, the universities of these authors, their e-mail addresses, the abstract, and the keywords.

How can I extract this data from this paper?
http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf

A simple string containing the complete contents of each field (keywords, abstract, etc.) would be enough.

It's not necessary to use this gem, but I need one string per field with this info. How can I do that?

--
Posted via http://www.ruby-forum.com/.
| From | Phillip Gawlowski <cmdjackryan@googlemail.com> |
|---|---|
| Date | 2011-05-17 16:31 -0500 |
| Message-ID | <BANLkTi=RDPn6fxrtwsMTc5TJo=Ofb5Sh-Q@mail.gmail.com> |
| In reply to | #4693 |
On Tue, May 17, 2011 at 11:04 PM, Felipe Espinoza <fespinozacast@gmail.com> wrote:
>
> I have to extract some data from a paper in a pdf format. I just need
> some data from the page 1, like the title of the paper, the authors
> list, the universities of these authors, their mails, the abstract and
> keywords
>
> how I can extract this data from this paper?
> http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf

Mark the text, copy it.

> It's not necessary to use this gem, but I need a string for each field
> with this info, how can I do that?

Open a text editor, paste it, and construct the data you need.

Doing the research for how to do what you want, and then writing and debugging a script that does it, takes longer than just doing it by hand. ;)

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.
| From | Felipe Espinoza <fespinozacast@gmail.com> |
|---|---|
| Date | 2011-05-17 16:38 -0500 |
| Message-ID | <5dd484266f0d9bfdab234da0624dc155@ruby-forum.com> |
| In reply to | #4695 |
I need to do this automatically. I'll be doing it for a lot of papers and then loading the extracted data into a database.
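The batch requirement described here can be sketched independently of whichever PDF library ends up doing the extraction. The following is a minimal sketch, not a definitive implementation: `collect_records` and `write_json_lines` are hypothetical helper names, and the `extractor` argument is assumed to be any callable that takes a PDF path and returns a Hash of fields.

```ruby
require 'json'

# Hypothetical helper: run `extractor` (any callable that takes a PDF path
# and returns a Hash of fields) over every *.pdf file in `dir`.
# A failure on one paper is recorded instead of aborting the whole batch.
def collect_records(dir, extractor)
  Dir.glob(File.join(dir, "*.pdf")).sort.map do |path|
    begin
      { "file" => File.basename(path) }.merge(extractor.call(path))
    rescue StandardError => e
      { "file" => File.basename(path), "error" => e.message }
    end
  end
end

# Hypothetical helper: write one JSON object per line, a format that most
# databases can bulk-load in a later step.
def write_json_lines(records, out_path)
  File.open(out_path, "w") do |out|
    records.each { |record| out.puts(JSON.generate(record)) }
  end
end
```

Keeping the extractor pluggable means the loop does not change if the parsing approach does, and the recorded errors show which papers need manual attention.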
| From | Phillip Gawlowski <cmdjackryan@googlemail.com> |
|---|---|
| Date | 2011-05-17 16:45 -0500 |
| Message-ID | <BANLkTimtdB5fvPipSM_PV4aEJ+yK_-zWDA@mail.gmail.com> |
| In reply to | #4696 |
On Tue, May 17, 2011 at 11:38 PM, Felipe Espinoza <fespinozacast@gmail.com> wrote:
> I need to do this automatically, I'll be doing it for a lot of papers
> and then take that data to a database

Unless the papers are all (near) identical in layout, this will be difficult, since PDFs lack semantic information.

Can you instead query a DB for the DOI of the paper (getting the DOI via the filename, or via the title of the paper, assuming the title is easy to grab), and use said DOI DB to get the information in a way that's much easier to process?

--
Phillip Gawlowski
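As a rough illustration of the filename and title route suggested above, here is a minimal sketch. It assumes every file follows the `<conference>_<year>_<number>.pdf` pattern of the one example in the thread, which may not hold for other papers. `LOOKUP_ENDPOINT` is a placeholder URL rather than a real service, and `paper_key` and `lookup_url` are hypothetical names.

```ruby
require 'uri'

# Placeholder only: substitute the search URL of whichever DOI or
# bibliographic database is actually available.
LOOKUP_ENDPOINT = "https://doi-lookup.example/search"

# Split a name such as "CLEI_2008_002.pdf" into its parts.
# Assumption: files are named <conference>_<year>_<number>.pdf.
# Returns nil when the name does not fit that pattern.
def paper_key(filename)
  match = File.basename(filename, ".pdf").match(/\A([A-Za-z]+)_(\d{4})_(\d+)\z/)
  return nil unless match

  { conference: match[1], year: match[2].to_i, number: match[3].to_i }
end

# Build a query URL that searches the (hypothetical) service by title.
def lookup_url(title)
  "#{LOOKUP_ENDPOINT}?#{URI.encode_www_form(title: title)}"
end
```

The sketch only builds the key and the query URL; fetching and interpreting the response depends entirely on the database chosen.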
| From | Mark T <paradisaeidae@gmail.com> |
|---|---|
| Date | 2011-05-17 19:42 -0500 |
| Message-ID | <BANLkTi=jiaFEJyZvhnVeXFyV3gM0GWgj-A@mail.gmail.com> |
| In reply to | #4693 |
Inkscape has a command-line conversion option. I've only used it on a Linux instance. It converts one page at a time, though. From memory, it offers more than three output formats.

It's not exactly a pure-Ruby approach, though scripting such a task is certainly within Ruby's domain.

Your example is still loading here, so this reply may be completely out of context.

MarkT

> I have to extract some data from a paper in a pdf format. I just need
> some data from the page 1, like the title of the paper, the authors
> list, the universities of these authors, their mails, the abstract and
> keywords

I _top_ _post_ _so_ _there_
| From | Kouhei Sutou <kou@cozmixng.org> |
|---|---|
| Date | 2011-05-18 08:23 -0500 |
| Message-ID | <20110518.222351.494672229751371503.kou@cozmixng.org> |
| In reply to | #4693 |
Hi,
In <b3e54e146d346d393b16b935800076bb@ruby-forum.com>
"Pdf Parsing Challenge" on Wed, 18 May 2011 06:04:19 +0900,
Felipe Espinoza <fespinozacast@gmail.com> wrote:
> The Problem:
>
> I have to extract some data from a paper in a pdf format. I just need
> some data from the page 1, like the title of the paper, the authors
> list, the universities of these authors, their mails, the abstract and
> keywords
>
> how I can extract this data from this paper?
> http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
>
> with a simple string that contains the information of a complete field
> (keywords, abstract, etc) would help me
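% gem install poppler
% cat extract-data-from-paper.rb
require 'tempfile'
require 'open-uri'
require 'poppler'

ARGV.each do |url|
  # Download the PDF to a temporary file so Poppler can open it by path.
  pdf = Tempfile.new(["extract-data-from-paper", ".pdf"])
  pdf.binmode
  # On Ruby 3.0 and later, call URI.open(url) here instead of open(url).
  open(url) do |input|
    pdf.write(input.read)
  end
  pdf.close

  # Extract the plain text of the first page and split it into lines.
  document = Poppler::Document.new(pdf.path)
  title_page = document.pages.first
  text = title_page.get_text
  lines = text.lines.to_a

  # In this paper the title spans the first two lines of page 1 ...
  title = lines[0, 2].collect(&:strip).join(" ")
  puts title
  # ... and the author list spans the next two.
  authors = lines[2, 2].collect(&:strip).join(" ")
  puts authors
  # ...
end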
% ruby1.9 extract-data-from-paper.rb http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf
Query Routing Process for Adapted Information Retrieval using Agents
Angela Carrillo-Ramos2, Jérôme Gensel1, Marlène Villanova-Oliver1, Hervé Martin1, and Miguel Torres-Moreno2
Thanks,
--
kou
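The part elided as `# ...` in the script above could continue along the following lines. This is a minimal sketch under stated assumptions, not kou's code: it works on the plain page text only, assumes the page contains literal "Abstract" and "Keywords" headings, and keeps the script's two-lines-each guess for title and authors as adjustable defaults. `extract_fields` and `section` are hypothetical names.

```ruby
# Loose e-mail pattern; good enough for harvesting, not for validation.
EMAIL_PATTERN = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

# Split the plain text of a first page into one string per field.
# Assumptions: the title and authors occupy the first non-empty lines
# (two lines each by default, as in the script above), and the page has
# literal "Abstract" and "Keywords" headings. Other layouts need other rules.
def extract_fields(text, title_lines: 2, author_lines: 2)
  lines = text.lines.map(&:strip).reject(&:empty?)
  {
    title:    lines[0, title_lines].join(" "),
    authors:  lines[title_lines, author_lines].join(" "),
    emails:   text.scan(EMAIL_PATTERN),
    abstract: section(text, /Abstract[.:]?/i, /Keywords?[.:]?/i),
    # The keyword list is taken to end at the first blank line or end of text.
    keywords: section(text, /Keywords?[.:]?/i, /\n\s*\n|\z/)
  }
end

# Return the text between two markers with whitespace collapsed,
# or nil when the markers are not found.
def section(text, start_marker, end_marker)
  match = text.match(/#{start_marker}\s*(.*?)\s*(?:#{end_marker})/m)
  match && match[1].gsub(/\s+/, " ")
end
```

Calling `extract_fields(text)` with the `text` obtained in the script returns a Hash with one entry per field. Because the markers are matched case-insensitively, a title containing the word "abstract" would confuse this sketch, so the rules should be checked against a sample of the real papers.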
| From | Johannes Held <johannes.held@informatik.uni-erlangen.de> |
|---|---|
| Date | 2011-05-19 10:21 +0200 |
| Message-ID | <93k25cFal0U1@mid.dfncis.de> |
| In reply to | #4693 |
Do you need this for your own application, or do you want to build a literature database of your own?

For the latter, you could try [Mendeley][1], a tool (web-based and desktop-based) for managing your research literature. It can parse PDFs, and much more. Once the PDFs are parsed, you can parse the generated BibTeX file …

[1]: http://www.mendeley.com

--
Regards,
Johannes
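Reading such an exported BibTeX file from Ruby could look roughly like the sketch below. It is deliberately simplistic and rests on an assumption about the export: every field is a flat `name = {value}` or `name = "value"` pair. Nested braces, `@string` macros and `#` concatenation are not handled, and `parse_bibtex_entry` is a hypothetical name.

```ruby
# Parse a single BibTeX entry into its type, citation key and fields.
# Assumption: fields are flat `name = {value}` or `name = "value"` pairs;
# nested braces, @string macros and # concatenation are NOT supported.
# Returns nil when the text does not start like a BibTeX entry.
def parse_bibtex_entry(entry)
  header = entry.match(/@(\w+)\s*\{\s*([^,\s]+)\s*,/)
  return nil unless header

  pairs = entry.scan(/(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")/)
  fields = pairs.map do |name, braced, quoted|
    [name.downcase, (braced || quoted).strip]
  end

  { type: header[1].downcase, key: header[2], fields: fields.to_h }
end
```

For files with nested braces or macros, a dedicated BibTeX parsing library would be the safer choice over regular expressions.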