Magnifying pdf cleans irregularities?

From	no.top.post@gmail.com
Newsgroups	comp.lang.postscript
Subject	Magnifying pdf cleans irregularities?
Date	2011-10-22 21:41 +0000
Organization	A noiseless patient Spider
Message-ID	<j7vda5$2kk$1@dont-email.me>

I've been trying to extract the ASCII text from:
http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
using gocr.

So far, using:
pdftoppm -f 13 -l 13 -r 300 sun.clarion2005.pdf | gocr -o ppm13.300
gives the best optical character recognition results,
but it still reads "k" as "h".

What confuses me is that when I view the file with xpdf, the text
looks as if it was printed by a 1950s typewriter in bad condition.

I especially remember "2004", where the bottoms of the digits were
badly misaligned. But if I set xpdf to magnify a section of
the text, it looks clean, and of course gocr decodes it perfectly.
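
If magnified text decodes cleanly, one thing to try is rendering the
page at more than 300 dpi before handing it to gocr. The following is
only a sketch of that comparison, not a confirmed fix. The resolutions
450 and 600 and the helper names out_name and ocr_page are my own
assumptions. It also assumes that pdftoppm writes the PPM to stdout
when no output root is given (as in the command above) and that gocr
reads stdin when given "-i -".

```sh
#!/bin/sh
# Sketch: OCR one page at several rendering resolutions and compare.
# PDF and PAGE default to the file and page from the post.
pdf=${PDF:-sun.clarion2005.pdf}
page=${PAGE:-13}

# Hypothetical helper: output file name for a resolution, e.g. ppm13.600
out_name() {
    printf 'ppm%s.%s\n' "$page" "$1"
}

# Hypothetical helper: render one page at the given dpi and pipe it to gocr.
# Assumes pdftoppm writes to stdout here and gocr reads stdin via "-i -".
ocr_page() {
    pdftoppm -f "$page" -l "$page" -r "$1" "$pdf" |
        gocr -i - -o "$(out_name "$1")"
}

# Run only if both tools and the PDF are actually present.
if command -v pdftoppm >/dev/null 2>&1 &&
   command -v gocr >/dev/null 2>&1 &&
   [ -f "$pdf" ]; then
    for res in 300 450 600; do   # 450 and 600 are arbitrary trial values
        ocr_page "$res"
    done
fi
```

Comparing the resulting ppm13.* files (with diff, for instance) would
show whether the "k"/"h" confusion goes away at a higher resolution.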

I don't know exactly how the rendering works, but I imagine
that if the normal-size view uses a bad-quality font and the
magnified view uses a good-quality font, that could
explain what I'm seeing.

Since the information that the character IS a "k" and not
an "h" is in the *.pdf file, quite independent of ANY
rendering, and gocr can correctly decode a BIG font,
should I not expect to be able to get gocr to decode it
correctly by <filtering it through a suitable font>?

Thanks,

== Chris Glur.


