From: Helge Blischke
Newsgroups: comp.lang.postscript
Subject: Re: Magnifying pdf cleans irregularities?
Followup-To: comp.lang.postscript
Date: Sun, 23 Oct 2011 10:29:15 +0200
Message-ID: <9gi1itF8diU1@mid.individual.net>

no.top.post@gmail.com wrote:

> By using gocr on:
> http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
> I've been trying to extract the ASCII.
>
> So far, using:
> pdftoppm -f 13 -l 13 -r 300 sun.clarion2005.pdf | gocr -o ppm13.300
> gives the best Optical Character recognition results.
> But it sees "k" as "h".
>
> What confuses me, is that when I view with xpdf, the text
> looks as if it was printed by a bad-condition 1950 typewriter.
>
> I especially remember "2004" where the 'bottoms' were
> badly un-aligned. But if I set xpdf to 'magnify' a section of
> the text, it looks clean, and of course gocr decodes perfectly.
>
> I don't know exactly how the rendering works, but imagine
> that if the 'normal size' uses a bad quality font, and the
> magnified version uses a good quality font, that could
> explain what I'm seeing.
>
> Since the information that 'the char IS a "k" and not
> a "h"' is in the *.pdf file, and quite independent of ANY
> rendering, and gocr can correctly decode BIG font,
> should I not expect to be able to get gocr to decode
> correctly, by ?
>
> Thanks,
>
> == Chris Glur.

If you look at the PDF properties, you'll see that the fonts used are
bitmapped Type 3 fonts (at a fairly high resolution, though).
That leads to degraded rendering whenever the bitmaps have to be
recalculated because the resolution of the canvas differs from the
resolution of the bitmaps.

Helge
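The resampling effect can be illustrated with a toy example. This is only a sketch under stated assumptions: the row of "stems" is a made-up stand-in for one scanline of a bitmapped glyph, `resample_row` and `run_lengths` are hypothetical helpers, and plain nearest-neighbour sampling is assumed, which is not necessarily what xpdf or pdftoppm do internally.

```python
def resample_row(row, new_width):
    """Nearest-neighbour resampling of a 1-bit pixel row to new_width pixels."""
    old_width = len(row)
    return [row[(i * old_width) // new_width] for i in range(new_width)]


def run_lengths(row):
    """Return the widths of consecutive runs of set pixels in a row."""
    runs, count = [], 0
    for pixel in row:
        if pixel:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs


# Hypothetical glyph scanline: four stems, each 3 pixels wide,
# separated by 4 blank pixels (28 pixels in total).
source = ([1] * 3 + [0] * 4) * 4

# Shrinking by a non-integer factor (28 -> 11 pixels): the stems,
# identical in the source, no longer have identical widths.
shrunk = resample_row(source, 11)

# Enlarging by an integer factor (28 -> 56 pixels): every stem is
# scaled uniformly, so the regular pattern survives.
magnified = resample_row(source, 56)

print("source   :", run_lengths(source))
print("shrunk   :", run_lengths(shrunk))
print("magnified:", run_lengths(magnified))
```

Stems that are equally wide in the source come out with unequal widths after the non-integer reduction, which is the kind of irregularity described in the question. The integer enlargement keeps them uniform. Rendering closer to the native resolution of the bitmaps should therefore give the OCR cleaner input, though what that resolution is for this particular file would have to be checked.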