From: John-Paul Stewart <jpstewart@sympatico.ca>
Newsgroups: comp.os.linux.misc
Subject: Re: pdf & O.C.R ?
Date: Sat, 23 May 2015 20:46:03 -0400
Message-ID: <bjj73c-fim.ln1@mail.binaryfoundry.ca>
References: <pan.2015.05.23.07.50.46@gmail.com>
In-Reply-To: <pan.2015.05.23.07.50.46@gmail.com>

On 23/05/15 03:49 AM, Unknown wrote:
> I'm confused and disturbed that xpdf of:
> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>    is perfect to the pixel, with maximum magnification [400%],
>    which is expected, since it's computer-font generated, whereas:
> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
>    shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>
> And scanned copies of papers are apparently normal.
>
> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
> from COMPANY%20LAW%20ACT.pdf ?
> That would mean that the mouse-driver is doing O.C.R.   ?!

Why would you think the mouse driver is doing OCR?

A PDF file can contain both text and images.  It is common when scanning
paper documents to turn them into a so-called "searchable PDF" that
contains the scanned image of the page overlaid on top of the (OCRed)
text.  The OCR is done once, when the file is created, not when you
select text in the viewer.  So what you see visually is the (possibly
blurry) picture, while what the mouse is copying (and pdftotext is
extracting) is the text that's hidden underneath.
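
To make that concrete, here is a rough Python sketch of the idea.  It is
not how xpdf or pdftotext actually work, and the page content below is
made up for illustration (the strings, the font name /F1 and the image
name /Im0 are all assumptions).  Real PDF content streams are usually
compressed and use font encodings, so a regex like this would not work
on a real file.  The point is only that the text operators and the image
are separate things on the same page, and the text is drawn first so the
scan covers it.

```python
import re

# Hypothetical, heavily simplified page content of a "searchable PDF".
# The BT...ET block is the OCRed text layer; "/Im0 Do" paints the scanned
# page image afterwards, so the image ends up on top of the text.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 700 Td
(COMPANIES ACT) Tj
0 -14 Td
(Chapter 1: Interpretation) Tj
ET
q
612 0 0 792 0 0 cm
/Im0 Do
Q
"""


def extract_text(stream: bytes) -> list[str]:
    """Return the literal strings shown by Tj operators.

    Image drawing operators are simply ignored, which is why a blurry
    scan has no effect on what gets copied out.
    """
    pattern = rb"\(((?:[^()\\]|\\.)*)\)\s*Tj"
    return [m.decode("latin-1") for m in re.findall(pattern, stream)]


def draws_image(stream: bytes) -> bool:
    """Report whether the page paints an external object such as an image."""
    return re.search(rb"/\w+\s+Do\b", stream) is not None


if __name__ == "__main__":
    print(extract_text(CONTENT_STREAM))
    print(draws_image(CONTENT_STREAM))
```

The viewer renders both layers, but a copy or a text extraction only
ever looks at the strings, never at the pixels.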

Adobe's own Acrobat software can create such "searchable PDF" files. 
I'm sure there are other tools, too.