From: John-Paul Stewart <jpstewart@sympatico.ca>
Newsgroups: comp.os.linux.misc
Subject: Re: pdf & O.C.R ?
Date: Fri, 29 May 2015 20:31:23 -0400
Message-ID: <rvcn3c-jg9.ln1@mail.binaryfoundry.ca>
References: <pan.2015.05.23.07.50.46@gmail.com>	<bjj73c-fim.ln1@mail.binaryfoundry.ca> <pan.2015.05.27.17.12.03@gmail.com>

On 27/05/15 01:10 PM, Unknown wrote:
> On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
>
>> On 23/05/15 03:49 AM, Unknown wrote:
>>> I'm confused and disturbed that xpdf of:
>>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>>     is perfect to the pixel, with maximum magnification [400%], which is
>>>     expected, since it's computer-font generated, whereas:
>>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
>>>     shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>>
>>> And scanned copies of papers are apparently normal.
>>>
>>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>>> from COMPANY%20LAW%20ACT.pdf ?
>>> That would mean that the mouse-driver is doing O.C.R.   ?!
>>
>> Why would you think the mouse driver is doing OCR?
>>
> OBVIOUSLY from my description there's pdftotext happening via mouse.

No, not at all.  The mouse is merely selecting and copying text that is
already stored in the PDF file.

>> A PDF file can contain both text and images.  It is common when scanning
>> paper documents to turn them into a so-called "searchable PDF" that
>> contains the scanned image of the page overlaid on top of the (OCRed)
>> text.  So what you see visually is the (possibly blurry) picture, while
>> what the mouse is copying (and pdftotext is extracting) is the text
>> that's hidden underneath.
>>
>> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
>> sure there are other tools, too.
>
> What extreme deception. There's a layer of pixel-perfect pdftotext-able,
> covered by the blurry photo-image ?!

Again, no.  There's no deception, and nothing "pixel-perfect" about the
hidden text.  It will have lost nearly all of its formatting during the
OCR process, and the text itself may be plain wrong in places (OCR is
about 90% accurate, IME) because OCR struggles with poor-quality
images.  For a human reader, the image of a scanned document is more
reliable than the OCRed text.  Some PDFs (the so-called "searchable"
ones) include the text as a convenience for searching within the file;
it isn't guaranteed to be accurate.  (I'm talking only about PDFs
created from scanned documents.)
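
To make the "hidden text" idea concrete, here is a rough Python sketch.
It assumes the hidden layer is written with PDF text rendering mode 3
("3 Tr", meaning neither fill nor stroke, i.e. invisible), which is a
common way for OCR tools to do it; I haven't checked how the file in
question was actually produced.  The content stream below is hand-made
and simplified for illustration (real ones are usually compressed), and
hidden_text() is a hypothetical helper, not part of xpdf or pdftotext.

```python
import re

# Hand-written, simplified page content stream (illustrative only, not
# taken from a real file).  The scanned page image is drawn first, then
# the OCR text is written in render mode 3, which paints nothing.
CONTENT_STREAM = b"""
q
612 0 0 792 0 0 cm
/Im0 Do
Q
BT
/F1 10 Tf
3 Tr
72 700 Td
(Companies Act) Tj
0 -14 Td
(Chapter 1) Tj
0 Tr
0 -600 Td
(visible footer) Tj
ET
"""

# Matches either a render-mode change ("<digit> Tr") or a literal
# string shown with Tj.  Only these two operators are handled.
TOKEN = re.compile(rb"(\d)\s+Tr\b|\(((?:[^()\\]|\\.)*)\)\s*Tj")


def hidden_text(stream: bytes) -> list[str]:
    """Return the strings shown while text render mode 3 is active."""
    mode = 0  # PDF default: fill the glyphs (visible text)
    found = []
    for match in TOKEN.finditer(stream):
        if match.group(1) is not None:
            mode = int(match.group(1))
        elif mode == 3:
            found.append(match.group(2).decode("latin-1"))
    return found


if __name__ == "__main__":
    for line in hidden_text(CONTENT_STREAM):
        print(line)
```

The strings it finds are what a viewer hands to the clipboard when you
drag the mouse over the page.  No OCR happens at that point; it was done
once, when the PDF was created.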