From: Unknown <dog@gmail.com>
Newsgroups: comp.os.linux.misc
Subject: Re: pdf & O.C.R ?
Date: Sat, 13 Jun 2015 13:29:22 +0000 (UTC)
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <pan.2015.06.13.13.30.56@gmail.com>
References: <pan.2015.05.23.07.50.46@gmail.com> <bjj73c-fim.ln1@mail.binaryfoundry.ca> <mk1sam$r2l$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
User-Agent: Pan/0.133 (House of Butterflies)

On Tue, 26 May 2015 13:26:46 +0000, Joe Beanfish wrote:

> On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
> 
>> On 23/05/15 03:49 AM, Unknown wrote:
>>> I'm confused and disturbed that xpdf of:
>>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>>    is perfect to the pixel, with maximum magnification [400%], which
>>>    is expected, since it's computer-font generated, whereas:
>>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
>>>    looks blotchy and shows fibers, as if it's a photo of a paper copy.
>>>
>>> And scanned copies of papers are apparently normal.
>>>
>>> BUT!! How is it that xpdf allows me to extract the text, via
>>> mouse-copy from COMPANY%20LAW%20ACT.pdf ?
>>> That would mean that the mouse-driver is doing O.C.R.   ?!
>> 
>> Why would you think the mouse driver is doing OCR?
>> 
>> A PDF file can contain both text and images.  It is common when
>> scanning paper documents to turn them into a so-called "searchable PDF"
>> that contains the scanned image of the page overlaid on top of the
>> (OCRed) text.  So what you see visually is the (possibly blurry)
>> picture, while what the mouse is copying (and pdftotext is extracting)
>> is the text that's hidden underneath.
>> 
>> Adobe's own Acrobat software can create such "searchable PDF" files.
>> I'm sure there are other tools, too.
> 
This is TOO MUCH!!
You mean they send both the originally keyed-in text (the part that
pdftotext can extract) AND the graphical image of the crumpled paper
version, overlaid on top? What's the aim of such an expensive deception?

> Yeah, it's kinda interesting: when your workstation's bogged down and the
> PDF is big, you might see the OCR text render first, and then the image
> renders and covers it up. Or maybe that only happens in the browser when
> it's still downloading and hasn't gotten to the image yet? Haven't seen it
> happen in a while.
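
For anyone who wants to see the two layers for themselves, here is a rough
sketch in Python. It assumes you already have a page's content stream as
decoded text (real PDFs usually compress it, so you would have to
decompress first), and it only looks for three standard PDF operators:
BT/ET (a text object), "3 Tr" (text rendering mode 3, i.e. invisible text,
which as far as I know is one common way OCR tools hide the text layer),
and "Do" (paint an XObject, which on a scanned page is normally the
image). The names classify_page and SAMPLE and the sample stream are made
up for illustration; this is not how xpdf or pdftotext work internally.

```python
import re

# Hypothetical, hand-written content stream for one scanned page:
# first the page image is painted, then an invisible OCR text run.
SAMPLE = """\
q 612 0 0 792 0 0 cm /Im0 Do Q
BT 3 Tr /F1 12 Tf 72 700 Td (COMPANIES ACT) Tj ET
"""

def classify_page(content: str) -> dict:
    """Report which layers a decoded page content stream appears to hold."""
    # Text objects are bracketed by the BT ... ET operators.
    has_text = re.search(r"\bBT\b.*?\bET\b", content, re.S) is not None
    # "3 Tr" selects rendering mode 3: text is neither filled nor stroked.
    invisible_text = re.search(r"\b3\s+Tr\b", content) is not None
    # "/Name Do" paints an XObject (an image, or possibly a form).
    paints_xobject = re.search(r"/\S+\s+Do\b", content) is not None
    # Naive extraction of literal strings shown with Tj; this ignores
    # escaped parentheses, hex strings and font encodings.
    strings = re.findall(r"\((.*?)\)\s*Tj", content)
    return {
        "has_text": has_text,
        "invisible_text": invisible_text,
        "paints_xobject": paints_xobject,
        "strings": strings,
    }

if __name__ == "__main__":
    print(classify_page(SAMPLE))
```

A page that reports both a painted XObject and a text object is a good
candidate for a "searchable PDF": the picture is what you see, and the
text object is what the mouse selection and pdftotext pick up.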