


Re: pdf & O.C.R ?

From Unknown <dog@gmail.com>
Newsgroups comp.os.linux.misc
Subject Re: pdf & O.C.R ?
Date 2015-05-27 17:10 +0000
Organization A noiseless patient Spider
Message-ID <pan.2015.05.27.17.12.03@gmail.com>
References <pan.2015.05.23.07.50.46@gmail.com> <bjj73c-fim.ln1@mail.binaryfoundry.ca>



On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:

> On 23/05/15 03:49 AM, Unknown wrote:
>> I'm confused and disturbed that xpdf of:
>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>    is perfect to the pixel, with maximum magnification [400%], which is
>>    expected, since it's computer-font generated, whereas:
>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
>>    shows blotches and fibers, as if it's a photo of a paper copy.
>>
>> And scanned copies of papers are apparently normal.
>>
>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>> from COMPANY%20LAW%20ACT.pdf ?
>> That would mean that the mouse-driver is doing O.C.R.   ?!
> 
> Why would you think the mouse driver is doing OCR?
> 
OBVIOUSLY, from my description, something equivalent to pdftotext is
happening when I copy with the mouse.

> A PDF file can contain both text and images.  It is common when scanning
> paper documents to turn them into a so-called "searchable PDF" that
> contains the scanned image of the page overlaid on top of the (OCRed)
> text.  So what you see visually is the (possibly blurry) picture, while
> what the mouse is copying (and pdftotext is extracting) is the text
> that's hidden underneath.
> 
> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
> sure there are other tools, too.

What extreme deception. So there's a layer of pixel-perfect, pdftotext-able
text, covered by the blurry photo image?!
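
For anyone who wants to see the mechanism described in the quoted reply, here
is a rough sketch in Python. It is not what Acrobat or xpdf actually do
internally. It only builds a simplified page content stream in which the text
is drawn with rendering mode 3 (invisible) and a page-sized image is painted
afterwards, then shows that a naive extractor still finds the text because it
reads the text operators and ignores the image. The names build_page_content,
extract_text, /F1 and /Im0 are made up for illustration, and only the three
backslash escapes for "\", "(" and ")" are handled.

```python
import re


def build_page_content(ocr_text: str) -> bytes:
    """Return a simplified PDF content stream: hidden text, then an image.

    Assumption: the page resources would define font /F1 and image /Im0;
    they are hypothetical names, and no complete PDF file is produced here.
    """
    escaped = (
        ocr_text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )
    parts = [
        "BT",                  # begin text object
        "/F1 12 Tf",           # select font F1 at 12 pt
        "3 Tr",                # rendering mode 3: neither fill nor stroke
        "72 720 Td",           # move to the text position
        f"({escaped}) Tj",     # show the (invisible) OCR text
        "ET",                  # end text object
        "q",                   # save graphics state
        "612 0 0 792 0 0 cm",  # scale the unit square to a Letter-size page
        "/Im0 Do",             # paint the scanned page image
        "Q",                   # restore graphics state
    ]
    return "\n".join(parts).encode("latin-1")


# Matches a literal string operand followed by the Tj (show text) operator.
_TJ = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_text(content: bytes) -> str:
    """Collect the strings shown by Tj operators, ignoring everything else."""
    chunks = []
    for match in _TJ.finditer(content):
        raw = re.sub(rb"\\(.)", rb"\1", match.group(1))  # undo the escapes
        chunks.append(raw.decode("latin-1"))
    return "\n".join(chunks)


if __name__ == "__main__":
    stream = build_page_content("Hello (scanned) page")
    print(extract_text(stream))
```

A viewer rendering this stream would show only the image, while anything that
walks the text operators (mouse selection, pdftotext and the like) still
recovers the string. Real searchable PDFs are more involved (font encodings,
compressed streams, per-word positioning), so treat this as an illustration of
the idea only.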


