Re: pdf & O.C.R ?

From: John-Paul Stewart <jpstewart@sympatico.ca>
Newsgroups: comp.os.linux.misc
Subject: Re: pdf & O.C.R ?
Date: 2015-05-23 20:46 -0400
Message-ID: <bjj73c-fim.ln1@mail.binaryfoundry.ca>
References: <pan.2015.05.23.07.50.46@gmail.com>


On 23/05/15 03:49 AM, Unknown wrote:
> I'm confused and disturbed that xpdf of:
> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>    is perfect to the pixel, with maximum magnification [400%],
>    which is expected, since it's computer-font generated, whereas:
> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
>    shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>
> And scanned copies of papers are apparently normal.
>
> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
> from COMPANY%20LAW%20ACT.pdf ?
> That would mean that the mouse-driver is doing O.C.R.   ?!

Why would you think the mouse driver is doing OCR?

A PDF file can contain both text and images.  It is common when scanning 
paper documents to turn them into a so-called "searchable PDF" that 
contains the scanned image of the page overlaid on top of the (OCRed) 
text.  So what you see visually is the (possibly blurry) picture, while 
what the mouse is copying (and pdftotext is extracting) is the text 
that's hidden underneath.
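
If you want to check whether a given PDF carries such a hidden text layer, something like the following rough sketch might do. It assumes pdftotext (from xpdf or poppler) is on your PATH; the helper names and the 20-character threshold are my own illustrative choices, not anything standard, and a pure image scan may still yield a few stray characters.

```python
import shutil
import subprocess
import sys


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic: call it a text layer if extraction yields at least
    min_chars non-whitespace characters (threshold is an assumption)."""
    # str.split() with no arguments also drops the form feeds that
    # pdftotext emits between pages.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return whatever text it finds; "-" sends
    the output to stdout instead of a .txt file."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    text = extract_text(sys.argv[1])
    print("text layer present" if has_text_layer(text) else "image only")
```

Run it with the PDF's filename as the only argument. A scanned file that has been through OCR should report a text layer even though what you see on screen is only the picture.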

Adobe's own Acrobat software can create such "searchable PDF" files. 
I'm sure there are other tools, too.
