From: John-Paul Stewart
Newsgroups: comp.os.linux.misc
Subject: Re: pdf & O.C.R ?
Date: Sat, 23 May 2015 20:46:03 -0400

On 23/05/15 03:49 AM, Unknown wrote:
> I'm confused and disturbed that xpdf of:
> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
> is perfect to the pixel, with maximum magnification [400%],
> which is expected, since it's computer-font generated, whereas:
> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>
> And scanned copies of papers are apparently normal.
>
> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
> from COMPANY%20LAW%20ACT.pdf ?
> That would mean that the mouse-driver is doing O.C.R. ?!

Why would you think the mouse driver is doing OCR?

A PDF file can contain both text and images. It is common when scanning
paper documents to turn them into a so-called "searchable PDF" that
contains the scanned image of the page overlaid on top of the (OCRed)
text. So what you see visually is the (possibly blurry) picture, while
what the mouse is copying (and pdftotext is extracting) is the text
that's hidden underneath.

Adobe's own Acrobat software can create such "searchable PDF" files.
I'm sure there are other tools, too.
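
If you want to check for yourself whether a file carries both layers, here is a rough sketch using only the Python standard library. It is a heuristic, not a real PDF parser: the function name describe_pdf and the regular expressions are my own illustration, and the scan assumes streams are either uncompressed or Flate-compressed. It looks for an image XObject (/Subtype /Image) and for text-showing operators (Tj or TJ after BT) in the page content.

```python
import re
import zlib

# Hypothetical helper names; this is a heuristic scan, not a PDF parser.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image")
TEXT_RE = re.compile(rb"BT\b.*?\b(?:Tj|TJ)\b", re.DOTALL)


def decoded_streams(pdf_bytes):
    """Return each stream body, inflated when it is Flate (zlib) data."""
    bodies = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes before 'endstream'
            bodies.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            bodies.append(raw)  # not zlib data; scan the bytes as they are
    return bodies


def describe_pdf(pdf_bytes):
    """Guess whether the file holds image objects and text-showing operators."""
    streams = decoded_streams(pdf_bytes)
    has_image = IMAGE_RE.search(pdf_bytes) is not None or any(
        IMAGE_RE.search(body) for body in streams
    )
    has_text = any(TEXT_RE.search(body) for body in streams)
    return {"has_image": has_image, "has_text": has_text}


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(describe_pdf(handle.read()))
```

Run it with the path of a downloaded PDF as the only argument. A scan with an OCR layer would be expected to report both flags as true, while a plain photo-of-paper PDF would report only the image. Binary image data can occasionally trip the text pattern, so treat the answer as a hint and confirm with pdftotext.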