From: John-Paul Stewart
Newsgroups: comp.os.linux.misc
Subject: Re: pdf & O.C.R ?
Date: Fri, 29 May 2015 20:31:23 -0400

On 27/05/15 01:10 PM, Unknown wrote:
> On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
>
>> On 23/05/15 03:49 AM, Unknown wrote:
>>> I'm confused and disturbed that xpdf of:
>>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>> is perfect to the pixel, with maximum magnification [400%], which is
>>> expected, since it's computer-font generated, whereas:
>>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
>>> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>>
>>> And scanned copies of papers are apparently normal.
>>>
>>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>>> from COMPANY%20LAW%20ACT.pdf ?
>>> That would mean that the mouse-driver is doing O.C.R. ?!
>>
>> Why would you think the mouse driver is doing OCR?
>>
> OBVIOUSLY from my description there's pdftotext happening via mouse.

No, not at all. The mouse is merely selecting/copying some text that is
already there.

>> A PDF file can contain both text and images. It is common when scanning
>> paper documents to turn them into a so-called "searchable PDF" that
>> contains the scanned image of the page overlaid on top of the (OCRed)
>> text.
>> So what you see visually is the (possibly blurry) picture, while
>> what the mouse is copying (and pdftotext is extracting) is the text
>> that's hidden underneath.
>>
>> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
>> sure there are other tools, too.
>
> What extreme deception. There's a layer of pixel-perfect pdftotext-able,
> covered by the blurry photo-image ?!

Again, no. There's no deception and nothing "pixel-perfect" about the
hidden text. It will have lost nearly all formatting during the OCR
process. The text itself might be plain wrong (about 90% accurate, IME),
due to the complexities of OCR on poor quality images.

The image of a scanned document will be more reliable for a human reader
than the OCRed text. The text is provided in (some, so-called
"searchable") PDFs as a convenience for searching within the PDF file
and isn't guaranteed to be accurate. (Talking only about PDFs created
from scanned documents.)
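
The "about 90%" figure is only a rough personal estimate. To put a
number on a particular file, one could compare the text extracted from
it (with pdftotext, say) against a hand-typed copy of the same page.
Below is a minimal Python sketch, assuming both texts are already
available as strings. The function name char_accuracy and the sample
strings are made up for illustration, and counting matching characters
with difflib is just one simple way to score it.

```python
import difflib


def char_accuracy(reference: str, ocr_text: str) -> float:
    """Return the share of reference characters that the OCR text reproduces.

    Runs of whitespace are collapsed to single spaces before comparing.
    """
    ref = " ".join(reference.split())
    ocr = " ".join(ocr_text.split())
    if not ref:
        # Nothing to reproduce: perfect only if the OCR text is empty too.
        return 1.0 if not ocr else 0.0
    matcher = difflib.SequenceMatcher(None, ref, ocr, autojunk=False)
    # Sum the lengths of all in-order matching runs between the two strings.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


if __name__ == "__main__":
    # Hypothetical sample: what the page says vs. what a hidden text layer might hold.
    typed = "The company shall keep a register of its members."
    extracted = "The cornpany shall keep a reglster of its rnembers."
    print(f"{char_accuracy(typed, extracted):.0%}")
```

Whitespace is collapsed on purpose, since the hidden text has usually
lost its layout anyway and only the characters themselves are worth
comparing.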