From: Unknown
Newsgroups: comp.os.linux.misc
Subject: Re: pdf & O.C.R ?
Date: Wed, 27 May 2015 17:10:59 +0000 (UTC)

On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:

> On 23/05/15 03:49 AM, Unknown wrote:
>> I'm confused and disturbed that xpdf of:
>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>> is perfect to the pixel, with maximum magnification [400%], which is
>> expected, since it's computer-font generated, whereas:
>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
>> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>
>> And scanned copies of papers are apparently normal.
>>
>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>> from COMPANY%20LAW%20ACT.pdf ?
>> That would mean that the mouse-driver is doing O.C.R. ?!
>
> Why would you think the mouse driver is doing OCR?
>

OBVIOUSLY from my description there's pdftotext happening via mouse.

> A PDF file can contain both text and images.  It is common when scanning
> paper documents to turn them into a so-called "searchable PDF" that
> contains the scanned image of the page overlaid on top of the (OCRed)
> text.  So what you see visually is the (possibly blurry) picture, while
> what the mouse is copying (and pdftotext is extracting) is the text
> that's hidden underneath.
>
> Adobe's own Acrobat software can create such "searchable PDF" files.  I'm
> sure there are other tools, too.

What extreme deception.
There's a layer of pixel-perfect pdftotext-able, covered by the blurry
photo-image ?!
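
For anyone who wants to poke at the idea, here is a minimal sketch of how a hidden text layer can sit on the same page as a picture. It is only an illustration, not a claim about how the COMPANY LAW ACT file was actually produced. One common way to hide the text is PDF text rendering mode 3 (`3 Tr`), which draws glyphs with neither fill nor stroke, so the words are in the content stream but nothing shows on screen. The function name `build_searchable_page` is made up for this example, a grey rectangle stands in for the scanned image, and the text is assumed to be Latin-1 only.

```python
def build_searchable_page(hidden_text: str) -> bytes:
    """Return a one-page PDF whose text is drawn in render mode 3 (invisible).

    Hypothetical helper for illustration; assumes hidden_text is Latin-1.
    """
    # Escape the characters that are special inside a PDF literal string.
    escaped = (hidden_text.replace("\\", "\\\\")
                          .replace("(", "\\(")
                          .replace(")", "\\)"))
    content = (
        "0.8 g 50 680 400 60 re f\n"     # grey box standing in for the scan
        "BT /F1 24 Tf 3 Tr 60 700 Td "   # 3 Tr = invisible text
        f"({escaped}) Tj ET\n"
    ).encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"endstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []                          # byte offset of each object, for xref
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # mandatory free entry for object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    with open("hidden.pdf", "wb") as fh:
        fh.write(build_searchable_page("COMPANIES ACT"))
```

Opening hidden.pdf in a viewer should show only the grey box, while `pdftotext hidden.pdf -` or a mouse selection over the box should still give back the words. With mode 3 it makes little difference whether the text is drawn before or after the picture, since it is never painted either way.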