| From | John-Paul Stewart <jpstewart@sympatico.ca> |
|---|---|
| Newsgroups | comp.os.linux.misc |
| Subject | Re: pdf & O.C.R ? |
| Date | 2015-05-29 20:31 -0400 |
| Message-ID | <rvcn3c-jg9.ln1@mail.binaryfoundry.ca> |
| References | <pan.2015.05.23.07.50.46@gmail.com> <bjj73c-fim.ln1@mail.binaryfoundry.ca> <pan.2015.05.27.17.12.03@gmail.com> |
On 27/05/15 01:10 PM, Unknown wrote:
> On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
>
>> On 23/05/15 03:49 AM, Unknown wrote:
>>> I'm confused and disturbed that xpdf of:
>>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>> is perfect to the pixel, with maximum magnification [400%], which is
>>> expected, since it's computer-font generated, whereas:
>>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%20LAW%20ACT.pdf
>>> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>>
>>> And scanned copies of papers are apparently normal.
>>>
>>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>>> from COMPANY%20LAW%20ACT.pdf ?
>>> That would mean that the mouse-driver is doing O.C.R. ?!
>>
>> Why would you think the mouse driver is doing OCR?
>>
> OBVIOUSLY from my description there's pdftotext happening via mouse.

No, not at all. The mouse is merely selecting/copying some text that is
already there.

>> A PDF file can contain both text and images. It is common when scanning
>> paper documents to turn them into a so-called "searchable PDF" that
>> contains the scanned image of the page overlaid on top of the (OCRed)
>> text. So what you see visually is the (possibly blurry) picture, while
>> what the mouse is copying (and pdftotext is extracting) is the text
>> that's hidden underneath.
>>
>> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
>> sure there are other tools, too.
>
> What extreme deception. There's a layer of pixel-perfect pdftotext-able,
> covered by the blurry photo-image ?!

Again, no. There's no deception and nothing "pixel-perfect" about the
hidden text. It will have lost nearly all formatting during the OCR
process. The text itself might be plain wrong (about 90% accurate, IME),
due to the complexities of OCR on poor quality images. The image of a
scanned document will be more reliable for a human reader than the OCRed
text.

The text is provided in (some, so-called "searchable") PDFs as a convenience for searching within the PDF file and isn't guaranteed to be accurate. (Talking only about PDFs created from scanned documents.)
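
If you want to check for yourself whether a given scanned PDF carries such a hidden text layer, something along these lines might do. It is only a rough sketch: it assumes the pdftotext command (from xpdf or poppler) is on your PATH, and the function names extract_text and has_text_layer are made up for illustration. It says nothing about how accurate the hidden text is, only whether any is there.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return whatever text pdftotext finds in the PDF.

    Assumes the pdftotext command is installed; "-" as the output
    file name sends the extracted text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(extracted: str) -> bool:
    """True if the extracted output holds anything besides whitespace.

    An image-only scan typically yields nothing but blank lines and
    form feeds (page breaks), which strip() removes.
    """
    return bool(extracted.strip())


# Usage (hypothetical file name):
#   text = extract_text("scanned.pdf")
#   print("text layer present" if has_text_layer(text) else "image only")
```

A PDF that passes this check still needs proofreading against the page image before you trust the extracted text.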
pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-23 07:49 +0000
Re: pdf & O.C.R ? Bob Tennent <BobT@cs.queensu.ca> - 2015-05-23 11:13 +0000
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-27 17:11 +0000
Re: pdf & O.C.R ? John-Paul Stewart <jpstewart@sympatico.ca> - 2015-05-23 20:46 -0400
Re: pdf & O.C.R ? Joe Beanfish <joebeanfish@nospam.duh> - 2015-05-26 13:26 +0000
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-06-13 13:29 +0000
Re: pdf & O.C.R ? Robert Heller <heller@deepsoft.com> - 2015-06-13 12:52 -0500
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-27 17:10 +0000
Re: pdf & O.C.R ? John-Paul Stewart <jpstewart@sympatico.ca> - 2015-05-29 20:31 -0400