| From | John-Paul Stewart <jpstewart@sympatico.ca> |
|---|---|
| Newsgroups | comp.os.linux.misc |
| Subject | Re: pdf & O.C.R ? |
| Date | 2015-05-23 20:46 -0400 |
| Message-ID | <bjj73c-fim.ln1@mail.binaryfoundry.ca> |
| References | <pan.2015.05.23.07.50.46@gmail.com> |
On 23/05/15 03:49 AM, Unknown wrote:
> I'm confused and disturbed that xpdf of:
> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
> is perfect to the pixel, with maximum magnification [400%],
> which is expected, since it's computer-font generated, whereas:
> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
> 20LAW%20ACT.pdf
> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>
> And scanned copies of papers are apparently normal.
>
> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
> from COMPANY%20LAW%20ACT.pdf ?
> That would mean that the mouse-driver is doing O.C.R. ?!

Why would you think the mouse driver is doing OCR?

A PDF file can contain both text and images. It is common when scanning paper documents to turn them into a so-called "searchable PDF" that contains the scanned image of the page overlaid on top of the (OCRed) text. So what you see visually is the (possibly blurry) picture, while what the mouse is copying (and pdftotext is extracting) is the text that's hidden underneath.

Adobe's own Acrobat software can create such "searchable PDF" files. I'm sure there are other tools, too.
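If you want to check for yourself whether a given PDF carries a hidden text layer, something like the following rough sketch might help. It assumes pdftotext is installed and on your PATH, and relies on its convention that "-" as the output file means standard output. The function names and the 20-character threshold are my own illustrative choices, not anything standard.

```python
import shutil
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return whatever text pdftotext finds in the PDF (may be empty)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-" as the output file asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic: a bare page image yields (almost) no visible characters.

    min_chars is an arbitrary cutoff chosen for illustration.
    """
    visible = [c for c in extracted if not c.isspace()]
    return len(visible) >= min_chars
```

Calling has_text_layer(extract_text("some.pdf")) on a plain scan with no OCR layer should come back False, since pdftotext has little or nothing to extract. A "searchable PDF" should come back True even though the page looks like a photograph.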
pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-23 07:49 +0000
Re: pdf & O.C.R ? Bob Tennent <BobT@cs.queensu.ca> - 2015-05-23 11:13 +0000
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-27 17:11 +0000
Re: pdf & O.C.R ? John-Paul Stewart <jpstewart@sympatico.ca> - 2015-05-23 20:46 -0400
Re: pdf & O.C.R ? Joe Beanfish <joebeanfish@nospam.duh> - 2015-05-26 13:26 +0000
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-06-13 13:29 +0000
Re: pdf & O.C.R ? Robert Heller <heller@deepsoft.com> - 2015-06-13 12:52 -0500
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-27 17:10 +0000
Re: pdf & O.C.R ? John-Paul Stewart <jpstewart@sympatico.ca> - 2015-05-29 20:31 -0400