Groups > comp.os.linux.misc > #14841 > unrolled thread
| Started by | Unknown <dog@gmail.com> |
|---|---|
| First post | 2015-05-23 07:49 +0000 |
| Last post | 2015-05-29 20:31 -0400 |
| Articles | 9 — 5 participants |
pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-23 07:49 +0000
Re: pdf & O.C.R ? Bob Tennent <BobT@cs.queensu.ca> - 2015-05-23 11:13 +0000
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-27 17:11 +0000
Re: pdf & O.C.R ? John-Paul Stewart <jpstewart@sympatico.ca> - 2015-05-23 20:46 -0400
Re: pdf & O.C.R ? Joe Beanfish <joebeanfish@nospam.duh> - 2015-05-26 13:26 +0000
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-06-13 13:29 +0000
Re: pdf & O.C.R ? Robert Heller <heller@deepsoft.com> - 2015-06-13 12:52 -0500
Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-27 17:10 +0000
Re: pdf & O.C.R ? John-Paul Stewart <jpstewart@sympatico.ca> - 2015-05-29 20:31 -0400
| From | Unknown <dog@gmail.com> |
|---|---|
| Date | 2015-05-23 07:49 +0000 |
| Subject | pdf & O.C.R ? |
| Message-ID | <pan.2015.05.23.07.50.46@gmail.com> |
I'm confused and disturbed that xpdf of:
http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
is perfect to the pixel, with maximum magnification [400%],
which is expected, since it's computer-font generated, whereas:
http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
20LAW%20ACT.pdf
shows blotchy and fibers as if it's a photo-of-a-paper-copy.

And scanned copies of papers are apparently normal.

BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
from COMPANY%20LAW%20ACT.pdf ?
That would mean that the mouse-driver is doing O.C.R. ?!

And mc's viewer [which uses <pdftotext> ] reads this text.

Is this some new O.C.R. which I could use on jpg-ed pages of text?

==Thanks for any answers.
| From | Bob Tennent <BobT@cs.queensu.ca> |
|---|---|
| Date | 2015-05-23 11:13 +0000 |
| Message-ID | <slrnmm0o6n.et2.BobT@linus.cs.queensu.ca> |
| In reply to | #14841 |
On Sat, 23 May 2015 07:49:37 +0000 (UTC), Unknown wrote:

> I'm confused and disturbed that xpdf of:
> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
> is perfect to the pixel, with maximum magnification [400%],
> which is expected, since it's computer-font generated, whereas:
> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
> 20LAW%20ACT.pdf
> shows blotchy and fibers as if it's a photo-of-a-paper-copy.

That second link is broken. Google found

http://www.legislation.govt.nz/act/public/1993/0105/latest/096be8ed8109c926.pdf

which is fine at 400%.

Bob T,
| From | Unknown <dog@gmail.com> |
|---|---|
| Date | 2015-05-27 17:11 +0000 |
| Message-ID | <pan.2015.05.27.17.11.57@gmail.com> |
| In reply to | #14843 |
On Sat, 23 May 2015 11:13:27 +0000, Bob Tennent wrote:

> On Sat, 23 May 2015 07:49:37 +0000 (UTC), Unknown wrote:
> > I'm confused and disturbed that xpdf of:
> > http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
> > is perfect to the pixel, with maximum magnification [400%], which
> > is expected, since it's computer-font generated, whereas:
> > http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
> > 20LAW%20ACT.pdf
> > shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>
> That second link is broken. Google found
>
> http://www.legislation.govt.nz/act/public/1993/0105/latest/096be8ed8109c926.pdf
>
> which is fine at 400%.
>
> Bob T,

Did you see that the long-URL is folded over 2 lines -- now ?

John-Paul Stewart gives a spooky explanation.
| From | John-Paul Stewart <jpstewart@sympatico.ca> |
|---|---|
| Date | 2015-05-23 20:46 -0400 |
| Message-ID | <bjj73c-fim.ln1@mail.binaryfoundry.ca> |
| In reply to | #14841 |
On 23/05/15 03:49 AM, Unknown wrote:

> I'm confused and disturbed that xpdf of:
> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
> is perfect to the pixel, with maximum magnification [400%],
> which is expected, since it's computer-font generated, whereas:
> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
> 20LAW%20ACT.pdf
> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>
> And scanned copies of papers are apparently normal.
>
> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
> from COMPANY%20LAW%20ACT.pdf ?
> That would mean that the mouse-driver is doing O.C.R. ?!

Why would you think the mouse driver is doing OCR?

A PDF file can contain both text and images. It is common when scanning paper documents to turn them into a so-called "searchable PDF" that contains the scanned image of the page overlaid on top of the (OCRed) text. So what you see visually is the (possibly blurry) picture, while what the mouse is copying (and pdftotext is extracting) is the text that's hidden underneath.

Adobe's own Acrobat software can create such "searchable PDF" files. I'm sure there are other tools, too.
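One way to check whether a given PDF carries such a hidden text layer is to run pdftotext on it and see whether anything comes out. The following is a minimal Python sketch, not a definitive test: it assumes the `pdftotext` command is installed and on the PATH, and the helper names and the 20-character threshold are made up for illustration.

```python
import shutil
import subprocess


def extract_text_layer(pdf_path: str) -> str:
    """Return whatever text pdftotext finds in the PDF.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def looks_searchable(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic: enough non-blank characters suggests a text layer exists.

    The 20-character threshold is an arbitrary assumption, not a standard.
    """
    visible = "".join(extracted.split())  # drops spaces, newlines, form feeds
    return len(visible) >= min_chars
```

Calling `looks_searchable(extract_text_layer("scan.pdf"))` on a hypothetical `scan.pdf` would return False for an image-only scan and True for a scan that has been through OCR. It says nothing about how accurate that text is.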
| From | Joe Beanfish <joebeanfish@nospam.duh> |
|---|---|
| Date | 2015-05-26 13:26 +0000 |
| Message-ID | <mk1sam$r2l$1@dont-email.me> |
| In reply to | #14850 |
On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:

> On 23/05/15 03:49 AM, Unknown wrote:
>> I'm confused and disturbed that xpdf of:
>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>> is perfect to the pixel, with maximum magnification [400%], which is
>> expected, since it's computer-font generated, whereas:
>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>> 20LAW%20ACT.pdf
>> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>
>> And scanned copies of papers are apparently normal.
>>
>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>> from COMPANY%20LAW%20ACT.pdf ?
>> That would mean that the mouse-driver is doing O.C.R. ?!
>
> Why would you think the mouse driver is doing OCR?
>
> A PDF file can contain both text and images. It is common when scanning
> paper documents to turn them into a so-called "searchable PDF" that
> contains the scanned image of the page overlaid on top of the (OCRed)
> text. So what you see visually is the (possibly blurry) picture, while
> what the mouse is copying (and pdftotext is extracting) is the text
> that's hidden underneath.
>
> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
> sure there are other tools, too.

Yeah, it's kinda interesting: when your workstation's bogged down and the PDF is big, you might see the OCR text render first, and then the image renders and covers it up. Or maybe that only happens in the browser, when it's still downloading and hasn't gotten to the image yet? Haven't seen it happen in a while.
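The "text first, image on top" ordering can be pictured as a PDF page content stream in which the text operators come before the image operator, since later painting covers earlier painting. The Python sketch below only builds such a stream as a string. It is an illustration under assumptions, not how any particular scanner or OCR tool writes its files: the resource names `/F1` and `/Im0`, the 612x792 page size, and the text position are all hypothetical. Some tools are understood to draw the text in an invisible rendering mode instead, which is not shown here.

```python
def searchable_page_stream(ocr_text: str, width: int = 612, height: int = 792) -> str:
    """Build an illustrative content stream: OCR text first, scanned image second."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = (
        ocr_text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )
    # BT/ET bracket a text object; Tf picks the font, Td positions, Tj shows text.
    text_ops = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET"
    # cm scales the unit square to the page size; Do paints the image there.
    image_ops = f"q {width} 0 0 {height} 0 0 cm /Im0 Do Q"
    # Operators run in order, so the image is painted over the text.
    return text_ops + "\n" + image_ops
```

A viewer that has drawn the first line but not yet the second would briefly show the bare OCR text, which is consistent with the effect described above.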
| From | Unknown <dog@gmail.com> |
|---|---|
| Date | 2015-06-13 13:29 +0000 |
| Message-ID | <pan.2015.06.13.13.30.56@gmail.com> |
| In reply to | #14876 |
On Tue, 26 May 2015 13:26:46 +0000, Joe Beanfish wrote:

> On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
>
>> On 23/05/15 03:49 AM, Unknown wrote:
>>> I'm confused and disturbed that xpdf of:
>>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>> is perfect to the pixel, with maximum magnification [400%], which
>>> is expected, since it's computer-font generated, whereas:
>>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>>> 20LAW%20ACT.pdf
>>> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>>
>>> And scanned copies of papers are apparently normal.
>>>
>>> BUT!! How is it that xpdf allows me to extract the text, via
>>> mouse-copy from COMPANY%20LAW%20ACT.pdf ?
>>> That would mean that the mouse-driver is doing O.C.R. ?!
>>
>> Why would you think the mouse driver is doing OCR?
>>
>> A PDF file can contain both text and images. It is common when
>> scanning paper documents to turn them into a so-called "searchable PDF"
>> that contains the scanned image of the page overlaid on top of the
>> (OCRed) text. So what you see visually is the (possibly blurry)
>> picture, while what the mouse is copying (and pdftotext is extracting)
>> is the text that's hidden underneath.
>>
>> Adobe's own Acrobat software can create such "searchable PDF" files.
>> I'm sure there are other tools, too.
>

This is TOO-MUCH!!
You mean they send the original-keyed-in-pdftotextable, AND the graphical
image of the crumpled-paper-version <overlaid>. What's the aim of such
expensive deception?

> Yeah, It's kinda interesting when your workstation's bogged down and the
> pdf is big you might see the OCR text render first, then the image will
> render, covering it up. Or maybe that only happens in the browser when
> it's downloading and hasn't gotten to the image yet? Haven't seen it
> happen in a while.
| From | Robert Heller <heller@deepsoft.com> |
|---|---|
| Date | 2015-06-13 12:52 -0500 |
| Message-ID | <CbudnbZdRv1Q8OHInZ2dnUU7-SmdnZ2d@giganews.com> |
| In reply to | #14921 |
At Sat, 13 Jun 2015 13:29:22 +0000 (UTC) Unknown <dog@gmail.com> wrote:
>
> On Tue, 26 May 2015 13:26:46 +0000, Joe Beanfish wrote:
>
> > On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
> >
> >> On 23/05/15 03:49 AM, Unknown wrote:
> >>> I'm confused and disturbed that xpdf of:
> >>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
> >>> is perfect to the pixel, with maximum magnification [400%], which
> >>> is expected, since it's computer-font generated, whereas:
> >>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY
> %
> >>> 20LAW%20ACT.pdf
> >>> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
> >>>
> >>> And scanned copies of papers are apparently normal.
> >>>
> >>> BUT!! How is it that xpdf allows me to extract the text, via
> >>> mouse-copy from COMPANY%20LAW%20ACT.pdf ?
> >>> That would mean that the mouse-driver is doing O.C.R. ?!
> >>
> >> Why would you think the mouse driver is doing OCR?
> >>
> >> A PDF file can contain both text and images. It is common when
> >> scanning paper documents to turn them into a so-called "searchable PDF"
> >> that contains the scanned image of the page overlaid on top of the
> >> (OCRed) text. So what you see visually is the (possibly blurry)
> >> picture, while what the mouse is copying (and pdftotext is extracting)
> >> is the text that's hidden underneath.
> >>
> >> Adobe's own Acrobat software can create such "searchable PDF" files.
> >> I'm sure there are other tools, too.
> >
> This is TOO-MUCH!!
> You mean they send the original-keyed-in-pdftotextable, AND the graphical
> image of the crumpled-paper-version <overlaid>. What's the aim of such
> expensive deception?
The text is not necessarily keyed in. It could in fact be 'crudely' OCRed,
which means it won't be accurate. The 'original' has additional features, like
signatures, seals, or original handwriting. Remember handwriting? You know,
back in the old days before people had computer keyboards they actually wrote
stuff down with these things called 'pens'. Also, sometimes the
'crumpled-paper-version' might include non-textual content -- drawings, maps,
charts, etc.
>
> > Yeah, It's kinda interesting when your workstation's bogged down and the
> > pdf is big you might see the OCR text render first, then the image will
> > render, covering it up. Or maybe that only happens in the browser when
> > it's downloading and hasn't gotten to the image yet? Haven't seen it
> > happen in a while.
>
>
--
Robert Heller -- 978-544-6933
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
heller@deepsoft.com -- Webhosting Services
| From | Unknown <dog@gmail.com> |
|---|---|
| Date | 2015-05-27 17:10 +0000 |
| Message-ID | <pan.2015.05.27.17.12.03@gmail.com> |
| In reply to | #14850 |
On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:

> On 23/05/15 03:49 AM, Unknown wrote:
>> I'm confused and disturbed that xpdf of:
>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>> is perfect to the pixel, with maximum magnification [400%], which is
>> expected, since it's computer-font generated, whereas:
>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>> 20LAW%20ACT.pdf
>> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>
>> And scanned copies of papers are apparently normal.
>>
>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>> from COMPANY%20LAW%20ACT.pdf ?
>> That would mean that the mouse-driver is doing O.C.R. ?!
>
> Why would you think the mouse driver is doing OCR?
>

OBVIOUSLY from my description there's pdftotext happening via mouse.

> A PDF file can contain both text and images. It is common when scanning
> paper documents to turn them into a so-called "searchable PDF" that
> contains the scanned image of the page overlaid on top of the (OCRed)
> text. So what you see visually is the (possibly blurry) picture, while
> what the mouse is copying (and pdftotext is extracting) is the text
> that's hidden underneath.
>
> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
> sure there are other tools, too.

What extreme deception. There's a layer of pixel-perfect pdftotext-able, covered by the blurry photo-image ?!
| From | John-Paul Stewart <jpstewart@sympatico.ca> |
|---|---|
| Date | 2015-05-29 20:31 -0400 |
| Message-ID | <rvcn3c-jg9.ln1@mail.binaryfoundry.ca> |
| In reply to | #14880 |
On 27/05/15 01:10 PM, Unknown wrote:

> On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
>
>> On 23/05/15 03:49 AM, Unknown wrote:
>>> I'm confused and disturbed that xpdf of:
>>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>> is perfect to the pixel, with maximum magnification [400%], which is
>>> expected, since it's computer-font generated, whereas:
>>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>>> 20LAW%20ACT.pdf
>>> shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>>
>>> And scanned copies of papers are apparently normal.
>>>
>>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>>> from COMPANY%20LAW%20ACT.pdf ?
>>> That would mean that the mouse-driver is doing O.C.R. ?!
>>
>> Why would you think the mouse driver is doing OCR?
>>
> OBVIOUSLY from my description there's pdftotext happening via mouse.

No, not at all. The mouse is merely selecting/copying some text that is already there.

>> A PDF file can contain both text and images. It is common when scanning
>> paper documents to turn them into a so-called "searchable PDF" that
>> contains the scanned image of the page overlaid on top of the (OCRed)
>> text. So what you see visually is the (possibly blurry) picture, while
>> what the mouse is copying (and pdftotext is extracting) is the text
>> that's hidden underneath.
>>
>> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
>> sure there are other tools, too.
>
> What extreme deception. There's a layer of pixel-perfect pdftotext-able,
> covered by the blurry photo-image ?!

Again, no. There's no deception and nothing "pixel-perfect" about the hidden text. It will have lost nearly all formatting during the OCR process. The text itself might be plain wrong (about 90% accurate, IME), due to the complexities of OCR on poor quality images. The image of a scanned document will be more reliable for a human reader than the OCRed text.

The text is provided in (some, so-called "searchable") PDFs as a convenience for searching within the PDF file and isn't guaranteed to be accurate. (Talking only about PDFs created from scanned documents.)
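To put a rough number on how far a hidden text layer drifts from the page, the extracted text can be compared against a hand-checked transcription of the same passage. The Python sketch below uses the standard library's `difflib.SequenceMatcher`; the function name is made up for illustration, and the similarity ratio is only a convenient proxy, not a formal character error rate.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text: str, reference: str) -> float:
    """Similarity ratio in [0, 1] between OCR output and a trusted reference."""
    # autojunk=False keeps long texts from having common characters ignored.
    matcher = SequenceMatcher(None, ocr_text, reference, autojunk=False)
    return matcher.ratio()


# Typical OCR confusions: "m" read as "rn", digit "1" read as letter "l".
score = character_accuracy("Cornpanies Act l993", "Companies Act 1993")
print(f"approximate accuracy: {score:.0%}")
```

A score well below 1.0 on a clean-looking passage is a hint that the text layer is fine for searching but should not be quoted without checking it against the image.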