pdf & O.C.R ?

Started by	Unknown <dog@gmail.com>
First post	2015-05-23 07:49 +0000
Last post	2015-05-29 20:31 -0400
Articles	9 — 5 participants

  pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-23 07:49 +0000
    Re: pdf & O.C.R ? Bob Tennent <BobT@cs.queensu.ca> - 2015-05-23 11:13 +0000
      Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-27 17:11 +0000
    Re: pdf & O.C.R ? John-Paul Stewart <jpstewart@sympatico.ca> - 2015-05-23 20:46 -0400
      Re: pdf & O.C.R ? Joe Beanfish <joebeanfish@nospam.duh> - 2015-05-26 13:26 +0000
        Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-06-13 13:29 +0000
          Re: pdf & O.C.R ? Robert Heller <heller@deepsoft.com> - 2015-06-13 12:52 -0500
      Re: pdf & O.C.R ? Unknown <dog@gmail.com> - 2015-05-27 17:10 +0000
        Re: pdf & O.C.R ? John-Paul Stewart <jpstewart@sympatico.ca> - 2015-05-29 20:31 -0400

#14841 — pdf & O.C.R ?

From	Unknown <dog@gmail.com>
Date	2015-05-23 07:49 +0000
Subject	pdf & O.C.R ?
Message-ID	<pan.2015.05.23.07.50.46@gmail.com>

I'm confused and disturbed that xpdf of:
http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
  is perfect to the pixel, with maximum magnification [400%],
  which is expected, since it's computer-font generated, whereas:
http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
20LAW%20ACT.pdf
  shows blotchy and fibers as if it's a photo-of-a-paper-copy.

And scanned copies of papers are apparently normal.

BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
from COMPANY%20LAW%20ACT.pdf ?
That would mean that the mouse-driver is doing O.C.R.   ?!
And mc's viewer [which uses <pdftotext> ] reads this text.

Is this some new O.C.R. which I could use on jpg-ed pages of text?

==Thanks for any answers.

#14843

From	Bob Tennent <BobT@cs.queensu.ca>
Date	2015-05-23 11:13 +0000
Message-ID	<slrnmm0o6n.et2.BobT@linus.cs.queensu.ca>
In reply to	#14841

On Sat, 23 May 2015 07:49:37 +0000 (UTC), Unknown wrote:
 > I'm confused and disturbed that xpdf of:
 > http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
 >   is perfect to the pixel, with maximum magnification [400%],
 >   which is expected, since it's computer-font generated, whereas:
 > http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
 > 20LAW%20ACT.pdf
 >   shows blotchy and fibers as if it's a photo-of-a-paper-copy.

That second link is broken. Google found

http://www.legislation.govt.nz/act/public/1993/0105/latest/096be8ed8109c926.pdf

which is fine at 400%.

Bob T,

#14881

From	Unknown <dog@gmail.com>
Date	2015-05-27 17:11 +0000
Message-ID	<pan.2015.05.27.17.11.57@gmail.com>
In reply to	#14843

On Sat, 23 May 2015 11:13:27 +0000, Bob Tennent wrote:

> On Sat, 23 May 2015 07:49:37 +0000 (UTC), Unknown wrote:
>  > I'm confused and disturbed that xpdf of:
>  > http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>  >   is perfect to the pixel, with maximum magnification [400%], which
>  >   is expected, since it's computer-font generated, whereas:
>  > http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>  > 20LAW%20ACT.pdf
>  >   shows blotchy and fibers as if it's a photo-of-a-paper-copy.
> 
> That second link is broken. Google found
> 
> http://www.legislation.govt.nz/act/public/1993/0105/latest/096be8ed8109c926.pdf
> 
> which is fine at 400%.
> 
> Bob T,

Did you see that the long-URL is folded over 2 lines -- now ?
John-Paul Stewart gives a spooky explanation.

#14850

From	John-Paul Stewart <jpstewart@sympatico.ca>
Date	2015-05-23 20:46 -0400
Message-ID	<bjj73c-fim.ln1@mail.binaryfoundry.ca>
In reply to	#14841

On 23/05/15 03:49 AM, Unknown wrote:
> I'm confused and disturbed that xpdf of:
> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>    is perfect to the pixel, with maximum magnification [400%],
>    which is expected, since it's computer-font generated, whereas:
> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
> 20LAW%20ACT.pdf
>    shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>
> And scanned copies of papers are apparently normal.
>
> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
> from COMPANY%20LAW%20ACT.pdf ?
> That would mean that the mouse-driver is doing O.C.R.   ?!

Why would you think the mouse driver is doing OCR?

A PDF file can contain both text and images.  It is common when scanning 
paper documents to turn them into a so-called "searchable PDF" that 
contains the scanned image of the page overlaid on top of the (OCRed) 
text.  So what you see visually is the (possibly blurry) picture, while 
what the mouse is copying (and pdftotext is extracting) is the text 
that's hidden underneath.
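
One way to see the two layers described above is to look at the drawing operators in a page's content stream. The sketch below is only an illustration under several assumptions: the stream has already been decompressed (real PDFs usually store it compressed), the sample stream and the names `describe_page_layers`, `/F1` and `/Im0` are made up, and the OCR tool is assumed to hide its text with text rendering mode 3 (`3 Tr`, neither filled nor stroked). That is one common approach; text can also simply be painted first and then covered by the image.

```python
import re


def describe_page_layers(content: bytes) -> dict:
    """Summarize what a decoded PDF page content stream appears to draw.

    This is a naive pattern match, not a PDF parser: operator names that
    happen to occur inside string literals would be miscounted.
    """
    # "Do" paints a named external object, typically the scanned page image.
    paints_image = re.search(rb"/\S+\s+Do\b", content) is not None
    # "Tj" and "TJ" are the operators that show text.
    shows_text = re.search(rb"\b(?:Tj|TJ)\b", content) is not None
    # "3 Tr" selects text rendering mode 3 (invisible text).
    invisible_text = re.search(rb"\b3\s+Tr\b", content) is not None
    return {
        "image": paints_image,
        "text": shows_text,
        "invisible_text": invisible_text,
    }


# Hypothetical page of a scanned, "searchable" PDF: hidden text plus an image.
SCANNED_PAGE = b"""
BT
3 Tr
/F1 10 Tf
72 700 Td
(COMPANIES ACT) Tj
ET
q
612 0 0 792 0 0 cm
/Im0 Do
Q
"""

# Hypothetical page of a PDF generated directly from fonts: visible text only.
TYPESET_PAGE = b"BT /F1 10 Tf 72 700 Td (Project Oberon) Tj ET"

if __name__ == "__main__":
    print(describe_page_layers(SCANNED_PAGE))
    print(describe_page_layers(TYPESET_PAGE))
```

A viewer such as xpdf, or pdftotext, reads the text operators either way, which is why selecting with the mouse returns text even though only the image is visible.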

Adobe's own Acrobat software can create such "searchable PDF" files. 
I'm sure there are other tools, too.
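
As one example of such other tools, the open-source Tesseract OCR engine can write a searchable PDF from a page image, which would also cover JPEG pages of text. The sketch below only builds the command lines and runs them if the programs are installed. The file names are hypothetical, and it assumes the `tesseract` and `pdftotext` executables are on the PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def searchable_pdf_command(image_path: str, output_base: str) -> List[str]:
    """Command that OCRs an image and writes <output_base>.pdf."""
    # The trailing "pdf" asks Tesseract for PDF output: the page image
    # together with a hidden layer of the recognized text.
    return ["tesseract", image_path, output_base, "pdf"]


def extract_text_command(pdf_path: str) -> List[str]:
    """Command that prints the text layer of a PDF."""
    # "-" tells pdftotext to write to standard output instead of a file.
    return ["pdftotext", pdf_path, "-"]


def run_if_available(command: List[str]) -> Optional[str]:
    """Run the command and return its output, or None if the tool is missing."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # "page.jpg" is a placeholder for a scanned page of text.
    run_if_available(searchable_pdf_command("page.jpg", "page"))
    print(run_if_available(extract_text_command("page.pdf")))
```

The extracted text is only as good as the OCR run that produced it, so it is worth checking against the page image.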

#14876

From	Joe Beanfish <joebeanfish@nospam.duh>
Date	2015-05-26 13:26 +0000
Message-ID	<mk1sam$r2l$1@dont-email.me>
In reply to	#14850

On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:

> On 23/05/15 03:49 AM, Unknown wrote:
>> I'm confused and disturbed that xpdf of:
>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>    is perfect to the pixel, with maximum magnification [400%], which is
>>    expected, since it's computer-font generated, whereas:
>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>> 20LAW%20ACT.pdf
>>    shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>
>> And scanned copies of papers are apparently normal.
>>
>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>> from COMPANY%20LAW%20ACT.pdf ?
>> That would mean that the mouse-driver is doing O.C.R.   ?!
> 
> Why would you think the mouse driver is doing OCR?
> 
> A PDF file can contain both text and images.  It is common when scanning
> paper documents to turn them into a so-called "searchable PDF" that
> contains the scanned image of the page overlaid on top of the (OCRed)
> text.  So what you see visually is the (possibly blurry) picture, while
> what the mouse is copying (and pdftotext is extracting) is the text
> that's hidden underneath.
> 
> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
> sure there are other tools, too.

Yeah, it's kinda interesting: when your workstation's bogged down and the
PDF is big, you might see the OCR text render first, then the image will
render, covering it up. Or maybe that only happens in the browser when
it's downloading and hasn't gotten to the image yet? Haven't seen it
happen in a while.

#14921

From	Unknown <dog@gmail.com>
Date	2015-06-13 13:29 +0000
Message-ID	<pan.2015.06.13.13.30.56@gmail.com>
In reply to	#14876

On Tue, 26 May 2015 13:26:46 +0000, Joe Beanfish wrote:

> On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
> 
>> On 23/05/15 03:49 AM, Unknown wrote:
>>> I'm confused and disturbed that xpdf of:
>>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>>    is perfect to the pixel, with maximum magnification [400%], which
>>>    is expected, since it's computer-font generated, whereas:
>>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>>> 20LAW%20ACT.pdf
>>>    shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>>
>>> And scanned copies of papers are apparently normal.
>>>
>>> BUT!! How is it that xpdf allows me to extract the text, via
>>> mouse-copy from COMPANY%20LAW%20ACT.pdf ?
>>> That would mean that the mouse-driver is doing O.C.R.   ?!
>> 
>> Why would you think the mouse driver is doing OCR?
>> 
>> A PDF file can contain both text and images.  It is common when
>> scanning paper documents to turn them into a so-called "searchable PDF"
>> that contains the scanned image of the page overlaid on top of the
>> (OCRed) text.  So what you see visually is the (possibly blurry)
>> picture, while what the mouse is copying (and pdftotext is extracting)
>> is the text that's hidden underneath.
>> 
>> Adobe's own Acrobat software can create such "searchable PDF" files.
>> I'm sure there are other tools, too.
> 
This is TOO-MUCH!!
You mean they send the original-keyed-in-pdftotextable, AND the graphical
image of the crumpled-paper-version <overlaid>. What's the aim of such
expensive deception?

> Yeah, It's kinda interesting when your workstation's bogged down and the
> pdf is big you might see the OCR text render first, then the image will
> render, covering it up. Or maybe that only happens in the browser when
> it's downloading and hasn't gotten to the image yet? Haven't seen it
> happen in a while.

#14922

From	Robert Heller <heller@deepsoft.com>
Date	2015-06-13 12:52 -0500
Message-ID	<CbudnbZdRv1Q8OHInZ2dnUU7-SmdnZ2d@giganews.com>
In reply to	#14921

At Sat, 13 Jun 2015 13:29:22 +0000 (UTC) Unknown <dog@gmail.com> wrote:

> 
> On Tue, 26 May 2015 13:26:46 +0000, Joe Beanfish wrote:
> 
> > On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
> > 
> >> On 23/05/15 03:49 AM, Unknown wrote:
> >>> I'm confused and disturbed that xpdf of:
> >>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
> >>>    is perfect to the pixel, with maximum magnification [400%], which
> >>>    is expected, since it's computer-font generated, whereas:
> >>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
> >>> 20LAW%20ACT.pdf
> >>>    shows blotchy and fibers as if it's a photo-of-a-paper-copy.
> >>>
> >>> And scanned copies of papers are apparently normal.
> >>>
> >>> BUT!! How is it that xpdf allows me to extract the text, via
> >>> mouse-copy from COMPANY%20LAW%20ACT.pdf ?
> >>> That would mean that the mouse-driver is doing O.C.R.   ?!
> >> 
> >> Why would you think the mouse driver is doing OCR?
> >> 
> >> A PDF file can contain both text and images.  It is common when
> >> scanning paper documents to turn them into a so-called "searchable PDF"
> >> that contains the scanned image of the page overlaid on top of the
> >> (OCRed) text.  So what you see visually is the (possibly blurry)
> >> picture, while what the mouse is copying (and pdftotext is extracting)
> >> is the text that's hidden underneath.
> >> 
> >> Adobe's own Acrobat software can create such "searchable PDF" files.
> >> I'm sure there are other tools, too.
> > 
> This is TOO-MUCH!!
> You mean they send the original-keyed-in-pdftotextable, AND the graphical
> image of the crumpled-paper-version <overlaid>. What's the aim of such
> expensive deception?

The text is not necessarily keyed in. It could in fact be 'crudely' OCRed,
which means it won't be accurate. The 'original' has additional features, like
signatures, seals, or original handwriting. Remember handwriting? You know,
back in the old days before people had computer keyboards they actually wrote
stuff down with these things called 'pens'. Also, sometimes the
'crumpled-paper-version' might include non-textual content -- drawings, maps,
charts, etc.


> 
> > Yeah, It's kinda interesting when your workstation's bogged down and the
> > pdf is big you might see the OCR text render first, then the image will
> > render, covering it up. Or maybe that only happens in the browser when
> > it's downloading and hasn't gotten to the image yet? Haven't seen it
> > happen in a while.
> 
>   

-- 
Robert Heller             -- 978-544-6933
Deepwoods Software        -- Custom Software Services
http://www.deepsoft.com/  -- Linux Administration Services
heller@deepsoft.com       -- Webhosting Services

#14880

From	Unknown <dog@gmail.com>
Date	2015-05-27 17:10 +0000
Message-ID	<pan.2015.05.27.17.12.03@gmail.com>
In reply to	#14850

On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:

> On 23/05/15 03:49 AM, Unknown wrote:
>> I'm confused and disturbed that xpdf of:
>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>    is perfect to the pixel, with maximum magnification [400%], which is
>>    expected, since it's computer-font generated, whereas:
>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>> 20LAW%20ACT.pdf
>>    shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>
>> And scanned copies of papers are apparently normal.
>>
>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>> from COMPANY%20LAW%20ACT.pdf ?
>> That would mean that the mouse-driver is doing O.C.R.   ?!
> 
> Why would you think the mouse driver is doing OCR?
> 
OBVIOUSLY from my description there's pdftotext happening via mouse.

> A PDF file can contain both text and images.  It is common when scanning
> paper documents to turn them into a so-called "searchable PDF" that
> contains the scanned image of the page overlaid on top of the (OCRed)
> text.  So what you see visually is the (possibly blurry) picture, while
> what the mouse is copying (and pdftotext is extracting) is the text
> that's hidden underneath.
> 
> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
> sure there are other tools, too.

What extreme deception. There's a layer of pixel-perfect pdftotext-able,
covered by the blurry photo-image ?!

#14883

From	John-Paul Stewart <jpstewart@sympatico.ca>
Date	2015-05-29 20:31 -0400
Message-ID	<rvcn3c-jg9.ln1@mail.binaryfoundry.ca>
In reply to	#14880

On 27/05/15 01:10 PM, Unknown wrote:
> On Sat, 23 May 2015 20:46:03 -0400, John-Paul Stewart wrote:
>
>> On 23/05/15 03:49 AM, Unknown wrote:
>>> I'm confused and disturbed that xpdf of:
>>> http://www.inf.ethz.ch/personal/wirth/ProjectOberon/PO.Computer.pdf
>>>     is perfect to the pixel, with maximum magnification [400%], which is
>>>     expected, since it's computer-font generated, whereas:
>>> http://www.northernlaw.co.za/images/stories/files/actsandbills/COMPANY%
>>> 20LAW%20ACT.pdf
>>>     shows blotchy and fibers as if it's a photo-of-a-paper-copy.
>>>
>>> And scanned copies of papers are apparently normal.
>>>
>>> BUT!! How is it that xpdf allows me to extract the text, via mouse-copy
>>> from COMPANY%20LAW%20ACT.pdf ?
>>> That would mean that the mouse-driver is doing O.C.R.   ?!
>>
>> Why would you think the mouse driver is doing OCR?
>>
> OBVIOUSLY from my description there's pdftotext happening via mouse.

No, not at all.  The mouse is merely selecting/copying some text that is 
already there.

>> A PDF file can contain both text and images.  It is common when scanning
>> paper documents to turn them into a so-called "searchable PDF" that
>> contains the scanned image of the page overlaid on top of the (OCRed)
>> text.  So what you see visually is the (possibly blurry) picture, while
>> what the mouse is copying (and pdftotext is extracting) is the text
>> that's hidden underneath.
>>
>> Adobe's own Acrobat software can create such "searchable PDF" files. I'm
>> sure there are other tools, too.
>
> What extreme deception. There's a layer of pixel-perfect pdftotext-able,
> covered by the blurry photo-image ?!

Again, no.  There's no deception and nothing "pixel-perfect" about the 
hidden text.  It will have lost nearly all formatting during the OCR 
process.  The text itself might be plain wrong (about 90% accurate, 
IME), due to the complexities of OCR on poor quality images.  The image 
of a scanned document will be more reliable for a human reader than the 
OCRed text.  The text is provided in (some, so-called "searchable") PDFs 
as a convenience for searching within the PDF file and isn't guaranteed 
to be accurate.  (Talking only about PDFs created from scanned documents.)
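
A rough way to put a number on OCR accuracy is to compare the extracted text against a hand-checked transcription, character by character. The sketch below uses edit distance for that. The function names and the sample strings are invented for illustration, and this is only one of several possible accuracy measures, not necessarily the one behind the 90% estimate above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Fraction of reference characters the OCR got right, floored at 0."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Made-up example: a zero read in place of the letter "o",
    # one wrong character out of ten.
    print(character_accuracy("memorandum", "mem0randum"))
```

At that error rate a typical line of text contains several wrong characters, which is why the hidden layer is useful for searching but not a substitute for reading the image.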
