
Re: PDF and OCR

From	"Carlos E.R." <robin_listas@es.invalid>
Newsgroups	comp.os.linux.misc
Subject	Re: PDF and OCR
Date	2025-09-24 22:26 +0200
Message-ID	<2o0eqlxbki.ln2@Telcontar.valinor>
References	(22 earlier) <jn9uplxkof.ln2@Telcontar.valinor> <10aj95j$dlc4$1@dont-email.me> <tfr0qlxiup.ln2@Telcontar.valinor> <10aoi3d$1km2b$1@dont-email.me> <1rj6lrq.r8xympwcc2q1N%nospam@de-ster.demon.nl>

On 2025-09-24 20:43, J. J. Lodder wrote:
> Pancho <Pancho.Jones@protonmail.com> wrote:
> 
>> On 9/19/25 21:36, Carlos E.R. wrote:
>>> On 2025-09-19 11:52, Bertel Lund Hansen wrote:
>>>> Den 18.09.2025 kl. 23.21 skrev Carlos E.R.:
>>>>
>>>>>> https://hardwarecomputerist.atariverse.com/media/pdf/datasheet/
>>>>>> Western%20Digital%20FD1771%20-%20Specifications.pdf
>>>>>
>>>>> There is something weird with this PDF. Text looks fuzzy, as if the
>>>>> page was scanned or photocopied. But the text is selectable and
>>>>> searchable.
>>>>
>>>> It's even weirder. If you select some of the text, it will change.
>>>> Here's the last part of the first section copied:
>>>>
>>>>         defined by a program.:
>>>>         rned heade'r asshown below:
>>>>
>>> (radial paths) in sectors (arc sections) defined by a program.:
>>> rned heade'r asshown below:
>>>
>>> Right, I see it. Should be:
>>>
>>> (radial paths) in sectors (arc sections) defined by a program-
>>> med header as shown below:
>>>
>>> So it is an OCR with the text placed in the exact location of the
>>> graphics. I knew about PDFs with OCR but never saw one. I thought it was
>>> different "files" inside the pdf.
>>>
>>> Very nice, even if not perfect.
>>>
>>> I guess we can not do this in Linux :-?
>>>
>>
>> I'm not really following the thread, but you can add OCR to PDF files
>> using Linux. I use OCRmyPDF to add OCR to PDF files from my scanner.
> 
> It isn't really OCR.
> PDF files, that is, real ones,
> contain information to put characters in their proper places.
> 
> The characters are called by name, and converted into pixels
> only for a particular rendering.
> So the information is already inherent in the pdf,
> it only has to be put into usable form.
> Linux has nothing to do with it, pdf, based on Postscript,
> is a programming language in its own right,


No, that is not what I was referring to.

I was referring to scanning paper documents and generating a PDF that
contains both the scanned image and the OCR text obtained from it, in
such a way that we can select portions of the image with the mouse and
get the corresponding text instead. The PDF linked at the start of this
post is a perfect example.

And specifically doing that in Linux, locally, not using an online service.
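
A minimal sketch of how that could look locally, assuming the OCRmyPDF
tool mentioned earlier in the thread is installed and on the PATH. The
helper names (build_ocrmypdf_command, add_text_layer) and the file names
scan.pdf and scan_ocr.pdf are made up for illustration; only the
ocrmypdf command itself and its -l and --deskew options come from the
real tool. Treat it as a starting point, not a tested recipe:

```python
import shutil
import subprocess


def build_ocrmypdf_command(scanned_pdf, searchable_pdf, language="eng", deskew=True):
    """Return the argument list for one OCRmyPDF run (nothing is executed here)."""
    cmd = ["ocrmypdf", "-l", language]  # -l selects the OCR language
    if deskew:
        cmd.append("--deskew")  # straighten crooked scans before recognition
    cmd += [str(scanned_pdf), str(searchable_pdf)]
    return cmd


def add_text_layer(scanned_pdf, searchable_pdf, **options):
    """Run OCRmyPDF locally; the output keeps the scanned image and gains selectable text."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(
        build_ocrmypdf_command(scanned_pdf, searchable_pdf, **options),
        check=True,  # raise CalledProcessError if OCRmyPDF reports a failure
    )


if __name__ == "__main__":
    # Hypothetical file names; prints the command instead of running it.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", language="spa")))
```

Calling add_text_layer("scan.pdf", "scan_ocr.pdf") would then do the
whole job on the local machine, with no online service involved.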

-- 
Cheers, Carlos.
ES🇪🇸, EU🇪🇺;


Thread

Re: PDF and OCR Pancho <Pancho.Jones@protonmail.com> - 2025-09-21 10:55 +0100
  Re: PDF and OCR nospam@de-ster.demon.nl (J. J. Lodder) - 2025-09-24 20:43 +0200
    Re: PDF and OCR "Carlos E.R." <robin_listas@es.invalid> - 2025-09-24 22:26 +0200
    Re: PDF and OCR Pancho <Pancho.Jones@protonmail.com> - 2025-09-25 09:33 +0100
      Re: PDF and OCR "Carlos E.R." <robin_listas@es.invalid> - 2025-09-25 14:09 +0200
    Re: PDF and OCR The Natural Philosopher <tnp@invalid.invalid> - 2025-09-25 09:49 +0100
