Re: PDF and OCR

From Pancho <Pancho.Jones@protonmail.com>
Newsgroups comp.os.linux.misc
Subject Re: PDF and OCR
Date 2025-09-25 09:33 +0100
Organization A noiseless patient Spider
Message-ID <10b2uoi$67lv$1@dont-email.me>
References (22 earlier) <jn9uplxkof.ln2@Telcontar.valinor> <10aj95j$dlc4$1@dont-email.me> <tfr0qlxiup.ln2@Telcontar.valinor> <10aoi3d$1km2b$1@dont-email.me> <1rj6lrq.r8xympwcc2q1N%nospam@de-ster.demon.nl>


On 9/24/25 19:43, J. J. Lodder wrote:
> Pancho <Pancho.Jones@protonmail.com> wrote:
> 
>> On 9/19/25 21:36, Carlos E.R. wrote:
>>> On 2025-09-19 11:52, Bertel Lund Hansen wrote:
>>>> Den 18.09.2025 kl. 23.21 skrev Carlos E.R.:
>>>>
>>>>>> https://hardwarecomputerist.atariverse.com/media/pdf/datasheet/
>>>>>> Western%20Digital%20FD1771%20-%20Specifications.pdf
>>>>>
>>>>> There is something weird with this PDF. Text looks fuzzy, as if the
>>>>> page was scanned or photocopied. But the text is selectable and
>>>>> searchable.
>>>>
>>>> It's even weirder. If you select some of the text, it will change.
>>>> Here's the last part of the first section copied:
>>>>
>>>>         defined by a program.:
>>>>         rned heade'r asshown below:
>>>>
>>> (radial paths) in sectors (arc sections) defined by a program.:
>>> rned heade'r asshown below:
>>>
>>> Right, I see it. Should be:
>>>
>>> (radial paths) in sectors (arc sections) defined by a program-
>>> med header as shown below:
>>>
>>> So it is an OCR with the text placed in the exact location of the
>>> graphics. I knew about PDFs with OCR but never saw one. I thought it was
>>> different "files" inside the pdf.
>>>
>>> Very nice, even if not perfect.
>>>
>>> I guess we can not do this in Linux :-?
>>>
>>
>> I'm not really following the thread, but you can add OCR to PDF files
>> using Linux. I use OCRmyPDF to add OCR to PDF files from my scanner.
> 
> It isn't really OCR.
> PDF files, that is, real ones,
> contain information to put characters in their proper places.
> 
> The characters are called by name, and converted into pixels
> only for a particular rendering.
> So the information is already inherent in the pdf,
> it only has to be put into usable form.
> Linux has nothing to do with it, pdf, based on Postscript,
> is a programming language in its own right,
> 

PDF files can be constructed from text, as you described, but they can
also be constructed from images (JPEG, TIFF, etc.).

When I scan documents, the scanner creates TIFF or PDF files. There is
no OCR; the files are effectively just pictures. The problem is that you
can't perform a text search on these scanned documents.

OCR can take these document pictures and extract the text from them.
The way this is implemented in PDF OCR is that the original page image
is kept, and an invisible text layer is added to it. Each word in the
text layer is placed at the coordinate position of the word in the image
it was recognised from. Hence, it appears to the user as if the image of
a word is itself selectable, searchable text. The positioning isn't
perfect, as it would be in a genuine text document, but it is good
enough to use.
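
To make the coordinate mapping concrete, here is a minimal Python sketch
of the arithmetic involved. It is not OCRmyPDF's code; the function name
and the example box are made up for illustration. It assumes the OCR
engine reports each word as a pixel box with the origin at the top-left
of the scan, and that the page is exactly the size of the image at the
given DPI, with no rotation or cropping.

```python
PDF_POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


def pixel_box_to_pdf(box, image_height_px, dpi):
    """Convert a word box from image pixels to PDF points.

    box is (left, top, right, bottom) in pixels, origin at the top-left
    of the scan (an assumption about the OCR engine's output).
    Returns (x0, y0, x1, y1) in points, origin at the bottom-left of the
    page, which is the default PDF coordinate system.
    """
    left, top, right, bottom = box

    def to_points(pixels):
        return pixels * PDF_POINTS_PER_INCH / dpi

    x0 = to_points(left)
    x1 = to_points(right)
    # Flip the vertical axis: image y grows downwards, PDF y grows upwards.
    y0 = to_points(image_height_px - bottom)
    y1 = to_points(image_height_px - top)
    return (x0, y0, x1, y1)


# Hypothetical word one inch in from the top-left corner of a
# US Letter page scanned at 300 DPI (2550 x 3300 pixels).
print(pixel_box_to_pdf((300, 300, 600, 360), image_height_px=3300, dpi=300))
# -> (72.0, 705.6, 144.0, 720.0)
```

The invisible word is then drawn inside that rectangle, which is why
selecting text in such a PDF highlights roughly, but not exactly, the
scanned word underneath.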

As I said, OCRmyPDF works on Linux
<https://github.com/ocrmypdf/OCRmyPDF>. In Debian (or Raspberry Pi OS)
it can be installed via apt. I've used it for years. My scanner writes
scans to a network folder. A Docker service runs on a Pi 4, watching
that folder. When it sees a new file, it runs OCRmyPDF on it to create a
searchable PDF file, which it moves to another folder.
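
For anyone wanting to try the same kind of flow, a rough Python sketch
of a polling watcher follows. It is not my actual setup, just an
illustration: the function names, the three-folder layout (inbox,
outbox, failed) and the polling interval are all assumptions. The only
OCRmyPDF-specific part is the basic command line, "ocrmypdf INPUT
OUTPUT".

```python
import subprocess
import time
from pathlib import Path


def build_command(src: Path, dst: Path) -> list[str]:
    """Basic OCRmyPDF invocation: ocrmypdf INPUT OUTPUT."""
    return ["ocrmypdf", str(src), str(dst)]


def process_inbox(inbox: Path, outbox: Path, failed: Path,
                  run=subprocess.run) -> list[Path]:
    """Run OCR once over every PDF in inbox; return the PDFs written.

    run is injectable so the flow can be exercised without OCRmyPDF.
    """
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        result = run(build_command(src, dst))
        if result.returncode == 0:
            src.unlink()  # scan handled, remove it from the inbox
            done.append(dst)
        else:
            # Park failures so a crashing file is not retried forever.
            src.rename(failed / src.name)
    return done


def watch(inbox: Path, outbox: Path, failed: Path,
          interval: float = 10.0) -> None:
    """Poll the inbox forever (hypothetical layout, simple polling)."""
    for folder in (inbox, outbox, failed):
        folder.mkdir(parents=True, exist_ok=True)
    while True:
        process_inbox(inbox, outbox, failed)
        time.sleep(interval)
```

Note that this sketch does not check whether the scanner has finished
writing a file before OCR starts, and the rename assumes the failed
folder is on the same filesystem as the inbox. A robust version would
need to handle both.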

This process isn't perfect. OCRmyPDF occasionally crashes, but it is
pretty good. Most of the problems I had were to do with watching the
folder and managing the document flow (tracking the state of each file,
etc.). But that is due to my not having the time to set it up robustly,
rather than a problem with OCRmyPDF.



> Jan
> 
