Re: PDF and OCR

From	"Carlos E.R." <robin_listas@es.invalid>
Newsgroups	comp.os.linux.misc
Subject	Re: PDF and OCR
Date	2025-09-25 14:09 +0200
Message-ID	<71ofqlxjpt.ln2@Telcontar.valinor>
References	(22 earlier) <10aj95j$dlc4$1@dont-email.me> <tfr0qlxiup.ln2@Telcontar.valinor> <10aoi3d$1km2b$1@dont-email.me> <1rj6lrq.r8xympwcc2q1N%nospam@de-ster.demon.nl> <10b2uoi$67lv$1@dont-email.me>

On 2025-09-25 10:33, Pancho wrote:
> On 9/24/25 19:43, J. J. Lodder wrote:
>> Pancho <Pancho.Jones@protonmail.com> wrote:
>>
>>> On 9/19/25 21:36, Carlos E.R. wrote:
>>>> On 2025-09-19 11:52, Bertel Lund Hansen wrote:
>>>>> Den 18.09.2025 kl. 23.21 skrev Carlos E.R.:
>>>>>
>>>>>>> https://hardwarecomputerist.atariverse.com/media/pdf/datasheet/Western%20Digital%20FD1771%20-%20Specifications.pdf
>>>>>>
>>>>>> There is something weird with this PDF. Text looks fuzzy, as if the
>>>>>> page was scanned or photocopied. But the text is selectable and
>>>>>> searchable.
>>>>>
>>>>> It's even weirder. If you select some of the text, it will change.
>>>>> Here's the last part of the first section copied:
>>>>>
>>>>>         defined by a program.:
>>>>>         rned heade'r asshown below:
>>>>>
>>>> (radial paths) in sectors (arc sections) defined by a program.:
>>>> rned heade'r asshown below:
>>>>
>>>> Right, I see it. Should be:
>>>>
>>>> (radial paths) in sectors (arc sections) defined by a program-
>>>> med header as shown below:
>>>>
>>>> So it is OCR, with the text placed at the exact location of the
>>>> graphics. I knew about PDFs with OCR but had never seen one. I thought
>>>> they were different "files" inside the PDF.
>>>>
>>>> Very nice, even if not perfect.
>>>>
>>>> I guess we cannot do this in Linux :-?
>>>>
>>>
>>> I'm not really following the thread, but you can add OCR to PDF files
>>> using Linux. I use OCRmyPDF to add OCR to PDF files from my scanner.
>>
>> It isn't really OCR.
>> PDF files, that is, real ones,
>> contain information to put characters in their proper places.
>>
>> The characters are called by name, and converted into pixels
>> only for a particular rendering.
>> So the information is already inherent in the pdf,
>> it only has to be put into usable form.
>> Linux has nothing to do with it, pdf, based on Postscript,
>> is a programming language in its own right,
>>
> 
> PDF files can be constructed from text, as you described, but they can
> also be constructed from images: JPEG, TIFF, etc.
>
> When I scan documents, the scanner creates TIFF or PDF files with no
> OCR; the files are effectively images, pictures. The problem is that
> you can't perform a text search on these scanned documents.
> 
> OCR can take these document pictures and extract the text from them.
> The way this is implemented in PDF OCR is that the original picture is
> kept and a text layer is added to it. Each word in the text layer is
> then mapped to the coordinate position of the word in the image it was
> recognised from. Hence, it appears to the user as if the image of a
> word is mapped to the text version of the word. The positioning isn't
> as precise as in a genuine text document, but it is good enough to
> use.
> 
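
A minimal sketch of the word-to-position mapping described above, in
Python. It only illustrates the idea; it is not how OCRmyPDF or the PDF
format actually stores the text layer. The names OcrWord, find_word and
word_at are made up for this example, and so are the coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One recognised word plus the box it occupies on the page image."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def find_word(layer, query):
    """Return the boxes of all words equal to query, ignoring case."""
    q = query.lower()
    return [(w.x0, w.y0, w.x1, w.y1) for w in layer if w.text.lower() == q]


def word_at(layer, x, y):
    """Return the word whose box contains the point (x, y), or None."""
    for w in layer:
        if w.x0 <= x <= w.x1 and w.y0 <= y <= w.y1:
            return w.text
    return None


# Hypothetical text layer for two words on one scanned page.
layer = [
    OcrWord("programmed", 72.0, 700.0, 140.0, 712.0),
    OcrWord("header", 144.0, 700.0, 182.0, 712.0),
]
```

Searching goes from text to boxes (find_word), and selecting with the
mouse goes from a point on the picture to text (word_at), which is why
the selected text can differ from what the picture shows when the OCR
misreads a word.
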
> As I said, OCRmyPDF works on Linux
> <https://github.com/ocrmypdf/OCRmyPDF>. In Debian (or Raspberry Pi OS)
> it can be installed via apt. I've used it for years. My scanner writes
> scans to a network folder. A Docker service runs on a Pi 4 watching the
> folders. When it sees a new file, it applies OCRmyPDF to it to create a
> searchable PDF file, which it moves to another folder.
> 
> This process isn't perfect; occasionally OCRmyPDF crashes, but it is
> pretty good. Most of the problems I had were to do with watching the
> folder and proper document flow, organising proper states, etc. But
> that is due to not having the time to devote to setting it up robustly,
> rather than a problem with OCRmyPDF.
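
For the record, a rough sketch of such a watched folder in Python,
assuming the ocrmypdf command is on the PATH and is called in its basic
form "ocrmypdf input.pdf output.pdf". This is not Pancho's actual
setup: the function names, the polling interval and the folder layout
are my own assumptions, and it writes the result to the output folder
rather than moving the original.

```python
import subprocess
import time
from pathlib import Path


def build_command(src, dst):
    """Basic OCRmyPDF invocation: input file, then output file."""
    return ["ocrmypdf", str(src), str(dst)]


def process_once(inbox, outbox, run=subprocess.run):
    """OCR every PDF in inbox that has no result in outbox yet.

    Returns the list of output paths created in this pass. The run
    argument exists only so the function can be tried without OCRmyPDF.
    """
    done = []
    for src in sorted(Path(inbox).glob("*.pdf")):
        dst = Path(outbox) / src.name
        if dst.exists():
            continue  # already processed in an earlier pass
        try:
            run(build_command(src, dst), check=True)
        except (subprocess.CalledProcessError, OSError):
            continue  # OCR failed or crashed; retried on the next pass
        done.append(dst)
    return done


def watch(inbox, outbox, interval=10.0):
    """Poll inbox forever. Assumes the scanner has finished writing a
    file by the time it is picked up, which a real setup should check."""
    while True:
        process_once(inbox, outbox)
        time.sleep(interval)
```

Calling watch("/path/to/scans", "/path/to/searchable") would start the
loop. A file that always fails is retried on every pass, so a more
robust version would track states, which is the part Pancho says took
the most effort.
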

Ok, I'm taking note of that software.

-- 
Cheers, Carlos.
ES🇪🇸, EU🇪🇺;
