Path: csiph.com!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: "Carlos E.R." <robin_listas@es.invalid>
Newsgroups: comp.os.linux.misc
Subject: Re: PDF and OCR
Date: Thu, 25 Sep 2025 14:09:43 +0200
Lines: 88
Message-ID: <71ofqlxjpt.ln2@Telcontar.valinor>
References: <l0kknlxj6r.ln2@Telcontar.valinor> <10a68ql$16tjt$1@dont-email.me> <68c6bbc5$0$402$426a74cc@news.free.fr> <10a6rp4$1d082$5@dont-email.me> <2d9jplxvcn.ln2@Telcontar.valinor> <10a6t8d$1d082$8@dont-email.me> <4cnjplxbgm.ln2@Telcontar.valinor> <mipa1vFg9pkU2@mid.individual.net> <101fck52laaigefq5tubi6i7b0qpccmuic@4ax.com> <lNLxQ.20194$1gR1.17877@fx12.iad> <9DOdncYo-vBzE1r1nZ2dnZfqnPadnZ2d@giganews.com> <10a8mbc$1q6g1$8@dont-email.me> <YtXxQ.220966$oyoc.192887@fx15.iad> <mirchsFqpepU2@mid.individual.net> <flrnplx7qj.ln2@Telcontar.valinor> <10aca8j$2odbt$1@dont-email.me> <20250916133411.00001c32@gmail.com> <fo5pplxp2t.ln2@Telcontar.valinor> <10ae1n4$34dgl$7@dont-email.me> <mn.8a8a7e992b5784a4.127094@snitoo> <mj0ishFmo7vU5@mid.individual.net> <jn9uplxkof.ln2@Telcontar.valinor> <10aj95j$dlc4$1@dont-email.me> <tfr0qlxiup.ln2@Telcontar.valinor> <10aoi3d$1km2b$1@dont-email.me> <1rj6lrq.r8xympwcc2q1N%nospam@de-ster.demon.nl> <10b2uoi$67lv$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Mozilla Thunderbird
Content-Language: es-ES, en-CA
In-Reply-To: <10b2uoi$67lv$1@dont-email.me>
Xref: csiph.com comp.os.linux.misc:75198

On 2025-09-25 10:33, Pancho wrote:
> On 9/24/25 19:43, J. J. Lodder wrote:
>> Pancho <Pancho.Jones@protonmail.com> wrote:
>>
>>> On 9/19/25 21:36, Carlos E.R. wrote:
>>>> On 2025-09-19 11:52, Bertel Lund Hansen wrote:
>>>>> Den 18.09.2025 kl. 23.21 skrev Carlos E.R.:
>>>>>
>>>>>>> https://hardwarecomputerist.atariverse.com/media/pdf/datasheet/
>>>>>>> Western%20Digital%20FD1771%20-%20Specifications.pdf
>>>>>>
>>>>>> There is something weird with this PDF. Text looks fuzzy, as if the
>>>>>> page was scanned or photocopied. But the text is selectable and
>>>>>> searchable.
>>>>>
>>>>> It's even weirder. If you select some of the text, it will change.
>>>>> Here's the last part of the first section copied:
>>>>>
>>>>>         defined by a program.:
>>>>>         rned heade'r asshown below:
>>>>>
>>>> (radial paths) in sectors (arc sections) defined by a program.:
>>>> rned heade'r asshown below:
>>>>
>>>> Right, I see it. Should be:
>>>>
>>>> (radial paths) in sectors (arc sections) defined by a program-
>>>> med header as shown below:
>>>>
>>>> So it is an OCR with the text placed in the exact location of the
>>>> graphics. I knew about PDFs with OCR but never saw one. I thought it
>>>> was different "files" inside the pdf.
>>>>
>>>> Very nice, even if not perfect.
>>>>
>>>> I guess we cannot do this in Linux :-?
>>>>
>>>
>>> I'm not really following the thread, but you can add OCR to PDF files
>>> using Linux. I use OCRmyPDF to add OCR to PDF files from my scanner.
>>
>> It isn't really OCR.
>> PDF files, that is, real ones,
>> contain information to put characters in their proper places.
>>
>> The characters are called by name, and converted into pixels
>> only for a particular rendering.
>> So the information is already inherent in the pdf,
>> it only has to be put into usable form.
>> Linux has nothing to do with it, pdf, based on Postscript,
>> is a programming language in its own right,
>>
> 
> PDF files can be constructed from text, as you described, but they can
> also be constructed from images (JPEG, TIFF, etc.).
> 
> When I scan documents, the scanner creates TIFF or PDF files with no
> OCR; the files are effectively images. The problem is that you can't
> perform a text search on these scanned documents.
> 
> OCR can take these document pictures and create a text file from them. 
> The way this is implemented in PDF OCR is that the original picture file 
> is kept, and a new text file is added to it. Each word in the text file 
> is then mapped to the coordinate position of the word in the image file 
> it was created from. Hence, it appears to the user as if the image of a 
> word is mapped to the text version of the word. The positioning isn't 
> perfect, like you would see with a genuine text document, but it is good 
> enough to use.
> 
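
To make that mapping concrete, here is a minimal sketch in Python. It is
not how OCRmyPDF or any PDF viewer actually implements it: the Word class,
the coordinates and words_in_selection are made up for illustration, and I
assume image-style coordinates with y growing downwards. Each recognised
word carries the box of the image region it came from, and selecting a
region returns the words whose boxes overlap it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR'd word plus the box it occupies on the page image (hypothetical)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downwards in this sketch)
    x1: float  # right edge
    y1: float  # bottom edge


def words_in_selection(words, sx0, sy0, sx1, sy1):
    """Return the text of all words whose box overlaps the selection rectangle."""
    hits = [
        w for w in words
        if w.x0 < sx1 and w.x1 > sx0 and w.y0 < sy1 and w.y1 > sy0
    ]
    # Reading order: top to bottom, then left to right.
    hits.sort(key=lambda w: (w.y0, w.x0))
    return " ".join(w.text for w in hits)


# Invented positions for a few words of the datasheet excerpt quoted above.
page_words = [
    Word("defined", 10, 100, 60, 112),
    Word("by", 65, 100, 80, 112),
    Word("a", 85, 100, 92, 112),
    Word("program-", 97, 100, 160, 112),
    Word("med", 10, 115, 40, 127),
    Word("header", 45, 115, 95, 127),
]

print(words_in_selection(page_words, 0, 98, 95, 113))  # defined by a
```

If the OCR misreads a word, the box is still in the right place but the
text behind it is wrong, which is what we saw when copying from that
datasheet.
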
> As I said, OCRmyPDF works on Linux
> <https://github.com/ocrmypdf/OCRmyPDF>. In Debian (or Raspberry Pi OS)
> it can be installed via apt. I've used it for years. My scanner writes
> scans to a network folder. A Docker service runs on a Pi 4, watching the
> folder. When it sees a new file, it applies OCRmyPDF to it to create a
> searchable PDF file, which it moves to another folder.
> 
> This process isn't perfect; occasionally OCRmyPDF crashes, but it is
> pretty good. Most of the problems I had were to do with watching the
> folder and proper document flow (organising proper states, etc.). But
> that is due to not having the time to devote to setting it up robustly,
> rather than a problem with OCRmyPDF.

Ok, I'm taking note of that software.
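
For my own notes, a rough sketch of the watch-folder idea in Python. This
is my guess at the general shape, not Pancho's actual setup: the function
names, the three-folder layout and the polling interval are assumptions.
The only thing taken from OCRmyPDF itself is the basic command line,
"ocrmypdf input.pdf output.pdf". The "run" parameter exists only so the
logic can be tried without OCRmyPDF installed.

```python
import shutil
import subprocess
import time
from pathlib import Path


def ocr_new_scans(inbox, outbox, done, run=subprocess.run):
    """OCR every PDF in inbox into outbox; move finished originals to done.

    Returns the names of the files that were processed successfully.
    """
    inbox, outbox, done = Path(inbox), Path(outbox), Path(done)
    outbox.mkdir(parents=True, exist_ok=True)
    done.mkdir(parents=True, exist_ok=True)

    processed = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        # Basic OCRmyPDF invocation: input file, then output file.
        result = run(["ocrmypdf", str(src), str(dst)])
        if result.returncode == 0:
            # Move the original away so it is not picked up again.
            shutil.move(str(src), str(done / src.name))
            processed.append(src.name)
        # On failure the original stays in inbox and is retried next round.
    return processed


def watch(inbox, outbox, done, interval=30):
    """Poll inbox forever, sleeping interval seconds between rounds."""
    while True:
        ocr_new_scans(inbox, outbox, done)
        time.sleep(interval)
```

Calling watch("scans/in", "scans/out", "scans/done") would then loop
forever. A real setup would also have to make sure the scanner has
finished writing a file before touching it, which I suppose is part of the
folder-watching trouble mentioned above.
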

-- 
Cheers, Carlos.
ES🇪🇸, EU🇪🇺;