From: "Carlos E.R."
Newsgroups: comp.os.linux.misc
Subject: Re: PDF and OCR
Date: Thu, 25 Sep 2025 14:09:43 +0200

On 2025-09-25 10:33, Pancho wrote:
> On 9/24/25 19:43, J. J. Lodder wrote:
>> Pancho wrote:
>>
>>> On 9/19/25 21:36, Carlos E.R. wrote:
>>>> On 2025-09-19 11:52, Bertel Lund Hansen wrote:
>>>>> On 18.09.2025 at 23.21, Carlos E.R. wrote:
>>>>>
>>>>>>> https://hardwarecomputerist.atariverse.com/media/pdf/datasheet/
>>>>>>> Western%20Digital%20FD1771%20-%20Specifications.pdf
>>>>>>
>>>>>> There is something weird with this PDF. Text looks fuzzy, as if the
>>>>>> page was scanned or photocopied. But the text is selectable and
>>>>>> searchable.
>>>>>
>>>>> It's even weirder. If you select some of the text, it will change.
>>>>> Here's the last part of the first section copied:
>>>>>
>>>>>         defined by a program.:
>>>>>         rned heade'r asshown below:
>>>>>
>>>> (radial paths) in sectors (arc sections) defined by a program.:
>>>> rned heade'r asshown below:
>>>>
>>>> Right, I see it. Should be:
>>>>
>>>> (radial paths) in sectors (arc sections) defined by a program-
>>>> med header as shown below:
>>>>
>>>> So it is an OCR with the text placed in the exact location of the
>>>> graphics. I knew about PDFs with OCR but never saw one. I thought it was
>>>> different "files" inside the pdf.
>>>>
>>>> Very nice, even if not perfect.
>>>>
>>>> I guess we can not do this in Linux :-?
>>>>
>>>
>>> I'm not really following the thread, but you can add OCR to PDF files
>>> using Linux. I use OCRmyPDF to add OCR to PDF files from my scanner.
>>
>> It isn't really OCR.
>> PDF files, that is, real ones,
>> contain information to put characters in their proper places.
>>
>> The characters are called by name, and converted into pixels
>> only for a particular rendering.
>> So the information is already inherent in the pdf,
>> it only has to be put into usable form.
>> Linux has nothing to do with it, pdf, based on Postscript,
>> is a programming language in its own right,
>>
>
> PDF files can be constructed from text, as you described, but they can
> also be constructed from images, jpeg, tiff, etc.
>
> When I scan documents, it creates tiff or PDF files, there is no OCR,
> the files are effectively images, pictures. The problem is that you
> can't perform a text search on these scanned documents.
>
> OCR can take these document pictures and create a text file from them.
> The way this is implemented in PDF OCR is that the original picture file
> is kept, and a new text file is added to it. Each word in the text file
> is then mapped to the coordinate position of the word in the image file
> it was created from.
> Hence, it appears to the user as if the image of a word is mapped to
> the text version of the word. The positioning isn't perfect, like you
> would see with a genuine text document, but it is good enough to use.
>
> As I said, OCRmyPDF works on Linux. In Debian (or PI OS) it can be
> installed via apt. I've used it for years. My scanner writes scans to a
> network folder. A docker service runs on a PI4 watching the folders.
> When it sees a new file it applies OCRmyPDF to it to create a searchable
> PDF file which it moves to another folder.
>
> This process isn't perfect, occasionally OCRmyPDF crashes, but it is
> pretty good. Most of the problems I had, were to do with watching the
> folder, and proper document flow, organising proper states etc. But that
> is due to not having the time to devote to setting it up robustly,
> rather than a problem with OCRmyPDF.

Ok, I'm taking note of that software.

-- 
Cheers,
       Carlos.
       ES🇪🇸, EU🇪🇺;
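
For anyone who wants to try the watched-folder workflow Pancho describes, here is a minimal Python sketch. It is not his actual setup. It assumes the `ocrmypdf` command is on the PATH and uses only its basic `ocrmypdf input.pdf output.pdf` form; the folder names, the `process_new_files` and `watch` helpers and the polling interval are all made up for illustration.

```python
import subprocess
import time
from pathlib import Path


def build_command(src: Path, dst: Path) -> list[str]:
    # Basic OCRmyPDF invocation: input file first, output file second.
    return ["ocrmypdf", str(src), str(dst)]


def process_new_files(inbox: Path, outbox: Path, seen: set[str],
                      run=subprocess.run) -> list[Path]:
    """OCR every PDF in `inbox` not handled yet; return the outputs that succeeded."""
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        if src.name in seen:
            continue
        dst = outbox / src.name
        result = run(build_command(src, dst))
        # Remember the file even on failure, so one bad scan is not retried forever.
        seen.add(src.name)
        if getattr(result, "returncode", 1) == 0:
            done.append(dst)
    return done


def watch(inbox: Path, outbox: Path, interval_s: float = 10.0) -> None:
    """Poll `inbox` forever (hypothetical paths and interval)."""
    seen: set[str] = set()
    while True:
        for out in process_new_files(inbox, outbox, seen):
            print(f"searchable PDF written: {out}")
        time.sleep(interval_s)
```

Calling `watch(Path("scans/in"), Path("scans/ocr"))` would start the loop. The in-memory `seen` set is the simplest possible state tracking and is lost on restart, which is exactly the kind of document-flow robustness problem mentioned in the quoted text.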