| From | "Carlos E.R." <robin_listas@es.invalid> |
|---|---|
| Newsgroups | comp.os.linux.misc |
| Subject | Re: PDF and OCR |
| Date | 2025-09-25 14:09 +0200 |
| Message-ID | <71ofqlxjpt.ln2@Telcontar.valinor> |
| References | (22 earlier) <10aj95j$dlc4$1@dont-email.me> <tfr0qlxiup.ln2@Telcontar.valinor> <10aoi3d$1km2b$1@dont-email.me> <1rj6lrq.r8xympwcc2q1N%nospam@de-ster.demon.nl> <10b2uoi$67lv$1@dont-email.me> |
On 2025-09-25 10:33, Pancho wrote:
> On 9/24/25 19:43, J. J. Lodder wrote:
>> Pancho <Pancho.Jones@protonmail.com> wrote:
>>
>>> On 9/19/25 21:36, Carlos E.R. wrote:
>>>> On 2025-09-19 11:52, Bertel Lund Hansen wrote:
>>>>> On 18.09.2025 at 23:21, Carlos E.R. wrote:
>>>>>
>>>>>>> https://hardwarecomputerist.atariverse.com/media/pdf/datasheet/Western%20Digital%20FD1771%20-%20Specifications.pdf
>>>>>>
>>>>>> There is something weird with this PDF. Text looks fuzzy, as if the
>>>>>> page was scanned or photocopied. But the text is selectable and
>>>>>> searchable.
>>>>>
>>>>> It's even weirder. If you select some of the text, it will change.
>>>>> Here's the last part of the first section copied:
>>>>>
>>>>> defined by a program.:
>>>>> rned heade'r asshown below:
>>>>>
>>>> (radial paths) in sectors (arc sections) defined by a program.:
>>>> rned heade'r asshown below:
>>>>
>>>> Right, I see it. Should be:
>>>>
>>>> (radial paths) in sectors (arc sections) defined by a program-
>>>> med header as shown below:
>>>>
>>>> So it is an OCR with the text placed in the exact location of the
>>>> graphics. I knew about PDFs with OCR but never saw one. I thought it
>>>> was different "files" inside the pdf.
>>>>
>>>> Very nice, even if not perfect.
>>>>
>>>> I guess we can not do this in Linux :-?
>>>>
>>>
>>> I'm not really following the thread, but you can add OCR to PDF files
>>> using Linux. I use OCRmyPDF to add OCR to PDF files from my scanner.
>>
>> It isn't really OCR.
>> PDF files, that is, real ones,
>> contain information to put characters in their proper places.
>>
>> The characters are called by name, and converted into pixels
>> only for a particular rendering.
>> So the information is already inherent in the pdf,
>> it only has to be put into usable form.
>> Linux has nothing to do with it, pdf, based on Postscript,
>> is a programming language in its own right,
>>
>
> PDF files can be constructed from text, as you described, but they can
> also be constructed from images, jpeg, tiff, etc.
>
> When I scan documents, it creates tiff or PDF files, there is no OCR,
> the files are effectively images, pictures. The problem is that you
> can't perform a text search on these scanned documents.
>
> OCR can take these document pictures and create a text file from them.
> The way this is implemented in PDF OCR is that the original picture file
> is kept, and a new text file is added to it. Each word in the text file
> is then mapped to the coordinate position of the word in the image file
> it was created from. Hence, it appears to the user as if the image of a
> word is mapped to the text version of the word. The positioning isn't
> perfect, like you would see with a genuine text document, but it is good
> enough to use.
>
> As I said, OCRmyPDF works on Linux
> <https://github.com/ocrmypdf/OCRmyPDF>. In Debian (or PI OS) it can be
> installed via apt. I've used it for years. My scanner writes scans to a
> network folder. A docker service runs on a PI4 watching the folders.
> When it sees a new file it applies OCRmyPDF to it to create a searchable
> PDF file which it moves to another folder.
>
> This process isn't perfect, occasionally OCRmyPDF crashes, but it is
> pretty good. Most of the problems I had, were to do with watching the
> folder, and proper document flow, organising proper states etc. But that
> is due to not having the time to devote to setting it up robustly,
> rather than a problem with OCRmyPDF.

Ok, I'm taking note of that software.

--
Cheers, Carlos.
ES🇪🇸, EU🇪🇺;
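
For reference, the watch-folder flow described in the quoted text could look roughly like the sketch below. This is not Pancho's actual setup, only a minimal polling loop under my own assumptions: the function names, the three folders (inbox, outbox, failed) and the polling interval are all made up for illustration. It shells out to the `ocrmypdf` command with `--skip-text`, which leaves pages that already carry a text layer alone.

```python
import shutil
import subprocess
import time
from pathlib import Path


def build_command(src, dst):
    """Command line for one file: ocrmypdf [options] input output."""
    # --skip-text: do not OCR pages that already contain text.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def pending_pdfs(inbox):
    """PDF files currently waiting in the inbox, in a stable order."""
    return sorted(p for p in inbox.glob("*.pdf") if p.is_file())


def process_once(inbox, outbox, failed, run=subprocess.run):
    """Run OCR on every waiting PDF; return the searchable files produced.

    `run` is injectable so the flow can be exercised without ocrmypdf
    installed. The folder layout is an assumption, not from the thread.
    """
    done = []
    for src in pending_pdfs(inbox):
        dst = outbox / src.name
        result = run(build_command(src, dst))
        if result.returncode == 0:
            src.unlink()              # scan handled, remove it from the inbox
            done.append(dst)
        else:
            dst.unlink(missing_ok=True)                    # drop partial output
            shutil.move(str(src), str(failed / src.name))  # park for inspection
    return done


def watch(inbox, outbox, failed, interval=10.0):
    """Poll the inbox forever, sleeping `interval` seconds between passes."""
    for folder in (inbox, outbox, failed):
        folder.mkdir(parents=True, exist_ok=True)
    while True:
        process_once(inbox, outbox, failed)
        time.sleep(interval)
```

Calling `watch(Path("scans/in"), Path("scans/out"), Path("scans/failed"))` (hypothetical paths) would start the loop. Moving failed scans aside keeps one crash from blocking the queue. A sturdier version would also wait until a file has stopped growing before touching it, since the scanner may still be writing it, which is probably the kind of folder-watching trouble mentioned above.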