| From | "Carlos E.R." <robin_listas@es.invalid> |
|---|---|
| Newsgroups | comp.os.linux.misc |
| Subject | Re: PDF and OCR |
| Date | 2025-09-24 22:26 +0200 |
| Message-ID | <2o0eqlxbki.ln2@Telcontar.valinor> |
| References | (22 earlier) <jn9uplxkof.ln2@Telcontar.valinor> <10aj95j$dlc4$1@dont-email.me> <tfr0qlxiup.ln2@Telcontar.valinor> <10aoi3d$1km2b$1@dont-email.me> <1rj6lrq.r8xympwcc2q1N%nospam@de-ster.demon.nl> |
On 2025-09-24 20:43, J. J. Lodder wrote:
> Pancho <Pancho.Jones@protonmail.com> wrote:
>
>> On 9/19/25 21:36, Carlos E.R. wrote:
>>> On 2025-09-19 11:52, Bertel Lund Hansen wrote:
>>>> On 18.09.2025 at 23:21, Carlos E.R. wrote:
>>>>
>>>>>> https://hardwarecomputerist.atariverse.com/media/pdf/datasheet/
>>>>>> Western%20Digital%20FD1771%20-%20Specifications.pdf
>>>>>
>>>>> There is something weird with this PDF. Text looks fuzzy, as if the
>>>>> page was scanned or photocopied. But the text is selectable and
>>>>> searchable.
>>>>
>>>> It's even weirder. If you select some of the text, it will change.
>>>> Here's the last part of the first section copied:
>>>>
>>>> defined by a program.:
>>>> rned heade'r asshown below:
>>>>
>>> (radial paths) in sectors (arc sections) defined by a program.:
>>> rned heade'r asshown below:
>>>
>>> Right, I see it. Should be:
>>>
>>> (radial paths) in sectors (arc sections) defined by a program-
>>> med header as shown below:
>>>
>>> So it is an OCR with the text placed in the exact location of the
>>> graphics. I knew about PDFs with OCR but never saw one. I thought it was
>>> different "files" inside the pdf.
>>>
>>> Very nice, even if not perfect.
>>>
>>> I guess we can not do this in Linux :-?
>>>
>>
>> I'm not really following the thread, but you can add OCR to PDF files
>> using Linux. I use OCRmyPDF to add OCR to PDF files from my scanner.
>
> It isn't really OCR.
> PDF files, that is, real ones,
> contain information to put characters in their proper places.
>
> The characters are called by name, and converted into pixels
> only for a particular rendering.
> So the information is already inherent in the pdf,
> it only has to be put into usable form.
> Linux has nothing to do with it, pdf, based on Postscript,
> is a programming language in its own right,

No, that is not what I was referring to.
I was referring to scanning papers and generating a PDF that contains
both the scanned image and the OCR text obtained from it, so that we can
select portions of the image with the mouse and get the corresponding
text instead. The PDF linked at the start of this post is a perfect
example.

And specifically doing that in Linux, locally, not using an online
service.

-- 
Cheers,
       Carlos.
ES🇪🇸, EU🇪🇺;
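
As a rough illustration of the local workflow described above, the OCRmyPDF tool Pancho mentioned takes a scanned PDF and writes a new PDF that keeps the page image and adds a selectable text layer on top of it, running entirely on the local machine. The sketch below is a minimal Python wrapper around its command line, not a recommended setup: it assumes `ocrmypdf` is installed and on the PATH, the helper names `build_ocr_command` and `add_text_layer` are made up for this example, and the `--deskew` option is just one plausible choice for scanner output.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(scan: Path, output: Path, language: str = "eng") -> list[str]:
    """Build an ocrmypdf command line (hypothetical helper for this sketch)."""
    return [
        "ocrmypdf",
        "-l", language,   # Tesseract language code used for recognition
        "--deskew",       # straighten slightly rotated scans before OCR
        str(scan),        # input: image-only PDF from the scanner
        str(output),      # output: same images plus a selectable text layer
    ]


def add_text_layer(scan: Path, output: Path, language: str = "eng") -> None:
    """Run ocrmypdf locally; raises if the tool is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(build_ocr_command(scan, output, language), check=True)
```

With placeholder file names, `add_text_layer(Path("scan.pdf"), Path("scan-ocr.pdf"))` would produce a second file in which text can be selected and searched over the scanned image, with the same kind of recognition errors seen in the datasheet quoted above.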