Re: PDF and OCR

Path csiph.com!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From "Carlos E.R." <robin_listas@es.invalid>
Newsgroups comp.os.linux.misc
Subject Re: PDF and OCR
Date Wed, 24 Sep 2025 22:26:10 +0200
Lines 65
Message-ID <2o0eqlxbki.ln2@Telcontar.valinor>
References <l0kknlxj6r.ln2@Telcontar.valinor> <mgr05gFenvU2@mid.individual.net> <10a68ql$16tjt$1@dont-email.me> <68c6bbc5$0$402$426a74cc@news.free.fr> <10a6rp4$1d082$5@dont-email.me> <2d9jplxvcn.ln2@Telcontar.valinor> <10a6t8d$1d082$8@dont-email.me> <4cnjplxbgm.ln2@Telcontar.valinor> <mipa1vFg9pkU2@mid.individual.net> <101fck52laaigefq5tubi6i7b0qpccmuic@4ax.com> <lNLxQ.20194$1gR1.17877@fx12.iad> <9DOdncYo-vBzE1r1nZ2dnZfqnPadnZ2d@giganews.com> <10a8mbc$1q6g1$8@dont-email.me> <YtXxQ.220966$oyoc.192887@fx15.iad> <mirchsFqpepU2@mid.individual.net> <flrnplx7qj.ln2@Telcontar.valinor> <10aca8j$2odbt$1@dont-email.me> <20250916133411.00001c32@gmail.com> <fo5pplxp2t.ln2@Telcontar.valinor> <10ae1n4$34dgl$7@dont-email.me> <mn.8a8a7e992b5784a4.127094@snitoo> <mj0ishFmo7vU5@mid.individual.net> <jn9uplxkof.ln2@Telcontar.valinor> <10aj95j$dlc4$1@dont-email.me> <tfr0qlxiup.ln2@Telcontar.valinor> <10aoi3d$1km2b$1@dont-email.me> <1rj6lrq.r8xympwcc2q1N%nospam@de-ster.demon.nl>
Mime-Version 1.0
Content-Type text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding 8bit
X-Trace individual.net 9Wqhw2TPX7PMpGvma9BhbwRKzRR2Rl5IgKS55xBk2TeeSzsJ0C
X-Orig-Path Telcontar.valinor!not-for-mail
Cancel-Lock sha1:J3qMhrTvsC0qOmGcIGwcMun8+2g= sha256:nIVufT1L47jd9w5vX5AaDXg/lSQYzeDMMVZFMztETjc=
User-Agent Mozilla Thunderbird
Content-Language es-ES, en-CA
In-Reply-To <1rj6lrq.r8xympwcc2q1N%nospam@de-ster.demon.nl>
Xref csiph.com comp.os.linux.misc:75125


On 2025-09-24 20:43, J. J. Lodder wrote:
> Pancho <Pancho.Jones@protonmail.com> wrote:
> 
>> On 9/19/25 21:36, Carlos E.R. wrote:
>>> On 2025-09-19 11:52, Bertel Lund Hansen wrote:
>>>> Den 18.09.2025 kl. 23.21 skrev Carlos E.R.:
>>>>
>>>>>> https://hardwarecomputerist.atariverse.com/media/pdf/datasheet/
>>>>>> Western%20Digital%20FD1771%20-%20Specifications.pdf
>>>>>
>>>>> There is something weird with this PDF. Text looks fuzzy, as if the
>>>>> page was scanned or photocopied. But the text is selectable and
>>>>> searchable.
>>>>
>>>> It's even weirder. If you select some of the text, it will change.
>>>> Here's the last part of the first section copied:
>>>>
>>>>         defined by a program.:
>>>>         rned heade'r asshown below:
>>>>
>>> (radial paths) in sectors (arc sections) defined by a program.:
>>> rned heade'r asshown below:
>>>
>>> Right, I see it. Should be:
>>>
>>> (radial paths) in sectors (arc sections) defined by a program-
>>> med header as shown below:
>>>
>>> So it is an OCR with the text placed in the exact location of the
>>> graphics. I knew about PDFs with OCR but never saw one. I thought it was
>>> different "files" inside the pdf.
>>>
>>> Very nice, even if not perfect.
>>>
>>> I guess we can not do this in Linux :-?
>>>
>>
>> I'm not really following the thread, but you can add OCR to PDF files
>> using Linux. I use OCRmyPDF to add OCR to PDF files from my scanner.
> 
> It isn't really OCR.
> PDF files, that is, real ones,
> contain information to put characters in their proper places.
> 
> The characters are called by name, and converted into pixels
> only for a particular rendering.
> So the information is already inherent in the pdf,
> it only has to be put into usable form.
> Linux has nothing to do with it, pdf, based on Postscript,
> is a programming language in its own right,


No, that is not what I was referring to.

I was referring to scanning paper documents and generating a PDF that
contains both the scanned image and the OCR text obtained from it, so
that selecting a portion of the image with the mouse yields the
corresponding text instead. The PDF linked at the start of this post is
a perfect example.

And specifically doing that in Linux, locally, not using an online service.
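
For what it's worth, here is a minimal sketch of how that could look locally, assuming OCRmyPDF (the tool Pancho mentions above) and a Tesseract language pack are installed. The helper names and the file names are hypothetical, made up for illustration; only the `ocrmypdf` command and its `--language` and `--skip-text` options are real. I have not checked how well it copes with a scan as rough as that datasheet.

```python
import shutil
import subprocess


def build_ocrmypdf_command(scanned_pdf, searchable_pdf, language="eng"):
    """Return the argument list for one OCRmyPDF run (nothing is executed here)."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language pack to use, e.g. "eng" or "spa"
        "--skip-text",           # leave pages that already carry a text layer untouched
        str(scanned_pdf),        # input: image-only PDF from the scanner
        str(searchable_pdf),     # output: same page images plus an invisible text layer
    ]


def add_text_layer(scanned_pdf, searchable_pdf, language="eng"):
    """Run OCRmyPDF on the local machine; fail early if it is not installed."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    command = build_ocrmypdf_command(scanned_pdf, searchable_pdf, language)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names; replace with a real scan.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan-ocr.pdf")))
```

Calling `add_text_layer("scan.pdf", "scan-ocr.pdf")` would then do the whole job offline: the output keeps the page images and places the recognised text underneath them, which is what makes mouse selection return text.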

-- 
Cheers, Carlos.
ES🇪🇸, EU🇪🇺;


