


Re: Puzzling PDF

From: Roy Smith <roy@panix.com>
Newsgroups: comp.lang.python
Subject: Re: Puzzling PDF
Date: 2014-02-16 10:33 -0500
Organization: PANIX Public Access Internet and UNIX, NYC
Message-ID: <roy-68EFBC.10333916022014@news.panix.com>
References: <mailman.7056.1392559276.18130.python-list@python.org>



In article <mailman.7056.1392559276.18130.python-list@python.org>,
 "F.R." <anthra.norell@bluewin.ch> wrote:

> Hi all,
> 
> Struggling to parse bank statements unavailable in sensible 
> data-transfer formats, I use pdftotext, which solves part of the 
> problem. The other day I encountered a strange thing, when one single 
> figure out of many was erroneously converted into letters. Adobe Reader 
> displays the figure 50'000 correctly, but pdftotext makes it into 
> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would 
> expect such a mistake from an OCR. However, the statement is not a scan, 
> but is made up of text. Because malfunctions like this put a damper on 
> the hope to ever have a reliable reader that doesn't require 
> time-consuming manual verification, I played around a bit and ended up 
> even more confused: When I lift the figure off the Adobe display (mark, 
> copy) and paste it into a Python IDLE window, it is again letters (ASCII 
> 83 and 79), while on the Adobe display it shows correctly as digits. How 
> can that be?
> 
> Frederic

Maybe it's an intentional effort to keep people from screen-scraping 
data out of the PDFs (or perhaps trace when they do).  Is it possible 
the document includes a font where those codepoints are drawn exactly 
the same as the digits they resemble?
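
If that is what is going on, the extracted text really does contain the 
letters, and any repair has to happen after extraction.  Here is a minimal 
sketch of one possible workaround, not a reliable fix.  It assumes the 
only substitutions are S for 5 and O for 0 (the only ones reported in 
this thread) and that amounts use ' as the thousands separator.  The 
names LOOKALIKES, AMOUNT_LIKE and repair_amount are made up for 
illustration.

```python
import re

# Assumed look-alike mapping; only S->5 and O->0 are reported in the thread.
LOOKALIKES = str.maketrans({"S": "5", "O": "0"})

# A token built only from digits and the look-alike letters, with at least
# one ' thousands separator followed by exactly three such characters.
AMOUNT_LIKE = re.compile(r"^[0-9SO]+(?:'[0-9SO]{3})+$")


def repair_amount(token):
    """Map look-alike letters to digits, but only in amount-shaped tokens."""
    if AMOUNT_LIKE.match(token):
        return token.translate(LOOKALIKES)
    return token


print([ord(c) for c in "SO"])    # [83, 79], the letters, not digits
print(repair_amount("SO'OOO"))   # 50'000
print(repair_amount("SOS"))      # SOS (no separator, left alone)
```

Requiring the separator keeps ordinary words from being rewritten, but 
it also means an amount below 1'000 would slip through untouched, so 
the output still needs checking.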

Keep in mind that PDF is not a data transmission format, it's a document 
format.  When you try to scrape data out of a PDF, you've made a pact 
with the devil.

Unclear what any of this has to do with Python.  Maybe the tie-in is 
that in the old Snake video game, the snake was drawn as Soooooo?

Anyway, it's S as in Sierra, and O as in Oscar.



