Puzzling PDF

Started by	"F.R." <anthra.norell@bluewin.ch>
First post	2014-02-16 15:00 +0100
Last post	2014-02-16 18:59 +0000
Articles	3 — 3 participants

  Puzzling PDF "F.R." <anthra.norell@bluewin.ch> - 2014-02-16 15:00 +0100
    Re: Puzzling PDF Roy Smith <roy@panix.com> - 2014-02-16 10:33 -0500
      Re: Puzzling PDF Alister <alister.ware@ntlworld.com> - 2014-02-16 18:59 +0000

#66537 — Puzzling PDF

From	"F.R." <anthra.norell@bluewin.ch>
Date	2014-02-16 15:00 +0100
Subject	Puzzling PDF
Message-ID	<mailman.7056.1392559276.18130.python-list@python.org>

Hi all,

Struggling to parse bank statements unavailable in sensible 
data-transfer formats, I use pdftotext, which solves part of the 
problem. The other day I encountered a strange thing: one single 
figure out of many was erroneously converted into letters. Adobe Reader 
displays the figure 50'000 correctly, but pdftotext turns it into 
"SO'OOO" (the letters "S" as in Susan and "O" as in Otto). One would 
expect such a mistake from OCR. However, the statement is not a scan 
but is made up of text. Because malfunctions like this put a damper on 
the hope of ever having a reliable reader that doesn't require 
time-consuming manual verification, I played around a bit and ended up 
even more confused: when I lift the figure off the Adobe display (mark, 
copy) and paste it into a Python IDLE window, it is again letters (ASCII 
83 and 79), even though the Adobe display shows it correctly as digits. 
How can that be?
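
For anyone who wants to repeat the check, the code points of the pasted 
text can be listed directly. This is only a minimal sketch; the variable 
name `pasted` is an illustration, and the string is typed in here rather 
than read from the PDF.

```python
# The figure as it arrives from pdftotext or from copy and paste.
pasted = "SO'OOO"

# List the code point of every character in the pasted string.
codes = [ord(ch) for ch in pasted]
print(codes)  # [83, 79, 39, 79, 79, 79]

# Count how many characters are real digits (none, in this case).
digit_count = sum(ch.isdigit() for ch in pasted)
print(digit_count)  # 0
```

Code points 83 and 79 are the letters "S" and "O"; the digits "5" and "0" 
would be 53 and 48.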

Frederic

#66546

From	Roy Smith <roy@panix.com>
Date	2014-02-16 10:33 -0500
Message-ID	<roy-68EFBC.10333916022014@news.panix.com>
In reply to	#66537

In article <mailman.7056.1392559276.18130.python-list@python.org>,
 "F.R." <anthra.norell@bluewin.ch> wrote:

> Hi all,
> 
> Struggling to parse bank statements unavailable in sensible 
> data-transfer formats, I use pdftotext, which solves part of the 
> problem. The other day I encountered a strange thing, when one single 
> figure out of many erroneously converted into letters. Adobe Reader 
> displays the figure 50'000 correctly, but pdftotext makes it into 
> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would 
> expect such a mistake from an OCR. However, the statement is not a scan, 
> but is made up of text. Because malfunctions like this put a damper on 
> the hope to ever have a reliable reader that doesn't require 
> time-consuming manual verification, I played around a bit and ended up 
> even more confused: When I lift the figure off the Adobe display (mark, 
> copy) and paste it into a Python IDLE window, it is again letters (ascii 
> 83 and 79), when on the Adobe display it shows correctly as digits. How 
> can that be?
> 
> Frederic

Maybe it's an intentional effort to keep people from screen-scraping 
data out of the PDFs (or perhaps trace when they do).  Is it possible 
the document includes a font where those codepoints are drawn exactly 
the same as the digits they resemble?
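
If that guess is right, the extracted text could be patched up after the 
fact. What follows is a rough sketch, not a reliable fix: the mapping 
S -> 5 and O -> 0 is assumed from the single example in this thread, the 
pattern only covers amounts grouped with apostrophes as in 50'000, and 
the names `LOOKALIKES`, `AMOUNT` and `repair_amounts` are made up for 
illustration.

```python
import re

# Assumed lookalike mapping, taken from the one example in the thread.
LOOKALIKES = str.maketrans({"S": "5", "O": "0"})

# Matches apostrophe-grouped amounts whose "digits" may be lookalike letters,
# e.g. SO'OOO or 1'2OO. Plain words such as SOLD are not matched because
# at least one apostrophe group is required.
AMOUNT = re.compile(r"\b[0-9SO]{1,3}(?:'[0-9SO]{3})+\b")


def repair_amounts(text):
    """Replace lookalike letters with digits inside amount-shaped tokens only."""
    return AMOUNT.sub(lambda m: m.group(0).translate(LOOKALIKES), text)


print(repair_amounts("Balance SO'OOO CHF"))  # Balance 50'000 CHF
```

Restricting the substitution to amount-shaped tokens keeps ordinary 
words containing S or O untouched, but the result still needs checking 
against the displayed statement.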

Keep in mind that PDF is not a data transmission format, it's a document 
format.  When you try to scrape data out of a PDF, you've made a pact 
with the devil.

Unclear what any of this has to do with Python.  Maybe the tie-in is 
that in the old Snake video game, the snake was drawn as Soooooo?

Anyway, it's S as in Sierra, and O as in Oscar.

#66562

From	Alister <alister.ware@ntlworld.com>
Date	2014-02-16 18:59 +0000
Message-ID	<tU7Mu.3501$BM7.662@fx18.am4>
In reply to	#66546

On Sun, 16 Feb 2014 10:33:39 -0500, Roy Smith wrote:

> In article <mailman.7056.1392559276.18130.python-list@python.org>,
>  "F.R." <anthra.norell@bluewin.ch> wrote:
> 
>> Hi all,
>> 
>> Struggling to parse bank statements unavailable in sensible
>> data-transfer formats, I use pdftotext, which solves part of the
>> problem. The other day I encountered a strange thing, when one single
>> figure out of many erroneously converted into letters. Adobe Reader
>> displays the figure 50'000 correctly, but pdftotext makes it into
>> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
>> expect such a mistake from an OCR. However, the statement is not a
>> scan,
>> but is made up of text. Because malfunctions like this put a damper on
>> the hope to ever have a reliable reader that doesn't require
>> time-consuming manual verification, I played around a bit and ended up
>> even more confused: When I lift the figure off the Adobe display (mark,
>> copy) and paste it into a Python IDLE window, it is again letters
>> (ascii 83 and 79), when on the Adobe display it shows correctly as
>> digits. How can that be?
>> 
>> Frederic
> 
> Maybe it's an intentional effort to keep people from screen-scraping
> data out of the PDFs (or perhaps trace when they do).  Is it possible
> the document includes a font where those codepoints are drawn exactly
> the same as the digits they resemble?

This seems to me the most likely explanation, although I would like 
to know why.
Assuming these are your bank statements, I would change banks.

Mine are available in a variety of formats (QIF & CSV) so that they can 
be used in my own accounting programs if I desire.

I see no reason why the bank would want to prevent me from accessing 
this data.



-- 
Without life, Biology itself would be impossible.
