| Started by | "F.R." <anthra.norell@bluewin.ch> |
|---|---|
| First post | 2014-02-16 15:00 +0100 |
| Last post | 2014-02-16 18:59 +0000 |
| Articles | 3 — 3 participants |
Puzzling PDF "F.R." <anthra.norell@bluewin.ch> - 2014-02-16 15:00 +0100
Re: Puzzling PDF Roy Smith <roy@panix.com> - 2014-02-16 10:33 -0500
Re: Puzzling PDF Alister <alister.ware@ntlworld.com> - 2014-02-16 18:59 +0000
| From | "F.R." <anthra.norell@bluewin.ch> |
|---|---|
| Date | 2014-02-16 15:00 +0100 |
| Subject | Puzzling PDF |
| Message-ID | <mailman.7056.1392559276.18130.python-list@python.org> |
Hi all,

Struggling to parse bank statements that are unavailable in sensible data-transfer formats, I use pdftotext, which solves part of the problem. The other day I encountered a strange thing: one single figure out of many was erroneously converted into letters. Adobe Reader displays the figure 50'000 correctly, but pdftotext turns it into "SO'OOO" (the letters "S" as in Susan and "O" as in Otto). One would expect such a mistake from OCR. However, the statement is not a scan; it is made up of text.

Because malfunctions like this put a damper on the hope of ever having a reliable reader that doesn't require time-consuming manual verification, I played around a bit and ended up even more confused: when I lift the figure off the Adobe display (mark, copy) and paste it into a Python IDLE window, it is again letters (ASCII 83 and 79), even though the Adobe display shows it correctly as digits. How can that be?

Frederic
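The check described in the post (pasting the copied figure into Python and looking at the character codes) can be sketched as follows. This is only an illustration, assuming the clipboard delivers exactly the string `SO'OOO` reported above; `describe` is a hypothetical helper name, not part of any library.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Assumption: this is what the copy/paste produced, as reported in the post.
pasted = "SO'OOO"

for ch, code, name in describe(pasted):
    # Letters S and O report code points 83 and 79; the digits 5 and 0
    # would report 53 and 48 instead.
    print(f"{ch!r} {code:3d} {name}")
```

If the output lists Latin capital letters rather than digits, the PDF really stores letter codes for that figure, and the digit shapes come only from how the font draws them.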
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-02-16 10:33 -0500 |
| Message-ID | <roy-68EFBC.10333916022014@news.panix.com> |
| In reply to | #66537 |
In article <mailman.7056.1392559276.18130.python-list@python.org>,
"F.R." <anthra.norell@bluewin.ch> wrote:

> Hi all,
>
> Struggling to parse bank statements unavailable in sensible
> data-transfer formats, I use pdftotext, which solves part of the
> problem. The other day I encountered a strange thing, when one single
> figure out of many erroneously converted into letters. Adobe Reader
> displays the figure 50'000 correctly, but pdftotext makes it into
> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
> expect such a mistake from an OCR. However, the statement is not a scan,
> but is made up of text. Because malfunctions like this put a damper on
> the hope to ever have a reliable reader that doesn't require
> time-consuming manual verification, I played around a bit and ended up
> even more confused: When I lift the figure off the Adobe display (mark,
> copy) and paste it into a Python IDLE window, it is again letters (ascii
> 83 and 79), when on the Adobe display it shows correctly as digits. How
> can that be?
>
> Frederic

Maybe it's an intentional effort to keep people from screen-scraping data out of the PDFs (or perhaps to trace when they do). Is it possible the document includes a font where those codepoints are drawn exactly the same as the digits they resemble?

Keep in mind that PDF is not a data transmission format, it's a document format. When you try to scrape data out of a PDF, you've made a pact with the devil.

It's unclear what any of this has to do with Python. Maybe the tie-in is that in the old Snake video game, the snake was drawn as Soooooo?

Anyway, it's S as in Sierra, and O as in Oscar.
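If the cause is indeed a font that draws "S" and "O" like "5" and "0", one possible workaround on the Python side is to map those letters back to digits, but only inside tokens that look like apostrophe-grouped amounts. The following is a rough heuristic sketch, not a reliable fix: the names `LOOKALIKES`, `AMOUNT` and `fix_amounts` are made up for illustration, the S/O mapping is taken from the one case reported in this thread, and any repaired figure would still need checking against the displayed statement.

```python
import re

# Assumption: only these two substitutions occur, as in the reported case.
LOOKALIKES = str.maketrans({"S": "5", "O": "0"})

# One to three characters, then one or more apostrophe-separated groups of
# exactly three, where every character is a digit or a suspect letter.
# Requiring a thousands separator keeps ordinary words such as "SOS" intact.
AMOUNT = re.compile(r"\b[0-9SO]{1,3}(?:'[0-9SO]{3})+\b")


def fix_amounts(text):
    """Replace lookalike letters with digits inside amount-shaped tokens."""
    return AMOUNT.sub(lambda m: m.group(0).translate(LOOKALIKES), text)


print(fix_amounts("Balance SO'OOO CHF"))
```

Restricting the substitution to amount-shaped tokens is a deliberate choice: a blanket replacement of every "S" and "O" would corrupt names and descriptions elsewhere in the statement.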
| From | Alister <alister.ware@ntlworld.com> |
|---|---|
| Date | 2014-02-16 18:59 +0000 |
| Message-ID | <tU7Mu.3501$BM7.662@fx18.am4> |
| In reply to | #66546 |
On Sun, 16 Feb 2014 10:33:39 -0500, Roy Smith wrote:

> In article <mailman.7056.1392559276.18130.python-list@python.org>,
> "F.R." <anthra.norell@bluewin.ch> wrote:
>
>> Hi all,
>>
>> Struggling to parse bank statements unavailable in sensible
>> data-transfer formats, I use pdftotext, which solves part of the
>> problem. The other day I encountered a strange thing, when one single
>> figure out of many erroneously converted into letters. Adobe Reader
>> displays the figure 50'000 correctly, but pdftotext makes it into
>> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
>> expect such a mistake from an OCR. However, the statement is not a
>> scan,
>> but is made up of text. Because malfunctions like this put a damper on
>> the hope to ever have a reliable reader that doesn't require
>> time-consuming manual verification, I played around a bit and ended up
>> even more confused: When I lift the figure off the Adobe display (mark,
>> copy) and paste it into a Python IDLE window, it is again letters
>> (ascii 83 and 79), when on the Adobe display it shows correctly as
>> digits. How can that be?
>>
>> Frederic
>
> Maybe it's an intentional effort to keep people from screen-scraping
> data out of the PDFs (or perhaps trace when they do). Is it possible
> the document includes a font where those codepoints are drawn exactly
> the same as the digits they resemble?

This seems to be the most likely explanation to me, although I would like to know why.

Assuming these are your bank statements, I would change banks. Mine are available in a variety of formats (QIF & CSV) so that they can be used in my own accounting programs if I desire. I see no reason why the bank would want to prevent me from accessing this data.

--
Without life, Biology itself would be impossible.