| From | Alister <alister.ware@ntlworld.com> |
|---|---|
| Subject | Re: Puzzling PDF |
| Newsgroups | comp.lang.python |
| References | <mailman.7056.1392559276.18130.python-list@python.org> <roy-68EFBC.10333916022014@news.panix.com> |
| Message-ID | <tU7Mu.3501$BM7.662@fx18.am4> |
| Organization | virginmedia.com |
| Date | 2014-02-16 18:59 +0000 |
On Sun, 16 Feb 2014 10:33:39 -0500, Roy Smith wrote:

> In article <mailman.7056.1392559276.18130.python-list@python.org>,
> "F.R." <anthra.norell@bluewin.ch> wrote:
>
>> Hi all,
>>
>> Struggling to parse bank statements unavailable in sensible
>> data-transfer formats, I use pdftotext, which solves part of the
>> problem. The other day I encountered a strange thing, when one single
>> figure out of many erroneously converted into letters. Adobe Reader
>> displays the figure 50'000 correctly, but pdftotext makes it into
>> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
>> expect such a mistake from an OCR. However, the statement is not a
>> scan, but is made up of text. Because malfunctions like this put a
>> damper on the hope to ever have a reliable reader that doesn't require
>> time-consuming manual verification, I played around a bit and ended up
>> even more confused: When I lift the figure off the Adobe display (mark,
>> copy) and paste it into a Python IDLE window, it is again letters
>> (ascii 83 and 79), when on the Adobe display it shows correctly as
>> digits. How can that be?
>>
>> Frederic
>
> Maybe it's an intentional effort to keep people from screen-scraping
> data out of the PDFs (or perhaps trace when they do). Is it possible
> the document includes a font where those codepoints are drawn exactly
> the same as the digits they resemble?

This seems the most likely explanation to me, although I would like to
know why the bank would do it.

Assuming these are your bank statements, I would change banks. Mine are
available in a variety of formats (QIF & CSV) so that they can be used
in my own accounting programs if I desire. I see no reason why the bank
would want to prevent me from accessing this data.

--
Without life, Biology itself would be impossible.
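If the cause really is a font that draws "S" and "O" like "5" and "0", the text layer will keep returning letters whatever tool extracts it. As a rough sketch only, something like the following could flag and repair such figures after pdftotext has run. The mapping (only S and O, as reported in this thread), the amount pattern (apostrophe as thousands separator, optional two decimals), and the function names are all my assumptions, not anything taken from the bank's PDFs or from pdftotext itself.

```python
import re

# Assumed mapping: only the two substitutions reported in this thread.
LOOKALIKES = str.maketrans({"S": "5", "O": "0"})

# Assumed amount shape: 1-3 leading characters, then one or more
# apostrophe-separated groups of three, then an optional ".dd" part.
# The lookarounds keep the pattern from firing inside ordinary words.
AMOUNT = re.compile(
    r"(?<![\w'])[0-9SO]{1,3}(?:'[0-9SO]{3})+(?:\.[0-9SO]{2})?(?![\w'])"
)


def suspicious_amounts(text):
    """Return amount-shaped tokens that contain look-alike letters."""
    return [m.group(0) for m in AMOUNT.finditer(text)
            if re.search(r"[SO]", m.group(0))]


def repair_amounts(text):
    """Replace look-alike letters with digits inside amount-shaped tokens."""
    return AMOUNT.sub(lambda m: m.group(0).translate(LOOKALIKES), text)


if __name__ == "__main__":
    line = "Balance SO'OOO CHF, fee 1'234.50"
    print(suspicious_amounts(line))
    print(repair_amounts(line))
```

Listing the suspicious tokens first, rather than silently repairing them, still leaves a manual check on the flagged figures, but only on those and not on the whole statement.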
Puzzling PDF "F.R." <anthra.norell@bluewin.ch> - 2014-02-16 15:00 +0100
Re: Puzzling PDF Roy Smith <roy@panix.com> - 2014-02-16 10:33 -0500
Re: Puzzling PDF Alister <alister.ware@ntlworld.com> - 2014-02-16 18:59 +0000