| Path | csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!eu.feeder.erje.net!newsfeed.xs4all.nl!newsfeed2a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail |
|---|---|
| Return-Path | <python-python-list@m.gmane.org> |
| X-Original-To | python-list@python.org |
| Delivered-To | python-list@mail.python.org |
| X-Spam-Status | OK 0.034 |
| X-Injected-Via-Gmane | http://gmane.org/ |
| To | python-list@python.org |
| From | Emile van Sebille <emile@fenx.com> |
| Subject | Re: Puzzling PDF |
| Date | Sun, 16 Feb 2014 08:29:11 -0800 |
| References | <5300C460.8000702@bluewin.ch> |
| Mime-Version | 1.0 |
| Content-Type | text/plain; charset=ISO-8859-1; format=flowed |
| Content-Transfer-Encoding | 7bit |
| X-Gmane-NNTP-Posting-Host | 12.184.110.78 |
| User-Agent | Mozilla/5.0 (Windows NT 6.2; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 |
| In-Reply-To | <5300C460.8000702@bluewin.ch> |
| X-BeenThere | python-list@python.org |
| X-Mailman-Version | 2.1.15 |
| Precedence | list |
| List-Id | General discussion list for the Python programming language <python-list.python.org> |
| List-Unsubscribe | <https://mail.python.org/mailman/options/python-list>, <mailto:python-list-request@python.org?subject=unsubscribe> |
| List-Archive | <http://mail.python.org/pipermail/python-list/> |
| List-Post | <mailto:python-list@python.org> |
| List-Help | <mailto:python-list-request@python.org?subject=help> |
| List-Subscribe | <https://mail.python.org/mailman/listinfo/python-list>, <mailto:python-list-request@python.org?subject=subscribe> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.7064.1392568170.18130.python-list@python.org> |
| Lines | 42 |
| NNTP-Posting-Host | 2001:888:2000:d::a6 |
| X-Trace | 1392568170 news.xs4all.nl 2936 [2001:888:2000:d::a6]:36078 |
| X-Complaints-To | abuse@xs4all.nl |
| Xref | csiph.com comp.lang.python:66553 |
On 2/16/2014 6:00 AM, F.R. wrote:
> Hi all,
>
> Struggling to parse bank statements unavailable in sensible
> data-transfer formats, I use pdftotext, which solves part of the
> problem. The other day I encountered a strange thing, when one single
> figure out of many erroneously converted into letters. Adobe Reader
> displays the figure 50'000 correctly, but pdftotext makes it into
> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
> expect such a mistake from an OCR. However, the statement is not a scan,
> but is made up of text. Because malfunctions like this put a damper on
> the hope to ever have a reliable reader that doesn't require
> time-consuming manual verification, I played around a bit and ended up
> even more confused: When I lift the figure off the Adobe display (mark,
> copy) and paste it into a Python IDLE window, it is again letters (ascii
> 83 and 79), when on the Adobe display it shows correctly as digits. How
> can that be?
>

I've also gotten inconsistent results using various PDF-to-text converters [1], but getting an explanation for pdftotext's failings here isn't likely to happen.

I'd first try Google Docs' on-line conversion tool to see if you get better results. If you're lucky it'll do the job and you'll have confirmation that better tools exist.

Otherwise, I'd look for an alternate way of getting the bank info than working from the PDF statement. At one site I've scripted Firefox to access the bank's web-based inquiry to retrieve the new activity overnight and use that to complete a daily bank reconciliation.

HTH,

Emile

[1] I wrote my own once to get data out of a particularly gnarly EDI specification PDF.
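
Until a more reliable converter turns up, one stopgap for the kind of misread figure quoted above is to flag tokens that are shaped like an amount but contain look-alike letters, so that only those need checking by hand. The following is a minimal sketch, not anything from the thread: the function name `suspicious_figures`, the look-alike table, and the assumption that amounts use apostrophes as thousands separators (as in 50'000) are all illustrative choices.

```python
import re

# Letters that can stand in for digits when text extraction goes wrong.
# This table is an assumption for illustration; extend it as needed.
LOOKALIKES = {"S": "5", "O": "0", "I": "1", "l": "1", "B": "8"}

_CHARS = "0-9" + "".join(LOOKALIKES)

# Assumed amount shape: 1-3 characters, then one or more groups of three
# separated by apostrophes, with an optional two-place decimal part.
_FIGURE = re.compile(
    rf"(?<![\w'])[{_CHARS}]{{1,3}}(?:'[{_CHARS}]{{3}})+(?:\.[{_CHARS}]{{2}})?(?![\w'])"
)


def suspicious_figures(text):
    """Return (token, proposed_digits) pairs for amount-shaped tokens
    that contain at least one look-alike letter."""
    found = []
    for match in _FIGURE.finditer(text):
        token = match.group(0)
        if any(ch in LOOKALIKES for ch in token):
            fixed = "".join(LOOKALIKES.get(ch, ch) for ch in token)
            found.append((token, fixed))
    return found


if __name__ == "__main__":
    sample = "Balance 12'345.67 Credit SO'OOO Debit 1'200"
    print(suspicious_figures(sample))  # [("SO'OOO", "50'000")]
```

The proposed digits are only a suggestion for a human to confirm against the PDF as displayed; the sketch cannot tell whether the substitution is actually right.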