| From | "F.R." <anthra.norell@bluewin.ch> |
|---|---|
| Date | Sun, 16 Feb 2014 15:00:00 +0100 |
| To | python-list@python.org |
| Subject | Puzzling PDF |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.7056.1392559276.18130.python-list@python.org> |
Hi all,

Struggling to parse bank statements that are unavailable in sensible data-transfer formats, I use pdftotext, which solves part of the problem. The other day I encountered a strange thing: one single figure out of many was erroneously converted into letters. Adobe Reader displays the figure 50'000 correctly, but pdftotext turns it into "SO'OOO" (the letters "S" as in Susan and "O" as in Otto). One would expect such a mistake from OCR. However, the statement is not a scan; it is made up of text.

Because malfunctions like this put a damper on the hope of ever having a reliable reader that doesn't require time-consuming manual verification, I played around a bit and ended up even more confused: when I lift the figure off the Adobe display (mark, copy) and paste it into a Python IDLE window, it is again letters (ASCII 83 and 79), even though on the Adobe display it shows correctly as digits. How can that be?

Frederic
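For illustration, the check described above (looking at the character codes of the pasted figure) and a possible screening step for the extracted text might be sketched as follows. This is only a sketch under assumptions: the names `code_points`, `repair_amount`, and `flag_suspects` are made up for this example, and the S-to-5 and O-to-0 mapping is a guess based on the single case reported here, not something pdftotext provides.

```python
import re

# Assumed mapping of look-alike letters to digits, based only on the
# one observed case ("SO'OOO" for 50'000).
LOOKALIKES = {"S": "5", "O": "0"}

# A standalone token built only from digits, the look-alike letters and
# the separators ' , . (it must start and end with a digit or letter).
SUSPECT = re.compile(
    r"(?<![A-Za-z0-9])[0-9SO](?:[0-9SO',.]*[0-9SO])?(?![A-Za-z0-9])"
)


def code_points(text):
    """Return the character codes, as one would inspect them in IDLE."""
    return [ord(ch) for ch in text]


def repair_amount(token):
    """Replace look-alike letters with the digits they resemble."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in token)


def flag_suspects(line):
    """Return (original, repaired) pairs for number-like tokens that
    contain look-alike letters, so they can be verified by hand."""
    pairs = []
    for match in SUSPECT.finditer(line):
        token = match.group()
        has_letter = any(ch in LOOKALIKES for ch in token)
        has_numeric_hint = any(ch.isdigit() or ch in "',." for ch in token)
        if has_letter and has_numeric_hint:
            pairs.append((token, repair_amount(token)))
    return pairs


print(code_points("SO'OOO"))                # [83, 79, 39, 79, 79, 79]
print(flag_suspects("Balance SO'OOO CHF"))  # [("SO'OOO", "50'000")]
```

Flagging rather than silently repairing keeps the manual check limited to the few suspicious tokens. It does not explain why the PDF yields letters in the first place.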
Puzzling PDF "F.R." <anthra.norell@bluewin.ch> - 2014-02-16 15:00 +0100
Re: Puzzling PDF Roy Smith <roy@panix.com> - 2014-02-16 10:33 -0500
Re: Puzzling PDF Alister <alister.ware@ntlworld.com> - 2014-02-16 18:59 +0000