


Puzzling PDF

Date Sun, 16 Feb 2014 15:00:00 +0100
From "F.R." <anthra.norell@bluewin.ch>
To python-list@python.org
Subject Puzzling PDF
Newsgroups comp.lang.python
Message-ID <mailman.7056.1392559276.18130.python-list@python.org>



Hi all,

Struggling to parse bank statements unavailable in sensible 
data-transfer formats, I use pdftotext, which solves part of the 
problem. The other day I encountered a strange thing: one single
figure out of many was erroneously converted into letters. Adobe Reader
displays the figure 50'000 correctly, but pdftotext turns it into
"SO'OOO" (the letters "S" as in Susan and "O" as in Otto). One would
expect such a mistake from OCR. However, the statement is not a scan,
but is made up of text. Because malfunctions like this put a damper on
the hope of ever having a reliable reader that doesn't require
time-consuming manual verification, I played around a bit and ended up
even more confused: when I lift the figure off the Adobe display (mark,
copy) and paste it into a Python IDLE window, it is again letters (ASCII
83 and 79), although on the Adobe display it shows correctly as digits. How
can that be?
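
For anyone who wants to repeat the check, here is roughly what I mean, as a
minimal sketch. The helper names code_points and looks_like_garbled_amount
are made up for illustration, and the regular expression is only an assumed
heuristic for amounts written with an apostrophe as thousands separator; it
will also flag ordinary words built from "S" and "O".

```python
import re


def code_points(text):
    """Return (character, code point) pairs for every character in text."""
    return [(ch, ord(ch)) for ch in text]


# Assumed heuristic: the token consists only of digits, apostrophes and the
# look-alike letters "S" and "O", and contains at least one such letter.
LOOKALIKE = re.compile(r"^(?=.*[SO])[0-9SO']+$")


def looks_like_garbled_amount(token):
    """True if token looks like a number whose digits became S/O letters."""
    return bool(LOOKALIKE.match(token))


pasted = "SO'OOO"  # what arrives in IDLE after copying from the viewer
print(code_points(pasted))                  # shows 83 for "S" and 79 for "O"
print(looks_like_garbled_amount(pasted))    # flagged for manual checking
print(looks_like_garbled_amount("50'000"))  # a real amount is not flagged
```

A real digit 5 would show code point 53 and a real zero 48, so the output
makes it obvious whether the pasted text holds digits or letters. Tokens
that get flagged would still need a manual look.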

Frederic









Thread

Puzzling PDF "F.R." <anthra.norell@bluewin.ch> - 2014-02-16 15:00 +0100
  Re: Puzzling PDF Roy Smith <roy@panix.com> - 2014-02-16 10:33 -0500
    Re: Puzzling PDF Alister <alister.ware@ntlworld.com> - 2014-02-16 18:59 +0000
