From: Dave Angel
Newsgroups: comp.lang.python
Subject: Re: io module and pdf question
Date: Tue, 25 Jun 2013 22:10:54 -0400

On 06/25/2013 12:15 PM, jyoung79@kc.rr.com wrote:
> Thank you Rusi and Christian!
>

Something I don't think was mentioned is that reading a file in text mode in Python 3 with latin-1 specified will always succeed, simply because every possible 8-bit byte maps to a character in Latin-1. That doesn't mean that the characters you get have any connection with the real meaning of the file.

> So it sounds like I should read the pdf data in as binary:
>
> --------------------
> import os
>
> pdfPath = '~/Desktop/test.pdf'
>
> colorlistData = ''
>
> with open(os.path.expanduser(pdfPath), 'rb') as f:
>     for i in f:
>         if 'XYZ:colorList' in i:
>             colorlistData = i.split('XYZ:colorList')[1]
>             break
>
> print(colorlistData)
> --------------------
>
> This gives me the error:
> TypeError: Type str doesn't support the buffer API

That's just a tiny piece of the error. Post the full traceback, which shows the line that fails, and what called it, and so on.

In this case I'd guess that the line:
    if 'XYZ:colorList' in i:
is failing, since a file opened with 'rb' hands you bytes objects, and you're testing them against a str. A bytes literal such as b'XYZ:colorList' would avoid that particular error. Note also that
    for i in f:
still splits the data on newline bytes, which is really a notion borrowed from text files. For reading streams of bytes, you also have the read() method, where you can supply your own count.

> > I admit I know nothing about binary, except it's ones and zeroes. Is there a way to read it in as binary, convert it to ascii/unicode,

That makes no sense without knowing what the binary data represents. It MIGHT be that pieces of it will actually be valid ASCII, or valid Unicode (encoded with some encoding). But you would have to ask the author, or look up the spec for that particular binary file format. I'm not familiar at all with how PDFs are encoded, so I don't know what the possibilities are.

One hacky approach is to use the strings utility (standard on most versions of Unix/Linux) to basically throw out most of the file, keeping only those portions of it that happen to look like reasonable ASCII. By default it captures each consecutive sequence of at least 4 printable ASCII characters, and puts a newline to represent one or more unprintable or non-ASCII characters between them.
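
A filter along those lines can be approximated in a few lines of Python. This is only a rough sketch, not the real strings utility: the function names printable_runs and find_value, and the sample bytes, are made up for illustration, and it treats only bytes 0x20 through 0x7e as printable.

```python
import re


def printable_runs(data, min_len=4):
    """Return each run of at least min_len printable ASCII bytes as a str.

    Hypothetical helper that roughly mimics the default behaviour of the
    strings utility described above; it is not a drop-in replacement.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def find_value(data, key):
    """Return the text following key in the first run that contains it."""
    for run in printable_runs(data):
        if key in run:
            return run.split(key, 1)[1]
    return None


# Made-up sample standing in for the bytes of a PDF file.
sample = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n\x00\x01XYZ:colorList="DarkBlue,Yellow"\x00\xff\n'

print(printable_runs(sample))
print(find_value(sample, "XYZ:colorList"))
```

For a real file you would get the bytes with something like open(path, 'rb').read() and pass them in. Anything stored in a compressed stream will not show up this way, since it no longer looks like ASCII.
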
If you cannot find strings (or string) for your OS, you can write the filter yourself. But much better would be to use some library that understands the PDF format rules.

> and then somehow split it by newline characters so that I can pull the appropriate metadata lines out? For example, XYZ:colorList="DarkBlue,Yellow"
>
> Thanks!
>
> Jay
>
> --
>
>> Most of the PDF objects are therefore not encoded. It is, however,
>> possible to include a PDF into another PDF and to encode it, but that's
>> a rare case. Therefore the metadata can usually be read in text mode.
>> However, to correctly find all objects, the xref-table indexes offsets
>> into the PDF. It must be treated binary in any case, and that's the
>> funny reason for the first 3 characters of the PDF - they must include
>> characters with the 8th bit set, such that FTP applications treat it as
>> binary.
>
>> Christian

-- 
DaveA