Newsgroups: comp.lang.python
Date: Wed, 26 Jun 2013 07:11:26 -0700 (PDT)
Subject: Re: io module and pdf question
From: wxjmfauth@gmail.com

On Tuesday, 25 June 2013 at 06:18:44 UTC+2, jyou...@kc.rr.com wrote:
> Would like to get your opinion on this.  Currently to get the metadata out of a pdf file, I loop through the guts of the file.  I know it's not the greatest idea to do this, but I'm trying to avoid extra modules, etc.
>
> Adobe javascript was used to insert the metadata, so the added data looks something like this:
>
> XYZ:colorList="DarkBlue,Yellow"
>
> With python 2.7, it successfully loops through the file contents and I'm able to find the line that contains "XYZ:colorList".
>
> However, when I try to run it with python 3, it errors:
>
>   File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
>
> I've done some research on this, and it looks like encoding it to latin-1 works.  I also found that if I use the io module, it will work on both python 2.7 and 3.3.  For example:
>
> --------------
> import io
> import os
>
> pdfPath = '~/Desktop/test.pdf'
>
> colorlistData = ''
>
> with io.open(os.path.expanduser(pdfPath), 'r', encoding='latin-1') as f:
>     for i in f:
>         if 'XYZ:colorList' in i:
>             colorlistData = i.split('XYZ:colorList')[1]
>             break
>
> print(colorlistData)
> --------------
>
> As you can tell, I'm clueless in how exactly this works and am hoping someone can give me some insight on:
> 1. Is there another way to get metadata out of a pdf without having to install another module?
> 2. Is it safe to assume pdf files should always be encoded as latin-1 (when trying to read it this way)?  Is there a chance they could be something else?
> 3. Is the io module a good way to pursue this?
>
> Thanks for your help!
>
> Jay

-----------

Forget latin-1.

There is nothing wrong with trying to get such information by reading a pdf file in binary mode. What is important is to know exactly what you are searching for and to do the work correctly.

A complete example with the pdf file I produced, hypermeta.pdf, which contains the string "abcé€" as Subject metadata.

pdf version: 1.4
producer: LaTeX with hyperref package
(personal comment: "xdvipdfmx")

Python 3.2

>>> with open('hypermeta.pdf', 'rb') as fo:
...     r = fo.read()
...
>>> p1 = r.find(b'Subject<')
>>> p1
4516
>>> p2 = r.find(b'>', p1)
>>> p2
4548
>>> rr = r[p1:p2+1]
>>> rr
b'Subject<feff00610062006300e920ac>'
>>> rrr = rr[len(b'Subject<'):-1]
>>> rrr
b'feff00610062006300e920ac'
>>> # decoding the information
>>> rrr = rrr.decode('ascii')
>>> rrr
'feff00610062006300e920ac'
>>> i = 0
>>> a = []
>>> while i < len(rrr):
...     t = rrr[i:i+4]
...     a.append(t)
...     i += 4
...
>>> a
['feff', '0061', '0062', '0063', '00e9', '20ac']
>>> b = [(int(e, 16) for e in a]
  File "", line 1
    b = [(int(e, 16) for e in a]
                               ^
SyntaxError: invalid syntax
>>> # oops, error allowed
>>> b = [int(e, 16) for e in a]
>>> b
[65279, 97, 98, 99, 233, 8364]
>>> c = [chr(e) for e in b]
>>> c
['\ufeff', 'a', 'b', 'c', 'é', '€']
>>> # result
>>> d = ''.join(c)
>>> d
'\ufeffabcé€'
>>> d = d[1:]
>>>
>>>
>>> d
'abcé€'

As Christian Gollwitzer pointed out, not all objects in a pdf are encoded in that way. Do not expect to get the content, the "text", that way. When built with Unicode technology, the text of a pdf is composed with a *unique* set of abstract IDs, constructed with the help of the Unicode code point table and the properties of the font (OpenType) used in that pdf. This is the equivalent of the utf8/16/32 transformers in "plain unicode".

Luckily for the crowd, in 2013, there are people (devs) who understand the coding of characters, unicode and how to use it.

jmf
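
The manual steps of the session above (find the key, cut out the hex digits, convert, drop the byte-order mark) can be condensed into one small function. This is only a minimal sketch for Python 3, assuming the value is stored as a hex string starting with the UTF-16BE byte-order mark (feff), as it is in hypermeta.pdf. The name decode_pdf_hex_string is hypothetical, not part of any library, and literal (...) strings, other encodings and compressed object streams are not handled.

```python
import binascii
import codecs


def decode_pdf_hex_string(data, key):
    """Return the text of the first <hex string> following `key`, or None.

    Hypothetical helper. Assumes a UTF-16BE value with a leading
    byte-order mark (feff), as in the session above.
    """
    start = data.find(key + b'<')
    if start == -1:
        return None                      # key not present in this form
    start += len(key) + 1                # skip past 'key<'
    end = data.find(b'>', start)
    if end == -1:
        return None                      # unterminated hex string
    # Drop any whitespace between the hex digits before converting.
    # unhexlify raises binascii.Error on an odd number of digits.
    hex_digits = b''.join(data[start:end].split())
    raw = binascii.unhexlify(hex_digits)
    if not raw.startswith(codecs.BOM_UTF16_BE):
        return None                      # not UTF-16BE with a BOM
    return raw[len(codecs.BOM_UTF16_BE):].decode('utf-16-be')


# Same bytes as found in the session above.
sample = b'/Subject<feff00610062006300e920ac>'
print(ascii(decode_pdf_hex_string(sample, b'Subject')))  # 'abc\xe9\u20ac'
```

Decoding with the 'utf-16-be' codec instead of calling chr() on each group of four hex digits also copes with characters outside the BMP, which UTF-16 stores as surrogate pairs.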