From: "Joseph L. Casale"
To: Michael Ströder, "python-list@python.org"
Subject: RE: Ldap module and base64 oncoding
Date: Mon, 27 May 2013 05:15:01 +0000
Newsgroups: comp.lang.python

Hi Michael,

> Processing LDIF is one thing, doing LDAP operations another.
>
> LDIF
> itself is meant to be ASCII-clean. But each attribute value can carry any
> byte sequence (e.g. attribute 'jpegPhoto'). There's no further processing by
> module LDIF - it simply returns byte sequences.
>
> The access protocol LDAPv3 mandates UTF-8 encoding for Unicode strings on the
> wire if attribute syntax is DirectoryString, IA5String (mainly ASCII) or similar.
>
> So if your LDIF input returns UTF-16 encoded attribute values for e.g.
> attribute 'cn' or 'o' or another attribute not being of OctetString or Binary
> syntax, something's wrong with the producer of the LDIF data.

That could be. I am using MS's ldifde.exe to dump a Domino and an AD directory for
comparative processing. The problem is I don't have much control over the data in
the directory, and I do know that DNs have non-ASCII characters unique to the

> I wonder what the string really is. At least the base64-encoding you provided
> before decodes as UTF-8 but I'm not sure whether it's the right sequence of
> Unicode code points you're expecting.
>
> >>> 'ZGV0XDMzMTB3YmJccGc='.decode('base64').decode('utf-8')
> u'det\\3310wbb\\pg'
>
> I still can't figure out what you're really doing though.
> I'd recommend to strip down your operations to a very simple test code
> snippet illustrating the issue and post that here.

So I have removed all my likely broken attempts at working with this data and will
soon have some simple code, but at this point I may have an indication of what is
awry with my data.

After parsing the data for a user I am simply taking a value from the LDIF file and
writing it back out to another, which fails. The value parsed is:

officestreetaddress:: T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==


  File "C:\Python27\lib\site-packages\ldif.py", line 202, in unparse
    self._unparseChangeRecord(record)
  File "C:\Python27\lib\site-packages\ldif.py", line 181, in _unparseChangeRecord
    self._unparseAttrTypeandValue(mod_type,mod_val)
  File "C:\Python27\lib\site-packages\ldif.py", line 142, in _unparseAttrTypeandValue
    self._unfoldLDIFLine(':: '.join([attr_type,base64.encodestring(attr_value).replace('\n','')]))
  File "C:\Python27\lib\base64.py", line 315, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 7: ordinal not in range(128)

> c:\python27\lib\base64.py(315)encodestring()
-> pieces.append(binascii.b2a_base64(chunk))
(Pdb) l
310     def encodestring(s):
311         """Encode a string into multiple lines of base-64 data."""
312         pieces = []
313         for i in range(0, len(s), MAXBINSIZE):
314             chunk = s[i : i + MAXBINSIZE]
315  ->         pieces.append(binascii.b2a_base64(chunk))
316         return "".join(pieces)
317
318
319     def decodestring(s):
320         """Decode a string."""
(Pdb) args
s = Otto-Meßmer-Straße 1

So moving up a frame or two and looking at the entry dict, I see a modlist entry of:
('streetAddress', [u'Otto-Me\xdfmer-Stra\xdfe 1']) which is correct:

In [2]: 'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='.decode('base64').decode('utf-8')
Out[2]: u'Otto-Me\xdfmer-Stra\xdfe 1'

Looking at the stack trace, I think I see the issue:
(Pdb) import base64
(Pdb) base64.encodestring(u'Otto-Me\xdfmer-Stra\xdfe 1'.encode('utf-8')).replace('\n','')
'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='

I now have the exact value I started with. Ensuring that wherever I handle the
original values I return UTF-8-decoded objects for use in a modlist to write later,
and subclassing LDIFWriter to override _unparseAttrTypeandValue so that it does the
encoding, has eliminated all the errors.

What finally remains is ldifde.exe's output of what looks like U+00BF, an inverted
question mark, for some values; otherwise this issue looks solved.

Thanks for everything,
jlc
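
For reference, a minimal sketch of the fix described above: encode the Unicode
value to UTF-8 bytes first, then base64-encode it. It is written for Python 3
rather than the Python 2.7 used in this thread, and it does not touch the ldif
module at all. The helper name ldif_attr_line is hypothetical and only stands
in for the encoding step that the overridden _unparseAttrTypeandValue would
perform.

```python
import base64


def ldif_attr_line(attr_type, value):
    """Build an LDIF 'attr:: <base64>' line from a text or bytes value.

    Hypothetical helper for illustration; not part of the ldif module.
    """
    # base64 operates on bytes, so text has to be encoded first.
    # UTF-8 is what LDAPv3 expects for string-valued attributes.
    if isinstance(value, str):
        value = value.encode("utf-8")
    encoded = base64.b64encode(value).decode("ascii")
    return "{}:: {}".format(attr_type, encoded)


line = ldif_attr_line("streetAddress", "Otto-Me\xdfmer-Stra\xdfe 1")
print(line)  # streetAddress:: T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==

# Decoding reverses both steps and gives back the original text.
b64_part = line.split(":: ", 1)[1]
round_trip = base64.b64decode(b64_part).decode("utf-8")
```

Values that are already bytes (such as a jpegPhoto) pass through unchanged, so
only text values are affected by the UTF-8 step.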