From: Stefan Behnel
Newsgroups: comp.lang.python
Subject: Re: Python 3 - xml - crlf handling problem
Date: Fri, 02 Dec 2011 12:23:54 +0100

durumdara, 02.12.2011 09:13:
> So: may I don't understand the things well, but I thought that parser
> drop the "nondata" CRLF-s + other characters (not preserve them).

Well, it does that, at least on my side (which is not under Windows):

===================
# NOTE: the XML markup of this sample did not survive; <doc> is a
# stand-in root name, <element> is the tag the code in this thread uses.
original='''
<doc>
 <element>
 AnyText
 </element>
</doc>
'''

from xml.dom.minidom import parse

def main():
    # Write the sample with CRLF line endings, as raw bytes.
    f = open('test.0.xml', 'wb')
    f.write(original.strip().replace('\n', '\r\n').encode('utf8'))
    f.close()

    xo = parse('test.0.xml')
    de = xo.documentElement
    # Text node before <element>, then the text inside <element>.
    print(repr(de.childNodes[0].nodeValue))
    print(repr(de.childNodes[1].childNodes[0].nodeValue))

if __name__ == '__main__':
    main()
===================

This prints

    '\n '

and

    '\n AnyText\n '

on my side, so the line-ending normalisation in the parser (CRLF to LF)
properly did its work: there is no '\r' left in either text node.

> Then don't matters that I read the XML from a file, or I create it
> from code, because all of them generating SAME RESULT.
> But Python don't do that.
> If I make xml from code, the code is without plus characters.

What do you mean by "plus characters"? It's not the "+" character that
you are referring to, right? Do you mean additional characters, such as
the additional '\r'?

> But Python preserves parsed CRLF characters somewhere, and they are
> also flushing into the result.
>
> Example:
>
> original='''
>
>
>
> AnyText
>
>
> '''
>
> If I parse this, and write with toxml, the CRLF-s remaining in the
> code, but if I create this document line by line, there is no CRLF,
> the toxml write "only lined" xml.
>
> This also meaning that if I use prettyxml call, to prettying the xml,
> the file size is growing.
>
> If there is a multiple processing queue - if two pythons communicating
> in xml files, the size can growing every time.
>
> Py1 - read the Py2's file, process it, and write to a result file
> Py2 - read the Py1's result file, process it, and pass back to Py1
> this can grow the file with each call, because "pretty" CRLF-s not
> normalized out from the code.
>
> original='''
>
>
>
> AnyText
>
>
> '''
>
> def main():
>     f = open('test.0.xml','w')
>     f.write(original.strip())
>     f.close()
>
>     for i in range(1, 10 + 1):
>         xo = parse('test.%d.xml' % (i - 1))
>         de = xo.documentElement
>         de.setAttribute('c', str(i))
>         t = de.getElementsByTagName('element')[0]
>         tn = t.childNodes[0]
>         print (dir(t))
>         print (tn)
>         print (tn.nodeValue)
>         tn.nodeValue = str(i) + '\t' + '\n'
>         #s = xo.toxml()
>         s = xo.toprettyxml()
>         f = open('test.%d.xml' % i,'w')
>         f.write(s)
>         f.close()
>
>     sys.exit()
>
> And: because Python is not converting CRLF to &013; I cannot make
> different from "prettied source's CRLF" (loaded from template file),
> "my own pretty's CRLF" (my own topretty), and really contained CRLF
> (for example a memo field's value).
>
> My case is that the processor application (for whom I pass the XML
> from Python) is sensitive to "plus CRLF"-s in text nodes, I must do
> something these "plus" items to avoid external's program errors.
>
> I got these templates and input files from prettied format (with
> CRLFS), but I must "eat" them to make an XML that one lined if
> possible.
>
> I hope you understand my problem with it.

Still not quite, but never mind. It may or may not be a problem in
minidom or in your code. For example, you shouldn't open the output file
in text mode but in binary mode (i.e. "wb") and write bytes into it,
e.g. the result of xo.toprettyxml(encoding='utf-8'). In text mode,
Windows turns every '\n' back into '\r\n' on write.

Here's what I tried with ElementTree, and it seems to do what your code
above wants.
The indent() function is taken from Fredrik's element lib page:

http://effbot.org/zone/element-lib.htm

========================
# NOTE: the XML markup of this sample did not survive; <doc> is a
# stand-in root name, <element> is the tag looked up below.
original='''
<doc>
 <element>
 AnyText
 </element>
</doc>
'''

def indent(elem, level=0):
    # In-place pretty-printer: rewrites only whitespace-only text/tail.
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

def main():
    f = open('test.0.xml','w', encoding='utf8')
    f.write(original.strip())
    f.close()

    # cElementTree was removed in Python 3.9; since 3.3 the plain
    # ElementTree module picks up the C accelerator by itself.
    from xml.etree.ElementTree import parse

    for i in range(10):
        tree = parse('test.%d.xml' % i)
        root = tree.getroot()
        root.set('c', str(i+1))
        t = root.find('.//element')
        t.text = '%d\t\n' % (i+1)
        indent(root)
        tree.write('test.%d.xml' % (i+1), encoding='utf8')

if __name__ == '__main__':
    main()
========================

Hope that helps,

Stefan
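
For the minidom route in the quoted code, one way to keep the file from growing is to drop whitespace-only text nodes before each toprettyxml() call. The following is only a minimal sketch and not part of the exchange above: strip_whitespace_nodes(), pretty() and the <doc> root are names made up for the illustration. It assumes that whitespace-only text nodes never carry data in your documents, and that you run a Python 3 recent enough for toprettyxml() to leave an element's sole text child untouched.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document; <doc> is a made-up root name.
SAMPLE = "<doc>\n <element>\n AnyText\n </element>\n</doc>"


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    for child in list(node.childNodes):  # copy: the list is mutated below
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


def pretty(xml_text, strip=True):
    """Parse xml_text and pretty-print it, optionally stripping first."""
    dom = parseString(xml_text)
    if strip:
        strip_whitespace_nodes(dom)
    return dom.toprettyxml()


if __name__ == '__main__':
    once = pretty(SAMPLE)
    # With stripping, a second round trip reproduces the same text.
    print(pretty(once) == once)
    # Without stripping, every round trip adds more whitespace.
    naive = pretty(pretty(SAMPLE, strip=False), strip=False)
    print(len(naive) > len(once))
```

Text nodes that mix whitespace with real content, such as the one around AnyText, are left alone here, so a memo field's own line breaks would survive. Whether that is the right rule for the external processor is something only its input requirements can tell.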