To: python-list@python.org
From: Stefan Behnel <stefan_ml@behnel.de>
Subject: Re: Python 3 - xml - crlf handling problem
Date: Fri, 02 Dec 2011 12:23:54 +0100
Newsgroups: comp.lang.python
Message-ID: <mailman.3221.1322825051.27778.python-list@python.org>

durumdara, 02.12.2011 09:13:
> So: maybe I don't understand things well, but I thought that the parser
> drops the "nondata" CRLFs and other such characters (does not preserve them).

Well, it does that, at least on my side (which is not under Windows):

===================
original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>
'''

from xml.dom.minidom import parse

def main():
    # Write the document as bytes, with CRLF line endings.
    with open('test.0.xml', 'wb') as f:
        f.write(original.strip().replace('\n', '\r\n').encode('utf8'))

    xo = parse('test.0.xml')
    de = xo.documentElement
    # Text before <element>, then the text inside <element>.
    print(repr(de.childNodes[0].nodeValue))
    print(repr(de.childNodes[1].childNodes[0].nodeValue))

if __name__ == '__main__':
    main()
===================

This prints '\n    ' and '\n        AnyText\n    ' on my side, so the 
parser's line-ending normalisation (CRLF to LF) properly did its work.
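
The same check can be done without touching the file system. A minimal 
sketch, assuming the same expat-based minidom parser as above; the byte 
string and the variable names are only there for illustration:

```python
from xml.dom.minidom import parseString

# The same document as above, already encoded, with CRLF line endings.
crlf_bytes = (b'<doc a="1">\r\n'
              b'    <element a="1">\r\n'
              b'        AnyText\r\n'
              b'    </element>\r\n'
              b'</doc>')

doc = parseString(crlf_bytes)
root = doc.documentElement

# Whitespace between <doc> and <element>, then the text inside <element>.
indent_text = root.childNodes[0].nodeValue
element_text = root.childNodes[1].childNodes[0].nodeValue

print(repr(indent_text))
print(repr(element_text))
```

Neither value contains a '\r'. The whitespace itself is still there, though: 
the parser only normalises the line endings, it does not drop the text.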


> Then it doesn't matter whether I read the XML from a file or create it
> from code, because both should generate the SAME RESULT.
> But Python doesn't do that.
> If I make the XML from code, the result is without plus characters.

What do you mean by "plus characters"? It's not the "+" character that you 
are referring to, right? Do you mean additional characters? Such as the 
additional '\r'?


> But Python preserves the parsed CRLF characters somewhere, and they are
> also flushed into the result.
>
> Example:
>
> original='''
> <?xml version="1.0" encoding="utf-8"?>
> <doc a="1">
>      <element a="1">
>          AnyText
>      </element>
> </doc>
> '''
>
> If I parse this and write it with toxml, the CRLFs remain in the
> output, but if I create this document line by line, there is no CRLF;
> toxml writes the XML on a single line.
>
> This also means that if I call toprettyxml to prettify the XML,
> the file size grows.
>
> If there is a multi-step processing queue, with two Python programs
> communicating through XML files, the size can grow every time.
>
> Py1 - reads Py2's file, processes it, and writes to a result file
> Py2 - reads Py1's result file, processes it, and passes it back to Py1
> This can grow the file with each call, because the "pretty" CRLFs are not
> normalized out of the output.
>
> original='''
> <?xml version="1.0" encoding="utf-8"?>
> <doc a="1">
>      <element a="1">
>          AnyText
>      </element>
> </doc>
> '''
>
> import sys
> from xml.dom.minidom import parse
>
> def main():
>     f = open('test.0.xml','w')
>     f.write(original.strip())
>     f.close()
>
>     for i in range(1, 10 + 1):
>         xo = parse('test.%d.xml' % (i - 1))
>         de = xo.documentElement
>         de.setAttribute('c', str(i))
>         t = de.getElementsByTagName('element')[0]
>         tn = t.childNodes[0]
>         print (dir(t))
>         print (tn)
>         print (tn.nodeValue)
>         tn.nodeValue = str(i) + '\t' + '\n'
>         #s = xo.toxml()
>         s = xo.toprettyxml()
>         f = open('test.%d.xml' % i,'w')
>         f.write(s)
>         f.close()
>
>     sys.exit()
>
> And: because Python does not convert CRLF to &013;, I cannot tell the
> difference between the "prettied source's CRLF" (loaded from a template
> file), "my own pretty's CRLF" (from my own toprettyxml call), and a CRLF
> that is really part of the content (for example a memo field's value).
>
> My case is that the processor application (to which I pass the XML
> from Python) is sensitive to "plus CRLF"-s in text nodes, so I must do
> something about these "plus" items to avoid errors in the external program.
>
> I got these templates and input files in prettied format (with
> CRLFs), but I must "eat" them to make an XML that is on one line if
> possible.
>
> I hope you understand my problem with it.

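Still not quite, but never mind. This may or may not be a problem in minidom 
or your code. For example, you shouldn't open the output file in text mode 
but in binary mode (i.e. "wb") and write encoded bytes into it, e.g. the 
result of toprettyxml(encoding='utf-8'). In text mode, Python on Windows 
turns each '\n' into '\r\n' on output.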
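
A minimal sketch of what I mean, not a tested fix for your setup. 
strip_blank_text() and pretty_bytes() are hypothetical helper names of mine, 
not part of minidom. The idea is to drop the whitespace-only text nodes left 
over from an earlier pretty-printing pass, then serialise to bytes, so that no 
newline translation can happen on the way into a file opened with "wb":

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def strip_blank_text(node):
    """Remove whitespace-only text nodes below 'node' (hypothetical helper)."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_blank_text(child)

def pretty_bytes(xml_bytes):
    """Parse, drop old indentation, and pretty-print to UTF-8 bytes."""
    doc = parseString(xml_bytes)
    strip_blank_text(doc)
    # With an encoding argument, toprettyxml() returns bytes, not str.
    return doc.toprettyxml(indent='  ', encoding='utf-8')

source = b'<doc a="1">\r\n  <element a="1">AnyText</element>\r\n</doc>'
first = pretty_bytes(source)
second = pretty_bytes(first)   # a second pass over the already pretty output

print(first.decode('utf-8'))
```

Note the assumption baked into this: every whitespace-only text node is 
treated as formatting. If such a node can be real content for you (a memo 
field holding only a line break, say), the helper would throw it away.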

Here's what I tried with ElementTree, and it seems to do what your code 
above wants. The indent() function is taken from Fredrik's element lib page:

http://effbot.org/zone/element-lib.htm

========================
original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>
'''

def indent(elem, level=0):
    # In-place pretty printer: only whitespace-only text/tail gets replaced.
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        # 'elem' is now the last child; its tail closes the parent element.
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

def main():
    with open('test.0.xml', 'w', encoding='utf8') as f:
        f.write(original.strip())

    # cElementTree was removed in Python 3.9; since 3.3, plain ElementTree
    # uses the C accelerator automatically.
    from xml.etree.ElementTree import parse

    for i in range(10):
        tree = parse('test.%d.xml' % i)
        root = tree.getroot()
        root.set('c', str(i+1))
        t = root.find('.//element')
        t.text = '%d\t\n' % (i+1)
        indent(root)
        # write() with an encoding opens the file in binary mode itself.
        tree.write('test.%d.xml' % (i+1), encoding='utf8')

if __name__ == '__main__':
    main()
========================
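
As a side note, Python 3.9 and later ship this helper as 
xml.etree.ElementTree.indent(), so the copied function is not needed there. 
A small in-memory sketch, assuming Python 3.9+; SOURCE and round_trip() are 
names made up for the example:

```python
import xml.etree.ElementTree as ET

SOURCE = ('<doc a="1">\n'
          '    <element a="1">\n'
          '        AnyText\n'
          '    </element>\n'
          '</doc>')

def round_trip(xml_text, counter):
    """Parse, modify, re-indent and serialise once (illustrative helper)."""
    root = ET.fromstring(xml_text)
    root.set('c', str(counter))
    root.find('.//element').text = '%d\t\n' % counter
    ET.indent(root)  # stdlib pretty printer, available from Python 3.9
    return ET.tostring(root, encoding='unicode')

sizes = []
text = SOURCE
for n in range(1, 6):
    # Feed each result back in, like the Py1/Py2 ping-pong described above.
    text = round_trip(text, n)
    sizes.append(len(text))

print(sizes)
```

The serialised length stays the same from pass to pass, because only 
whitespace-only text and tails are rewritten and nothing is appended.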

Hope that helps,

Stefan