Groups > comp.lang.python > #16428 > unrolled thread

Python 3 - xml - crlf handling problem

Started by	durumdara <durumdara@gmail.com>
First post	2011-11-30 04:08 -0800
Last post	2011-12-02 12:23 +0100
Articles	4 — 2 participants


  Python 3 - xml - crlf handling problem durumdara <durumdara@gmail.com> - 2011-11-30 04:08 -0800
    Re: Python 3 - xml - crlf handling problem Stefan Behnel <stefan_ml@behnel.de> - 2011-11-30 14:47 +0100
      Re: Python 3 - xml - crlf handling problem durumdara <durumdara@gmail.com> - 2011-12-02 00:13 -0800
        Re: Python 3 - xml - crlf handling problem Stefan Behnel <stefan_ml@behnel.de> - 2011-12-02 12:23 +0100

#16428 — Python 3 - xml - crlf handling problem

From	durumdara <durumdara@gmail.com>
Date	2011-11-30 04:08 -0800
Subject	Python 3 - xml - crlf handling problem
Message-ID	<3aae0b18-a194-444f-a2fc-da156204bd95@20g2000yqa.googlegroups.com>

Hi!

It seems to me that XML parsing is "wrong" in Python.

I have to take predefined XML files, parse them, extend them, and
produce some result.

But as far as I can see, this works incorrectly on Windows.

When the predefined XML files are "formatted" (prettied) with CRLFs,
the parser keeps these plus LF characters (it does not apply the rule
that CR = LF = CRLF), and they appear in the new result too.

    xo = parse('test_original.xml')
    de = xo.documentElement
    de.setAttribute('b', "2")
    b = xo.toxml('utf-8')
    f = open('test_original2.xml', 'wb')
    f.write(b)
    f.close()

And: if I use text elements, this can extend their content with
plus characters and produce wrong XML...

Because of this problem I can only use XML that I generated myself,
not prettied XML!

Is this normal?

Thanks for reading,
   dd


#16433

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2011-11-30 14:47 +0100
Message-ID	<mailman.3158.1322660900.27778.python-list@python.org>
In reply to	#16428

durumdara, 30.11.2011 13:08:
> It seems to me that XML parsing is "wrong" in Python.

You didn't say what you are using for parsing, but from your example, it 
appears likely that you are using the xml.dom.minidom module.


> I have to take predefined XML files, parse them, extend them, and
> produce some result.
>
> But as far as I can see, this works incorrectly on Windows.
>
> When the predefined XML files are "formatted" (prettied) with CRLFs,
> the parser keeps these plus LF characters (it does not apply the rule
> that CR = LF = CRLF), and they appear in the new result too.

I assume that you are referring to XML's newline normalisation algorithm? 
That should normally be handled by the parser, which, in the case of 
minidom, is usually expat. And I seriously doubt that expat has a problem 
with something as basic as newline normalisation.

Did you verify that the newlines are really not being converted by the 
parser? From your example, I can only see that you are serialising the XML 
tree back into a file, which may or may not alter the line endings by 
itself. Instead, take a look at the text content in the tree right after 
parsing to see how line endings look at that level.
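
A minimal sketch of that check, assuming minidom and an in-memory document (the byte string below is made up for illustration and is not the poster's file). It looks at the text nodes right after parsing and then at the serialised bytes, which helps to tell apart what the parser delivers from what is added when the file is written:

```python
from xml.dom.minidom import parseString

# Hypothetical in-memory document that uses CRLF line endings throughout.
data = (b'<doc a="1">\r\n'
        b'    <element a="1">\r\n'
        b'        AnyText\r\n'
        b'    </element>\r\n'
        b'</doc>')

doc = parseString(data)
root = doc.documentElement

# Look at the text nodes straight after parsing, before any serialisation.
first_text = root.childNodes[0].nodeValue
element_text = root.getElementsByTagName('element')[0].childNodes[0].nodeValue
print(repr(first_text))
print(repr(element_text))

# Then check whether the serialiser itself emits any carriage returns.
serialised = doc.toxml('utf-8')
print(b'\r' in serialised)
```

If neither the tree nor the serialised bytes contain '\r', any CRLF in the output file most likely comes from how the file is opened and written.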


>      xo = parse('test_original.xml')
>      de = xo.documentElement
>      de.setAttribute('b', "2")
>      b = xo.toxml('utf-8')
>      f = open('test_original2.xml', 'wb')
>      f.write(b)
>      f.close()

This doesn't do any pretty printing, though, in case that's what you were 
really after (which appears likely according to your comments).


> And: if I use text elements, this can extend their content with
> plus characters and produce wrong XML...

Sorry, I don't understand this sentence.


> Because of this problem I can only use XML that I generated myself,
> not prettied XML!

Then what is the actual problem? Do you get an error somewhere? And if so, 
could you paste the exact error message and describe what you do to produce 
it? The mere fact that the line endings use the normal platform specific 
representation doesn't seem like a problem to me.

Stefan


#16537

From	durumdara <durumdara@gmail.com>
Date	2011-12-02 00:13 -0800
Message-ID	<6cc21b82-aa31-47d2-8510-5d629b6c12f2@t16g2000vba.googlegroups.com>
In reply to	#16433

Dear Stefan!


Maybe I don't understand things well, but I thought that the parser
drops the "non-data" CRLFs and other such characters (does not
preserve them).

Then it would not matter whether I read the XML from a file or create
it from code, because both would generate the SAME RESULT.
But Python doesn't do that.
If I build the XML from code, the output has no plus characters.
But Python preserves the parsed CRLF characters somewhere, and they
are also flushed into the result.

Example:

original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>
'''

If I parse this and write it with toxml, the CRLFs remain in the
output, but if I create this document line by line, there is no CRLF
and toxml writes single-line XML.

This also means that if I call toprettyxml to prettify the XML, the
file size grows.

If there is a multi-step processing queue, where two Python programs
communicate through XML files, the size can grow every time:

Py1 - reads Py2's file, processes it, and writes a result file
Py2 - reads Py1's result file, processes it, and passes it back to Py1

This can grow the file with each call, because the "pretty" CRLFs are
not normalized out of the document.

original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
    <element a="1">
        AnyText
    </element>
</doc>
'''

import sys
from xml.dom.minidom import parse

def main():
    # Write the starting document.
    f = open('test.0.xml','w')
    f.write(original.strip())
    f.close()

    for i in range(1, 10 + 1):
        # Parse the previous file, modify it, and write the next one.
        xo = parse('test.%d.xml' % (i - 1))
        de = xo.documentElement
        de.setAttribute('c', str(i))
        t = de.getElementsByTagName('element')[0]
        tn = t.childNodes[0]
        print (dir(t))
        print (tn)
        print (tn.nodeValue)
        tn.nodeValue = str(i) + '\t' + '\n'
        #s = xo.toxml()
        s = xo.toprettyxml()
        f = open('test.%d.xml' % i,'w')
        f.write(s)
        f.close()

    sys.exit()

main()

And: because Python does not convert CRLF to &#13;, I cannot
distinguish between the "prettied source's CRLF" (loaded from a
template file), "my own pretty's CRLF" (from my own toprettyxml call),
and a CRLF that is really part of the content (for example a memo
field's value).
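
For reference, the XML rule behind this distinction is that a literal CR or CRLF in the file is normalised to LF by the parser, while a carriage return written as the character reference &#13; survives parsing. A small sketch with minidom; the `memo` element name and the `text_of` helper are made up for this illustration:

```python
from xml.dom.minidom import parseString

def text_of(xml_bytes):
    """Return the concatenated text content of the root element."""
    root = parseString(xml_bytes).documentElement
    return ''.join(node.data for node in root.childNodes)

# "memo" is a made-up element name used only for this illustration.
literal = text_of(b'<memo>line1\r\nline2</memo>')
referenced = text_of(b'<memo>line1&#13;\nline2</memo>')

print(repr(literal))      # the literal CRLF arrives as a single '\n'
print(repr(referenced))   # the character reference keeps the '\r'
```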

In my case the processor application (to which I pass the XML from
Python) is sensitive to "plus CRLFs" in text nodes, so I must do
something about these "plus" items to avoid errors in the external
program.

I get these templates and input files in prettied format (with
CRLFs), but I must "eat" them to produce XML that is on one line if
possible.

I hope you understand my problem.

Thanks:
   dd


#16539

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2011-12-02 12:23 +0100
Message-ID	<mailman.3221.1322825051.27778.python-list@python.org>
In reply to	#16537

durumdara, 02.12.2011 09:13:
> Maybe I don't understand things well, but I thought that the parser
> drops the "non-data" CRLFs and other such characters (does not
> preserve them).

Well, it does that, at least on my side (which is not under Windows):

===================
original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
     <element a="1">
         AnyText
     </element>
</doc>
'''

from xml.dom.minidom import parse

def main():
     f = open('test.0.xml', 'wb')
     f.write(original.strip().replace('\n', '\r\n').encode('utf8'))
     f.close()

     xo = parse('test.0.xml')
     de = xo.documentElement
     print(repr(de.childNodes[0].nodeValue))
     print(repr(de.childNodes[1].childNodes[0].nodeValue))

if __name__ == '__main__':
     main()
===================

This prints '\n    ' and '\n        AnyText\n    ' on my side, so the
newline normalisation in the parser properly did its work.


> Then it would not matter whether I read the XML from a file or create
> it from code, because both would generate the SAME RESULT.
> But Python doesn't do that.
> If I build the XML from code, the output has no plus characters.

What do you mean by "plus characters"? It's not the "+" character that you 
are referring to, right? Do you mean additional characters? Such as the 
additional '\r'?


> But Python preserves the parsed CRLF characters somewhere, and they
> are also flushed into the result.
>
> Example:
>
> original='''
> <?xml version="1.0" encoding="utf-8"?>
> <doc a="1">
>      <element a="1">
>          AnyText
>      </element>
> </doc>
> '''
>
> If I parse this and write it with toxml, the CRLFs remain in the
> output, but if I create this document line by line, there is no CRLF
> and toxml writes single-line XML.
>
> This also means that if I call toprettyxml to prettify the XML, the
> file size grows.
>
> If there is a multi-step processing queue, where two Python programs
> communicate through XML files, the size can grow every time:
>
> Py1 - reads Py2's file, processes it, and writes a result file
> Py2 - reads Py1's result file, processes it, and passes it back to Py1
>
> This can grow the file with each call, because the "pretty" CRLFs are
> not normalized out of the document.
>
> original='''
> <?xml version="1.0" encoding="utf-8"?>
> <doc a="1">
>      <element a="1">
>          AnyText
>      </element>
> </doc>
> '''
>
> def main():
>      f = open('test.0.xml','w')
>      f.write(original.strip())
>      f.close()
>
>      for i in range(1, 10 + 1):
>          xo = parse('test.%d.xml' % (i - 1))
>          de = xo.documentElement
>          de.setAttribute('c', str(i))
>          t = de.getElementsByTagName('element')[0]
>          tn = t.childNodes[0]
>          print (dir(t))
>          print (tn)
>          print (tn.nodeValue)
>          tn.nodeValue = str(i) + '\t' + '\n'
>          #s = xo.toxml()
>          s = xo.toprettyxml()
>          f = open('test.%d.xml' % i,'w')
>          f.write(s)
>          f.close()
>
>      sys.exit()
>
> And: because Python does not convert CRLF to &#13;, I cannot
> distinguish between the "prettied source's CRLF" (loaded from a
> template file), "my own pretty's CRLF" (from my own toprettyxml call),
> and a CRLF that is really part of the content (for example a memo
> field's value).
>
> In my case the processor application (to which I pass the XML from
> Python) is sensitive to "plus CRLFs" in text nodes, so I must do
> something about these "plus" items to avoid errors in the external
> program.
>
> I get these templates and input files in prettied format (with
> CRLFs), but I must "eat" them to produce XML that is on one line if
> possible.
>
> I hope you understand my problem.

Still not quite, but never mind. May or may not be a problem in minidom or 
your code. For example, you shouldn't open the output file in text mode but 
in binary mode (i.e. "wb") because you are writing bytes into it.
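
Which mode is appropriate depends on what the serialiser returns: in Python 3, minidom's toxml() and toprettyxml() return bytes when an encoding is passed and str when it is not. A rough sketch of both cases follows; the file names are placeholders, and the `newline='\n'` argument is one possible way to keep text mode from writing CRLF on Windows:

```python
import os
import tempfile
from xml.dom.minidom import parseString

doc = parseString('<doc a="1"><element a="1">AnyText</element></doc>')

as_bytes = doc.toxml('utf-8')    # encoding given: returns bytes
as_text = doc.toprettyxml()      # no encoding: returns str

with tempfile.TemporaryDirectory() as tmp:
    bytes_path = os.path.join(tmp, 'bytes.xml')   # placeholder file names
    text_path = os.path.join(tmp, 'text.xml')

    # Bytes go into a file opened in binary mode.
    with open(bytes_path, 'wb') as f:
        f.write(as_bytes)

    # newline='\n' stops text mode from turning '\n' into '\r\n' on Windows.
    with open(text_path, 'w', encoding='utf-8', newline='\n') as f:
        f.write(as_text)

    with open(text_path, 'rb') as f:
        raw = f.read()

print(type(as_bytes).__name__, type(as_text).__name__)
print(b'\r' in raw)
```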

Here's what I tried with ElementTree, and it seems to do what your code 
above wants. The indent() function is taken from Fredrik's element lib page:

http://effbot.org/zone/element-lib.htm
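
On Python 3.9 and later, the standard library has its own xml.etree.ElementTree.indent(), which could stand in for a hand-copied helper. A short sketch, assuming that version; the sample document is made up. Because whitespace-only text and tails are overwritten rather than appended to, indenting an already indented tree should not make it grow:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<doc a="1"><element a="1">AnyText</element></doc>')

# Indent in place (available from Python 3.9 on).
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding='unicode')
print(pretty)

# Parse the pretty output and indent again: the result stays the same.
again = ET.fromstring(pretty)
ET.indent(again, space="  ")
pretty_again = ET.tostring(again, encoding='unicode')
print(pretty_again == pretty)
```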

========================
original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
     <element a="1">
         AnyText
     </element>
</doc>
'''

def indent(elem, level=0):
     i = "\n" + level*"  "
     if len(elem):
         if not elem.text or not elem.text.strip():
             elem.text = i + "  "
         if not elem.tail or not elem.tail.strip():
             elem.tail = i
         for elem in elem:
             indent(elem, level+1)
         if not elem.tail or not elem.tail.strip():
             elem.tail = i
     else:
         if level and (not elem.tail or not elem.tail.strip()):
             elem.tail = i

def main():
     f = open('test.0.xml','w', encoding='utf8')
     f.write(original.strip())
     f.close()

     # cElementTree was removed in Python 3.9; ElementTree is the same API.
     from xml.etree.ElementTree import parse

     for i in range(10):
         tree = parse('test.%d.xml' % i)
         root = tree.getroot()
         root.set('c', str(i+1))
         t = root.find('.//element')
         t.text = '%d\t\n' % (i+1)
         indent(root)
         tree.write('test.%d.xml' % (i+1), encoding='utf8')

if __name__ == '__main__':
     main()
========================
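
If staying with minidom is preferred, one possible approach is to drop the whitespace-only text nodes after parsing, so that repeated toprettyxml() round trips stop adding blank lines. This is only a sketch: `strip_whitespace_nodes` and `round_trip` are hypothetical helper names, not part of minidom, and text that merely has leading or trailing whitespace around real content is left alone.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)

def round_trip(xml_text):
    """Parse, drop the formatting whitespace, and pretty-print again."""
    doc = parseString(xml_text)
    strip_whitespace_nodes(doc)
    return doc.toprettyxml(indent='  ')

# Made-up sample document with CRLF line endings.
source = '<doc a="1">\r\n    <element a="1">AnyText</element>\r\n</doc>'
once = round_trip(source)
twice = round_trip(once)
print(once == twice)
```

The same stripping step, followed by toxml() instead of toprettyxml(), would give output without the formatting line breaks between elements.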

Hope that helps,

Stefan

