Groups > comp.lang.python > #33019 > unrolled thread

xml data or other?

Started by	Artie Ziff <artie.ziff@gmail.com>
First post	2012-11-09 04:54 -0800
Last post	2012-11-13 06:05 -0800
Articles	9 — 5 participants

Back to article view | Back to comp.lang.python

  xml data or other? Artie Ziff <artie.ziff@gmail.com> - 2012-11-09 04:54 -0800
    Re: xml data or other? rusi <rustompmody@gmail.com> - 2012-11-09 05:50 -0800
      Re: xml data or other? Artie Ziff <artie.ziff@gmail.com> - 2012-11-18 05:32 -0800
        Re: xml data or other? rusi <rustompmody@gmail.com> - 2012-11-18 07:54 -0800
          Re: xml data or other? rusi <rustompmody@gmail.com> - 2012-11-18 07:58 -0800
      RE: xml data or other? "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-11-19 21:42 +0000
      Re: xml data or other? Stefan Behnel <stefan_ml@behnel.de> - 2012-11-20 06:48 +0100
    Re: xml data or other? shivers.paul@yahoo.co.uk - 2012-11-13 06:05 -0800
    Re: xml data or other? shivers.paul@yahoo.co.uk - 2012-11-13 06:05 -0800

#33019 — xml data or other?

From	Artie Ziff <artie.ziff@gmail.com>
Date	2012-11-09 04:54 -0800
Subject	xml data or other?
Message-ID	<mailman.3490.1352465695.27098.python-list@python.org>

Hello,

I want to process XML-like data like this:

<testname=ltpacpi.sh>
	<description>
		ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.

	<\description>
	<test_location>
		ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
	<\test_location>
<\testname>


After manually editing the data above, the python module 
xml.etree.ElementTree parses it without failing due to error in the data 
structure.

Edits were substituting '/' for '\' on the end tags, and adding the 
following structure:

<?xml version="1.0"?>
<data>
   <testname name=ltpacpi.sh>
     ...
   <\testname>
</data>


Is there a name for the format above (perhaps xhtml)?
I'd like to find a python module that can translate it to proper xml. 
Does one exist? etree?

Many thanks!
az

[toc] | [next] | [standalone]

#33023

From	rusi <rustompmody@gmail.com>
Date	2012-11-09 05:50 -0800
Message-ID	<96b24715-cb4b-4588-844e-fc2e2f51a170@m4g2000pbd.googlegroups.com>
In reply to	#33019

On Nov 9, 5:54 pm, Artie Ziff <artie.z...@gmail.com> wrote:
> Hello,
>
> I want to process XML-like data like this:
<snipped>
> Edits were substituting '/' for '\' on the end tags, and adding the
> following structure:

If thats all you want, you can try the following:

# obviously this should come from a file
input= """<testname=ltpacpi.sh>
        <description>
                ACPI (Advanced Control Power & Integration) testscript
for 2.5 kernels.

        <\description>
        <test_location>
                ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
        <\test_location>
<\testname>"""

prefix = """<?xml version="1.0"?>
<data>
"""

postfix = """</data>"""

correctedInput = prefix + input.replace("\\", "/") + postfix
# submit correctedinput to etree

[toc] | [prev] | [next] | [standalone]

#33498

From	Artie Ziff <artie.ziff@gmail.com>
Date	2012-11-18 05:32 -0800
Message-ID	<mailman.3790.1353245559.27098.python-list@python.org>
In reply to	#33023

On 11/9/12 5:50 AM, rusi wrote:
> On Nov 9, 5:54 pm, Artie Ziff <artie.z...@gmail.com> wrote:
> # submit correctedinput to etree 
I was very grateful to get the "leg up" on getting started down that 
right path with my coding. Many thanks to you, rusi. I took your 
excellent advices and have this working.

class Converter():
     PREFIX = """<?xml version="1.0"?>
     <data>
     """
     POSTFIX = "</data>"
     def __init__(self, data):
         self.data = data
         self.writeXML()
     def writeXML(self):
         pattern = re.compile('<testname=(.*)>')
         replaceStr = r'<testname name="\1">'
         xmlData = re.sub(pattern, replaceStr, self.data)
         self.dataXML = self.PREFIX + xmlData.replace("\\", "/") + 
self.POSTFIX

###  main
# input to script is directory:
# sanitize trailing slash
testPkgDir = sys.argv[1].rstrip('/')
# Within each test package directory is doc/testcase
tcDocDir = "doc/testcases"
# set input dir, containing broken files
tcTxtDir = os.path.join(testPkgDir, tcDocDir)
# set output dir, to write proper XML files
tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
if not os.path.exists(tcXmlDir):
     os.makedirs(tcXmlDir)
# iterate through files in input dir
for filename in os.listdir(tcTxtDir):
     # set filepaths
     filepathTXT = os.path.join(tcTxtDir, filename)
     base = os.path.splitext(filename)[0]
     fileXML = base + ".xml"
     filepathXML = os.path.join(tcXmlDir, fileXML)
     # read broken file, convert to proper XML
     with open(filepathTXT) as f:
         c = Converter(f.read())
         xmlFO = open(filepathXML, 'w')   # xmlFileObject
         xmlFO.write(c.dataXML)
         xmlFO.close()

###

Writing XML files so to see whats happening. My plan is to
keep xml data in memory and parse with xml.etree.ElementTree.

Unfortunately, xml parsing fails due to angle brackets inside
description tags. In particular, xml.etree.ElementTree.parse()
aborts on '<' inside xml data such as the following:

<testname name="cron_test.sh">
     <description>
         This testcase tests if crontab <filename> installs the cronjob
         and cron schedules the job correctly.
     <\description>

##

What is right way to handle the extra angle brackets?
Substitute on line-by-line basis, if that works?
Or learn to write a simple stack-style parser, or
recursive descent, it may be called?

I am open to comments to improve my code more to be more readable,
pythonic, or better.

Many thanks
AZ

[toc] | [prev] | [next] | [standalone]

#33500

From	rusi <rustompmody@gmail.com>
Date	2012-11-18 07:54 -0800
Message-ID	<633d181e-1337-456e-8e97-d187654c1f2a@uk1g2000pbb.googlegroups.com>
In reply to	#33498

On Nov 18, 6:32 pm, Artie Ziff <artie.z...@gmail.com> wrote:
> Unfortunately, xml parsing fails due to angle brackets inside
> description tags. In particular, xml.etree.ElementTree.parse()
> aborts on '<' inside xml data such as the following:
>
> <testname name="cron_test.sh">
>      <description>
>          This testcase tests if crontab <filename> installs the cronjob
>          and cron schedules the job correctly.
>      <\description>
>
> ##
>
> What is right way to handle the extra angle brackets?
> Substitute on line-by-line basis, if that works?
> Or learn to write a simple stack-style parser, or
> recursive descent, it may be called?
>
> I am open to comments to improve my code more to be more readable,
> pythonic, or better.
>
> Many thanks
> AZ

Start with cgi.escape perhaps?
http://docs.python.org/2/library/cgi.html

[toc] | [prev] | [next] | [standalone]

#33501

From	rusi <rustompmody@gmail.com>
Date	2012-11-18 07:58 -0800
Message-ID	<0b611cab-bd5c-495f-885c-0372e85c1770@kt16g2000pbb.googlegroups.com>
In reply to	#33500

On Nov 18, 8:54 pm, rusi <rustompm...@gmail.com> wrote:

> Start with cgi.escape perhaps?http://docs.python.org/2/library/cgi.html

This may be a better link for starters
http://wiki.python.org/moin/EscapingHtml
(Note the escaping xml at the bottom)

[toc] | [prev] | [next] | [standalone]

#33559

From	"Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date	2012-11-19 21:42 +0000
Message-ID	<mailman.14.1353362801.29569.python-list@python.org>
In reply to	#33023

Artie Ziff wrote:
> 
> On 11/9/12 5:50 AM, rusi wrote:
> > On Nov 9, 5:54 pm, Artie Ziff <artie.z...@gmail.com> wrote:
> > # submit correctedinput to etree
> I was very grateful to get the "leg up" on getting started down that
> right path with my coding. Many thanks to you, rusi. I took your
> excellent advices and have this working.
> 
> class Converter():
>      PREFIX = """<?xml version="1.0"?>
>      <data>
>      """
>      POSTFIX = "</data>"
>      def __init__(self, data):
>          self.data = data
>          self.writeXML()
>      def writeXML(self):
>          pattern = re.compile('<testname=(.*)>')
>          replaceStr = r'<testname name="\1">'
>          xmlData = re.sub(pattern, replaceStr, self.data)
>          self.dataXML = self.PREFIX + xmlData.replace("\\", "/") +
> self.POSTFIX
> 
> ###  main
> # input to script is directory:
> # sanitize trailing slash
> testPkgDir = sys.argv[1].rstrip('/')
> # Within each test package directory is doc/testcase
> tcDocDir = "doc/testcases"
> # set input dir, containing broken files
> tcTxtDir = os.path.join(testPkgDir, tcDocDir)
> # set output dir, to write proper XML files
> tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
> if not os.path.exists(tcXmlDir):
>      os.makedirs(tcXmlDir)
> # iterate through files in input dir
> for filename in os.listdir(tcTxtDir):
>      # set filepaths
>      filepathTXT = os.path.join(tcTxtDir, filename)
>      base = os.path.splitext(filename)[0]
>      fileXML = base + ".xml"
>      filepathXML = os.path.join(tcXmlDir, fileXML)
>      # read broken file, convert to proper XML
>      with open(filepathTXT) as f:
>          c = Converter(f.read())
>          xmlFO = open(filepathXML, 'w')   # xmlFileObject
>          xmlFO.write(c.dataXML)
>          xmlFO.close()
> 
> ###
> 
> Writing XML files so to see whats happening. My plan is to
> keep xml data in memory and parse with xml.etree.ElementTree.
> 
> Unfortunately, xml parsing fails due to angle brackets inside
> description tags. In particular, xml.etree.ElementTree.parse()
> aborts on '<' inside xml data such as the following:
> 
> <testname name="cron_test.sh">
>      <description>
>          This testcase tests if crontab <filename> installs the cronjob
>          and cron schedules the job correctly.
>      <\description>
> 
> ##
> 
> What is right way to handle the extra angle brackets?
> Substitute on line-by-line basis, if that works?
> Or learn to write a simple stack-style parser, or
> recursive descent, it may be called?

I think your description text should be in a CDATA section.
http://en.wikipedia.org/wiki/CDATA#CDATA_sections_in_XML

~Ramit


This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.

[toc] | [prev] | [next] | [standalone]

#33587

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2012-11-20 06:48 +0100
Message-ID	<mailman.30.1353390518.29569.python-list@python.org>
In reply to	#33023

Prasad, Ramit, 19.11.2012 22:42:
> Artie Ziff wrote:
>> Writing XML files so to see whats happening. My plan is to
>> keep xml data in memory and parse with xml.etree.ElementTree.
>>
>> Unfortunately, xml parsing fails due to angle brackets inside
>> description tags. In particular, xml.etree.ElementTree.parse()
>> aborts on '<' inside xml data such as the following:
>>
>> <testname name="cron_test.sh">
>>      <description>
>>          This testcase tests if crontab <filename> installs the cronjob
>>          and cron schedules the job correctly.
>>      <\description>
>>
>> ##
>>
>> What is right way to handle the extra angle brackets?
>> Substitute on line-by-line basis, if that works?
>> Or learn to write a simple stack-style parser, or
>> recursive descent, it may be called?
> 
> I think your description text should be in a CDATA section.
> http://en.wikipedia.org/wiki/CDATA#CDATA_sections_in_XML

Ah, don't bother with CDATA. Just make sure the data gets properly escaped,
any XML serialiser will do that for you. Just generate the XML using
ElementTree and you'll be fine. Generating XML as literal text is not a
good idea.

Stefan

[toc] | [prev] | [next] | [standalone]

#33241

From	shivers.paul@yahoo.co.uk
Date	2012-11-13 06:05 -0800
Message-ID	<834d312e-2fac-4e03-8533-2c4a6a33b9be@googlegroups.com>
In reply to	#33019

On Friday, November 9, 2012 12:54:56 PM UTC, Artie Ziff wrote:
> Hello,
> 
> 
> 
> I want to process XML-like data like this:
> 
> 
> 
> <testname=ltpacpi.sh>
> 
> 	<description>
> 
> 		ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.
> 
> 
> 
> 	<\description>
> 
> 	<test_location>
> 
> 		ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
> 
> 	<\test_location>
> 
> <\testname>
> 
> 
> 
> 
> 
> After manually editing the data above, the python module 
> 
> xml.etree.ElementTree parses it without failing due to error in the data 
> 
> structure.
> 
> 
> 
> Edits were substituting '/' for '\' on the end tags, and adding the 
> 
> following structure:
> 
> 
> 
> <?xml version="1.0"?>
> 
> <data>
> 
>    <testname name=ltpacpi.sh>
> 
>      ...
> 
>    <\testname>
> 
> </data>
> 
> 
> 
> 
> 
> Is there a name for the format above (perhaps xhtml)?
> 
> I'd like to find a python module that can translate it to proper xml. 
> 
> Does one exist? etree?
> 
> 
> 
> Many thanks!
> 
> az

maybe an xml tool would be better, a good list of xml tools here; http://www.xml-data.info

[toc] | [prev] | [next] | [standalone]

#33242

From	shivers.paul@yahoo.co.uk
Date	2012-11-13 06:05 -0800
Message-ID	<mailman.3627.1352815553.27098.python-list@python.org>
In reply to	#33019

On Friday, November 9, 2012 12:54:56 PM UTC, Artie Ziff wrote:
> Hello,
> 
> 
> 
> I want to process XML-like data like this:
> 
> 
> 
> <testname=ltpacpi.sh>
> 
> 	<description>
> 
> 		ACPI (Advanced Control Power & Integration) testscript for 2.5 kernels.
> 
> 
> 
> 	<\description>
> 
> 	<test_location>
> 
> 		ltp/testcases/kernel/device-drivers/acpi/ltpacpi.sh
> 
> 	<\test_location>
> 
> <\testname>
> 
> 
> 
> 
> 
> After manually editing the data above, the python module 
> 
> xml.etree.ElementTree parses it without failing due to error in the data 
> 
> structure.
> 
> 
> 
> Edits were substituting '/' for '\' on the end tags, and adding the 
> 
> following structure:
> 
> 
> 
> <?xml version="1.0"?>
> 
> <data>
> 
>    <testname name=ltpacpi.sh>
> 
>      ...
> 
>    <\testname>
> 
> </data>
> 
> 
> 
> 
> 
> Is there a name for the format above (perhaps xhtml)?
> 
> I'd like to find a python module that can translate it to proper xml. 
> 
> Does one exist? etree?
> 
> 
> 
> Many thanks!
> 
> az

maybe an xml tool would be better, a good list of xml tools here; http://www.xml-data.info

[toc] | [prev] | [standalone]

csiph-web

xml data or other?

Contents

#33019 — xml data or other?

#33023

#33498

#33500

#33501

#33559

#33587

#33241

#33242