Re: xml data or other?

Date Sun, 18 Nov 2012 05:32:26 -0800
From Artie Ziff <artie.ziff@gmail.com>
Subject Re: xml data or other?
Newsgroups comp.lang.python
Message-ID <mailman.3790.1353245559.27098.python-list@python.org>

On 11/9/12 5:50 AM, rusi wrote:
> On Nov 9, 5:54 pm, Artie Ziff <artie.z...@gmail.com> wrote:
> # submit corrected input to etree
I was very grateful to get the "leg up" on getting started down the
right path with my coding. Many thanks to you, rusi. I took your
excellent advice and have this working.

import os
import re
import sys

class Converter:
    PREFIX = """<?xml version="1.0"?>
    <data>
    """
    POSTFIX = "</data>"

    def __init__(self, data):
        self.data = data
        self.writeXML()

    def writeXML(self):
        # rewrite <testname=foo> as <testname name="foo">
        pattern = re.compile('<testname=(.*)>')
        replaceStr = r'<testname name="\1">'
        xmlData = re.sub(pattern, replaceStr, self.data)
        # turn backslashes (as in <\description>) into forward slashes
        self.dataXML = self.PREFIX + xmlData.replace("\\", "/") + self.POSTFIX

###  main
# input to script is directory:
# sanitize trailing slash
testPkgDir = sys.argv[1].rstrip('/')
# Within each test package directory is doc/testcases
tcDocDir = "doc/testcases"
# set input dir, containing broken files
tcTxtDir = os.path.join(testPkgDir, tcDocDir)
# set output dir, to write proper XML files
tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
if not os.path.exists(tcXmlDir):
    os.makedirs(tcXmlDir)
# iterate through files in input dir
for filename in os.listdir(tcTxtDir):
    # set filepaths
    filepathTXT = os.path.join(tcTxtDir, filename)
    base = os.path.splitext(filename)[0]
    fileXML = base + ".xml"
    filepathXML = os.path.join(tcXmlDir, fileXML)
    # read broken file, convert to proper XML
    with open(filepathTXT) as f:
        c = Converter(f.read())
    # write the converted text; the with block closes the file
    with open(filepathXML, 'w') as xmlFO:   # xmlFileObject
        xmlFO.write(c.dataXML)

###

I am writing the XML files out so I can see what's happening. My plan
is to keep the XML data in memory and parse it with xml.etree.ElementTree.
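
A minimal sketch of that in-memory plan, assuming the converted text is
already well-formed. The sample string and the variable names below are
made up for illustration; they stand in for Converter.dataXML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed stand-in for Converter.dataXML
xml_text = """<?xml version="1.0"?>
<data>
<testname name="cron_test.sh">
    <description>Checks that crontab installs the cronjob.</description>
</testname>
</data>"""

# fromstring() parses a string directly, so no file on disk is needed
root = ET.fromstring(xml_text)

# collect the name attribute of every testname element
names = [t.get("name") for t in root.iter("testname")]
# collect the text of every description element
descriptions = [d.text for d in root.iter("description")]
```

ET.parse() expects a filename or file object, so fromstring() is the
closer fit when the data never touches disk.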

Unfortunately, xml parsing fails due to angle brackets inside
description tags. In particular, xml.etree.ElementTree.parse()
aborts on '<' inside xml data such as the following:

<testname name="cron_test.sh">
     <description>
         This testcase tests if crontab <filename> installs the cronjob
         and cron schedules the job correctly.
     <\description>
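
The failure can be reproduced in a few lines. This is only a sketch: the
string below is the sample above after the backslash fix, wrapped in a
<data> root and given a closing </testname>, which I am assuming the
real files have. The parser reads <filename> as an opening element that
is never closed, so the following </description> does not match.

```python
import xml.etree.ElementTree as ET

# Sample data after the <\description> -> </description> substitution
broken = """<data>
<testname name="cron_test.sh">
    <description>
        This testcase tests if crontab <filename> installs the cronjob
        and cron schedules the job correctly.
    </description>
</testname>
</data>"""

try:
    ET.fromstring(broken)
    error = None
except ET.ParseError as exc:
    # <filename> is still open when </description> arrives
    error = exc
```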

##

What is the right way to handle the extra angle brackets?
Substitute on a line-by-line basis, if that works?
Or learn to write a simple stack-style parser, or
recursive descent, as it may be called?
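
One possible substitution approach, sketched here under a loud
assumption: the files use only a small, known set of element names, so
anything else in angle brackets can be treated as plain text and escaped.
KNOWN_TAGS, TAG_RE, and escape_unknown_tags are names invented for this
sketch, not part of any library.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: these are the only element names used as real markup
KNOWN_TAGS = {"data", "testname", "description"}

# Matches <name ...> or </name ...>; the XML declaration (<?xml ...?>)
# does not match because a name must start with a letter or underscore
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)([^<>]*)>")

def escape_unknown_tags(text):
    """Escape the angle brackets of any tag whose name is not known."""
    def fix(match):
        if match.group(2) in KNOWN_TAGS:
            return match.group(0)          # real markup, leave untouched
        return "&lt;" + match.group(0)[1:-1] + "&gt;"
    return TAG_RE.sub(fix, text)

sample = """<data>
<testname name="cron_test.sh">
    <description>
        This testcase tests if crontab <filename> installs the cronjob
        and cron schedules the job correctly.
    </description>
</testname>
</data>"""

fixed = escape_unknown_tags(sample)
root = ET.fromstring(fixed)
desc = root.find("testname/description").text
```

This does not cover a lone '<' such as "a < b", nor a description that
mentions one of the known tag names literally; those cases would need
more than a regex.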

I am open to comments on how to make my code more readable,
pythonic, or better.

Many thanks
AZ
