From: Artie Ziff
Date: Sun, 18 Nov 2012 05:32:26 -0800
Subject: Re: xml data or other?
Newsgroups: comp.lang.python

On 11/9/12 5:50 AM, rusi wrote:
> On Nov 9, 5:54 pm, Artie Ziff wrote:
> # submit correctedinput to etree

I was very grateful to get the "leg up" on getting started down the right path with my coding. Many thanks to you, rusi. I took your excellent advice and have this working.

import os
import re
import sys


class Converter():
    PREFIX = """ """
    POSTFIX = ""

    def __init__(self, data):
        self.data = data
        self.writeXML()

    def writeXML(self):
        pattern = re.compile('')
        replaceStr = r''
        xmlData = re.sub(pattern, replaceStr, self.data)
        # the broken files use backslashes in closing tags; turn them into slashes
        self.dataXML = self.PREFIX + xmlData.replace("\\", "/") + self.POSTFIX

### main
# input to script is directory:
# sanitize trailing slash
testPkgDir = sys.argv[1].rstrip('/')
# Within each test package directory is doc/testcases
tcDocDir = "doc/testcases"
# set input dir, containing broken files
tcTxtDir = os.path.join(testPkgDir, tcDocDir)
# set output dir, to write proper XML files
tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
if not os.path.exists(tcXmlDir):
    os.makedirs(tcXmlDir)
# iterate through files in input dir
for filename in os.listdir(tcTxtDir):
    # set filepaths
    filepathTXT = os.path.join(tcTxtDir, filename)
    base = os.path.splitext(filename)[0]
    fileXML = base + ".xml"
    filepathXML = os.path.join(tcXmlDir, fileXML)
    # read broken file, convert to proper XML
    with open(filepathTXT) as f:
        c = Converter(f.read())
    with open(filepathXML, 'w') as xmlFO:  # xmlFileObject
        xmlFO.write(c.dataXML)
###

For now I am writing the XML files to disk so that I can see what is happening. My plan is to keep the XML data in memory and parse it with xml.etree.ElementTree.

Unfortunately, XML parsing fails due to angle brackets inside the description tags. In particular, xml.etree.ElementTree.parse() aborts on '<' inside XML data such as the following:

    This testcase tests if crontab installs the cronjob and cron schedules the job correctly. <\description>

##

What is the right way to handle the extra angle brackets? Substitute on a line-by-line basis, if that works? Or should I learn to write a simple stack-style parser (recursive descent, it may be called)?

I am open to comments on how to make my code more readable, more pythonic, or simply better.

Many thanks
AZ
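
P.S. To make the substitution idea concrete, here is a minimal sketch of what I have in mind. It rests on several assumptions: the element is literally named "description", it carries no attributes, its closing tag has already been repaired to use a forward slash, and its body never contains a literal closing description tag or text that is already entity-escaped. The names escape_description and parse_in_memory are made up for this example.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: the element is named "description" and has no attributes.
DESCRIPTION_RE = re.compile(r"(<description>)(.*?)(</description>)", re.DOTALL)


def escape_description(xml_text):
    """Escape &, < and > inside each <description>...</description> body."""
    def _fix(match):
        opening, body, closing = match.groups()
        return opening + escape(body) + closing
    return DESCRIPTION_RE.sub(_fix, xml_text)


def parse_in_memory(xml_text):
    """Parse the repaired text directly, without an intermediate file."""
    return ET.fromstring(escape_description(xml_text))


if __name__ == "__main__":
    # Made-up sample data, not taken from the real test packages.
    sample = (
        "<testcase>"
        "<description>Check that a < b and b > c.</description>"
        "</testcase>"
    )
    root = parse_in_memory(sample)
    print(root.find("description").text)
```

Because ET.fromstring() takes a string, this would also cover keeping the data in memory rather than writing files first. A body that is already partly escaped would get escaped twice, so this is only a starting point.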