From: "Prasad, Ramit"
To: Artie Ziff, "python-list@python.org"
Newsgroups: comp.lang.python
Subject: RE: xml data or other?
Date: Mon, 19 Nov 2012 21:42:00 +0000
References: <96b24715-cb4b-4588-844e-fc2e2f51a170@m4g2000pbd.googlegroups.com> <50A8E36A.5010606@gmail.com>
In-Reply-To: <50A8E36A.5010606@gmail.com>

Artie Ziff wrote:
>
> On 11/9/12 5:50 AM, rusi wrote:
> > On Nov 9, 5:54 pm, Artie Ziff wrote:
> > # submit correctedinput to etree
> I was very grateful to get the "leg up" on getting started down that
> right path with my coding. Many thanks to you, rusi. I took your
> excellent advice and have this working.
>
> class Converter():
>     PREFIX = """
>
> """
>     POSTFIX = ""
>     def __init__(self, data):
>         self.data = data
>         self.writeXML()
>     def writeXML(self):
>         pattern = re.compile('')
>         replaceStr = r''
>         xmlData = re.sub(pattern, replaceStr, self.data)
>         self.dataXML = self.PREFIX + xmlData.replace("\\", "/") + self.POSTFIX
>
> ### main
> # input to script is directory:
> # sanitize trailing slash
> testPkgDir = sys.argv[1].rstrip('/')
> # Within each test package directory is doc/testcases
> tcDocDir = "doc/testcases"
> # set input dir, containing broken files
> tcTxtDir = os.path.join(testPkgDir, tcDocDir)
> # set output dir, to write proper XML files
> tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
> if not os.path.exists(tcXmlDir):
>     os.makedirs(tcXmlDir)
> # iterate through files in input dir
> for filename in os.listdir(tcTxtDir):
>     # set filepaths
>     filepathTXT = os.path.join(tcTxtDir, filename)
>     base = os.path.splitext(filename)[0]
>     fileXML = base + ".xml"
>     filepathXML = os.path.join(tcXmlDir, fileXML)
>     # read broken file, convert to proper XML
>     with open(filepathTXT) as f:
>         c = Converter(f.read())
>     xmlFO = open(filepathXML, 'w')  # xmlFileObject
>     xmlFO.write(c.dataXML)
>     xmlFO.close()
>
> ###
>
> I am writing XML files to see what's happening. My plan is to
> keep xml data in memory and parse with xml.etree.ElementTree.
>
> Unfortunately, xml parsing fails due to angle brackets inside
> description tags. In particular, xml.etree.ElementTree.parse()
> aborts on '<' inside xml data such as the following:
>
>
>
> This testcase tests if crontab installs the cronjob
> and cron schedules the job correctly.
> <\description>
>
> ##
>
> What is the right way to handle the extra angle brackets?
> Substitute on a line-by-line basis, if that works?
> Or learn to write a simple stack-style parser, or
> recursive descent, as it may be called?

I think your description text should be in a CDATA section.
http://en.wikipedia.org/wiki/CDATA#CDATA_sections_in_XML

~Ramit


This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.
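
To make the CDATA suggestion concrete, here is a rough sketch of one way it might look. The sample data, the element names (testcase, description, cronjob) and the helper names wrap_in_cdata and protect_descriptions are all assumptions made up for illustration; the real files in this thread may be shaped differently, and the regex assumes description elements are not nested.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names and the stray <cronjob> are
# assumptions for illustration, not the poster's actual data.
BROKEN = """<testcase>
<description>
This testcase tests if crontab installs the <cronjob>
and cron schedules the job correctly.
</description>
</testcase>"""


def wrap_in_cdata(text):
    """Wrap text in a CDATA section, splitting any ']]>' it contains."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def protect_descriptions(xml_text):
    """Put the body of every <description> element inside CDATA."""
    pattern = re.compile(r"(<description>)(.*?)(</description>)", re.DOTALL)
    # A function is used as the replacement so that backslashes in the
    # description text are not interpreted as regex escapes.
    return pattern.sub(
        lambda m: m.group(1) + wrap_in_cdata(m.group(2)) + m.group(3),
        xml_text,
    )


fixed = protect_descriptions(BROKEN)
root = ET.fromstring(fixed)
# The stray angle brackets now come back as ordinary text.
print(root.find("description").text)
```

Parsing BROKEN directly raises xml.etree.ElementTree.ParseError because the stray tag is never closed. If CDATA is not wanted, escaping the description text with xml.sax.saxutils.escape() before building the XML is another option.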