Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Saran Ahluwalia <ahlusar.ahluwalia@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: Understanding how to quote XML string in order to serialize using Python's ElementTree
Date: Sat, 9 Jan 2016 18:13:29 -0500
Lines: 176
Message-ID: <mailman.103.1452381230.2305.python-list@python.org>
References: <d0a2acdb-857c-47c5-a28d-422a8fc4cc74@googlegroups.com> <569184DA.4080009@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <569184DA.4080009@gmail.com>
Precedence: list
Xref: csiph.com comp.lang.python:101418

As mentioned previously, I assigned the variable *data* to the string
'<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>'.
When any utility attempts to parse this string (this is one example of
millions found in my actual data source), you receive that error. You would
need to comment out the following:

        data = row['E']
        print(data) # The before
        data = data[1:-1]
        data = ""'{}'"".format(data)
        print(data) # Sanity check

to achieve the readout.
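
For reference, here is a minimal sketch of what those lines do to the value.
The sample value is copied from the "before" line of the printout quoted
below; the names sliced, formatted and parse_failed are only illustrative,
and I am assuming xml.etree.ElementTree rather than cElementTree. The
adjacent literals ""'{}'"" concatenate to '{}', so the format call returns
the sliced string unchanged, and the parse fails because the value already
appears to have been cut off at the comma after 0000'.

```python
import xml.etree.ElementTree as ET

# Value of row['E'] as shown in the "before" printout (already truncated).
data = '"<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000' + "'"

sliced = data[1:-1]                   # drops the leading " and the trailing '
formatted = ""'{}'"".format(sliced)   # "" + '{}' + "" is just '{}'

print(formatted == sliced)            # the format step changes nothing

# The fragment has no closing tags, so the parser cannot finish.
try:
    ET.fromstring(formatted)
    parse_failed = False
except ET.ParseError as exc:
    parse_failed = True
    print("ParseError:", exc)
```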

I am wondering whether there is a scalable solution to the above issue
(perhaps using some form of escape character or a regex?). I have tried many
ideas, but to no avail. There always seems to be an edge case that escapes
me. Thanks.
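
One possible direction, as a rough sketch only and not a tested solution for
the full data set: skip the csv module for this column, pull each
<Response>...</Response> fragment out of the raw line with a regex, and hand
it to ElementTree directly. An apostrophe is legal inside XML text, so the
fragment itself should not need escaping. The names RESPONSE_RE and
iter_responses are hypothetical, the sample line is an abbreviated version of
the row quoted below, and the pattern assumes Response elements are never
nested.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: one <Response ...>...</Response> fragment, matched
# non-greedily so that several fragments in one text are found separately.
RESPONSE_RE = re.compile(r"<Response\b.*?</Response>", re.DOTALL)


def iter_responses(raw_text):
    """Yield a parsed Element for each Response fragment in raw_text."""
    for match in RESPONSE_RE.finditer(raw_text):
        yield ET.fromstring(match.group(0))


# Abbreviated sample row (the Request column is shortened for illustration).
raw_line = (
    '"3","8","1","<Request TransactionID="3" RequestType="FOO"></Request>",'
    '"<Response TransactionID="2" RequestType="HoldInquiry">'
    "<ShareList>0000',0001,0070,</ShareList></Response>\","
    '"1967-12-25 22:18:13.471000"'
)

for root in iter_responses(raw_line):
    print(root.get("TransactionID"), root.findtext("ShareList"))
```

This sidesteps the quoting problem instead of repairing it, so fragments that
are malformed XML in their own right would still raise ParseError and need
separate handling.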

On Sat, Jan 9, 2016 at 5:08 PM, Karim <kliateni@gmail.com> wrote:

>
>
> On 09/01/2016 21:54, kbtyo wrote:
>
>> My specs:
>>
>> Python 3.4.3
>> Windows 7
>> IDE is Jupyter Notebooks
>>
>> What I have referenced:
>>
>> 1)
>> http://stackoverflow.com/questions/1546717/python-escaping-strings-for-use-in-xml
>>
>> 2)
>>
>> http://stackoverflow.com/questions/7802418/how-to-properly-escape-single-and-double-quotes
>>
>> 3)
>> http://stackoverflow.com/questions/4972210/escaping-characters-in-a-xml-file-with-python
>>
>>
>> Here is the data (in CSV format) and script, respectively, (I have tried
>> variations on serializing Column 'E' using both Sax and ElementTree):
>>
>> i)
>>
>> A,B,C,D,E,F,G,H,I,J
>> "3","8","1","<Request TransactionID="3" RequestType="FOO"><InstitutionISO /><CallID>23</CallID><MemberID>12</MemberID><MemberPassword /><RequestData><AccountNumber>2</AccountNumber><AccountSuffix>85</AccountSuffix><AccountType>S</AccountType><MPIAcctType>Checking</MPIAcctType><TransactionCount>10</TransactionCount></RequestData></Request>","<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>","1967-12-25 22:18:13.471000","2005-12-25 22:18:13.768000","2","70","0"
>>
>> ii)
>>
>> #!/usr/bin/python
>> # -*-  coding: utf-8 -*-
>> import os.path
>> import sys
>> import csv
>> from io import StringIO
>> import xml.etree.cElementTree as ElementTree
>> from xml.etree.ElementTree import XMLParser
>> import xml
>> import xml.sax
>> from xml.sax import ContentHandler
>>
>> class MyHandler(xml.sax.handler.ContentHandler):
>>      def __init__(self):
>>          self._charBuffer = []
>>          self._result = []
>>
>>      def _getCharacterData(self):
>>          data = ''.join(self._charBuffer).strip()
>>          self._charBuffer = []
>>          return data.strip() #remove strip() if whitespace is important
>>
>>      def parse(self, f):
>>          xml.sax.parse(f, self)
>>          return self._result
>>
>>      def characters(self, data):
>>          self._charBuffer.append(data)
>>
>>      def startElement(self, name, attrs):
>>          if name == 'Response':
>>              self._result.append({})
>>
>>      def endElement(self, name):
>>          if name != 'Response':
>>              self._result[-1][name] = self._getCharacterData()
>>
>> def read_data(path):
>>      with open(path, 'rU', encoding='utf-8') as data:
>>          reader = csv.DictReader(data, delimiter=',', quotechar="'",
>>                                  skipinitialspace=True)
>>          for row in reader:
>>              yield row
>>
>> if __name__ == "__main__":
>>      empty = ''
>>      Response = 'sample.csv'
>>      for idx, row in enumerate(read_data(Response)):
>>          if idx > 10: break
>>          data = row['E']
>>          print(data) # The before
>>          data = data[1:-1]
>>          data = ""'{}'"".format(data)
>>          print(data) # Sanity check
>> #         data = '<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>'
>>          try:
>>              root = ElementTree.XML(data)
>> #             print(root)
>>          except StopIteration:
>>              raise
>>              pass
>> #         xmlstring = StringIO(data)
>> #         print(xmlstring)
>> #         Handler = MyHandler().parse(xmlstring)
>>
>>
>> Specifically, due to the quoting in the CSV file (which is beyond my
>> control), I have had to resort to slicing the string (line 51) and then
>> formatting it (line 52).
>>
>> However the print out from the above attempt is as follows:
>>
>> "<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000'
>> <Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000
>>
>>    File "<string>", line unknown
>> ParseError: no element found: line 1, column 69
>> Interestingly - if I assign the variable "data" (as in line 54) I receive
>> this:
>>
>>    File "<ipython-input-80-7357c9272b92>", line 56
>> data = '<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>'
>>                       ^
>> SyntaxError: invalid token
>>
>> I seek feedback on the most Pythonic way to address this. Ideally, is
>> there a method that can leverage ElementTree? Thank you, in advance, for
>> your feedback and guidance.
>>
>
> I don't understand because this line 54 gives:
>
> >>> import xml.etree.cElementTree as ElementTree
> >>> data = '<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>'
>   File "<stdin>", line 1
>     data = '<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>'
> ^
> SyntaxError: invalid syntax
>
>
> But if you correct the string and remove the inner quote after 0000,
> everything's fine:
> >>> data = '<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000,0001,0070,</ShareList></Response>'
> >>> root = ElementTree.XML(data)
> >>> root
> <Element 'Response' at 0x7f0fb6dce330>
> Karim
>
>