Groups > comp.lang.python > #92868 > unrolled thread

JSON Object to CSV File Troubleshooting

Started by	Sahlusar <ahlusar.ahluwalia@gmail.com>
First post	2015-06-18 18:47 -0700
Last post	2015-06-21 16:42 -0700
Articles	14 — 6 participants

Back to article view | Back to comp.lang.python

  JSON Object to CSV File Troubleshooting Sahlusar <ahlusar.ahluwalia@gmail.com> - 2015-06-18 18:47 -0700
    Re: JSON Object to CSV File Troubleshooting Steve Hayes <hayesstw@telkomsa.net> - 2015-06-19 07:23 +0200
    Re: JSON Object to CSV File Troubleshooting Denis McMahon <denismfmcmahon@gmail.com> - 2015-06-21 06:50 +0000
    Re: JSON Object to CSV File Troubleshooting Denis McMahon <denismfmcmahon@gmail.com> - 2015-06-21 06:53 +0000
      Re: JSON Object to CSV File Troubleshooting Sahlusar <sahluwalia@wynyardgroup.com> - 2015-06-21 06:58 -0700
      Re: JSON Object to CSV File Troubleshooting Sahlusar <ahlusar.ahluwalia@gmail.com> - 2015-06-21 07:38 -0700
        Re: JSON Object to CSV File Troubleshooting Denis McMahon <denismfmcmahon@gmail.com> - 2015-06-21 21:06 +0000
        Re: JSON Object to CSV File Troubleshooting Joonas Liik <liik.joonas@gmail.com> - 2015-06-22 00:55 +0300
          Re: JSON Object to CSV File Troubleshooting Denis McMahon <denismfmcmahon@gmail.com> - 2015-06-21 23:33 +0000
            Re: JSON Object to CSV File Troubleshooting Sahlusar <ahlusar.ahluwalia@gmail.com> - 2015-06-21 16:56 -0700
              Re: JSON Object to CSV File Troubleshooting Denis McMahon <denismfmcmahon@gmail.com> - 2015-06-23 06:34 +0000
                Re: JSON Object to CSV File Troubleshooting Sahlusar <ahlusar.ahluwalia@gmail.com> - 2015-06-23 04:32 -0700
            Re: JSON Object to CSV File Troubleshooting Chris Angelico <rosuav@gmail.com> - 2015-06-23 10:43 +1000
          Re: JSON Object to CSV File Troubleshooting Sahlusar <ahlusar.ahluwalia@gmail.com> - 2015-06-21 16:42 -0700

#92868 — JSON Object to CSV File Troubleshooting

From	Sahlusar <ahlusar.ahluwalia@gmail.com>
Date	2015-06-18 18:47 -0700
Subject	JSON Object to CSV File Troubleshooting
Message-ID	<d5bc266f-2e87-406f-b3f8-9c784112874b@googlegroups.com>

Good Evening,

I have a conundrum regarding JSON objects and converting them to CSV:

Context

I am converting XML files to a JSON object (please see snippet below) and then finally producing a CSV file. Here is a an example JSON object:

			"PAC": {
				"Account": [{
					"PC": "0",
					"CMC": "0",
					"WC": "0",
					"DLA": "0",
					"CN": null,
					"FC": {
						"Int32": ["0",
						"0",
						"0",
						"0",
						"0"]
					},
					"F": {
						"Description": null,
						"Code": "0"
					}

In general, when I convert any of the files from JSON to CSV, I have been successful when using the following strategy (credit to Peter Otten):


import csv 
import json 
import sys 

def hook(obj): 
    return obj 

def flatten(obj): 
    for k, v in obj: 
        if isinstance(v, list): 
            yield from flatten(v) 
        else: 
            yield k, v 

if __name__ == "__main__": 
    with open("somefileneame.json") as f: 
        data = json.load(f, object_pairs_hook=hook) 

    pairs = list(flatten(data)) 

    writer = csv.writer(sys.stdout) 
    header = writer.writerow([k for k, v in pairs]) 
    row = writer.writerow([v for k, v in pairs]) #writer.writerows for any other iterable object


However with the example JSON object (above) i receive the following error when applying this function:

ValueError: too many values to unpack 


Here are some more samples. 

"FC": {"Int32": ["0","0","0","0","0","0"]} 
"PBA": {"Double": ["0","0","0","0","0","0","0","0"]} 

3.          "PBDD": { 
                                                "DateTime": ["1/1/0001 12:00:00 AM", 
                                                "1/1/0001 12:00:00 AM", 
                                                "1/1/0001 12:00:00 AM", 
                                                "1/1/0001 12:00:00 AM", 
                                                "1/1/0001 12:00:00 AM", 
                                                "1/1/0001 12:00:00 AM", 
                                                "1/1/0001 12:00:00 AM", 
                                                "1/1/0001 12:00:00 AM"] 
                                        }, 



In the above example, I would like to remove the keys Int32, Double and DateTime. I am wondering if there is a function or methodology that would allow me to remove such nested keys and reassign the new keys to the outer key (in this case above FC, PBA and PBDD) as column headers in a CSV and concatenate all of the values within the list (as corresponding fields). 

Also, here is how I strategized my XML to CSV conversion (if this is of any use):


import xml.etree.cElementTree as ElementTree
from xml.etree.ElementTree import XMLParser
import json 
import csv
import tokenize
import token
try:
    from collections import OrderedDict
    import json
except ImportError:
    from ordereddict import OrderedDict
    import simplejson as json
import itertools
import six
import string
from csvkit import CSVKitWriter


class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            if element:
                # treat like dict
                if len(element) == 1 or element[0].tag != element[1].tag:
                    self.append(XmlDictConfig(element))
                # treat like list
                elif element[0].tag == element[1].tag:
                    self.append(XmlListConfig(element))
            elif element.text:
                text = element.text.strip()
                if text:
                    self.append(text)


class XmlDictConfig(dict):
    '''
    Example usage:

    >>> tree = ElementTree.parse('your_file.xml')
    >>> root = tree.getroot()
    >>> xmldict = XmlDictConfig(root)

    Or, if you want to use an XML string:

    >>> root = ElementTree.XML(xml_string)
    >>> xmldict = XmlDictConfig(root)

    And then use xmldict for what it is..a dictionary.
    '''
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            if element:
                # treat like dict - we assume that if the first two tags
                # in a series are different, then they are all different.
                if len(element) == 1 or element[0].tag != element[1].tag:
                    aDict = XmlDictConfig(element)
                # treat like list - we assume that if the first two tags
                # in a series are the same, then the rest are the same.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself 
                    aDict = {element[0].tag: XmlListConfig(element)}
                # if the tag has attributes, add those to the dict
                if element.items():
                    aDict.update(dict(element.items()))
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a 
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})



def main():

    #Lines 88-89stantiate the class Elementree
    #and applies the method to recursively traverse from the root node
    #XmlDictConfig is instantiated in line 90

    with open('C:\\Users\\wynsa2\\Desktop\\Python Folder\\PCSU\\Trial2_PCSU\\2-Response.xml', 'r', encoding='utf-8') as f:
        xml_string = f.read()
    xml_string= xml_string.replace('&#x0;', '')
    root = ElementTree.XML(xml_string)
    xmldict = XmlDictConfig(root)
    json_str = json.dumps(xmldict, sort_keys=True, indent=4, separators=(',', ': '))
    newly_formatted_data = json.loads(json_str) #encode into JSON 
    with open('data2.json', 'w') as f:  #writing JSON file
        json.dump(newly_formatted_data, f) 



I hope that I was clear in my description. Thank you all for your help. 

Sincerely,
Saran

[toc] | [next] | [standalone]

#92870

From	Steve Hayes <hayesstw@telkomsa.net>
Date	2015-06-19 07:23 +0200
Message-ID	<gm97oa94ubm30c7euo3d42p60eae87utc6@4ax.com>
In reply to	#92868

On Thu, 18 Jun 2015 18:47:30 -0700 (PDT), Sahlusar
<ahlusar.ahluwalia@gmail.com> wrote:

>Good Evening,
>
>I have a conundrum regarding JSON objects and converting them to CSV:

That's the THIRD time you've asked this, in three separate threads.

Why don't you read the answers you were given the first time?

[follow-ups set]


-- 
Steve Hayes from Tshwane, South Africa
Web:  http://www.khanya.org.za/stevesig.htm
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk

[toc] | [prev] | [next] | [standalone]

#92932

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2015-06-21 06:50 +0000
Message-ID	<mm5mr3$5m8$4@dont-email.me>
In reply to	#92868

On Thu, 18 Jun 2015 18:47:30 -0700, Sahlusar wrote:

> I have a conundrum regarding JSON objects and converting them to CSV:
> 
> Context
> 
> I am converting XML files to a JSON object (please see snippet below)
> and then finally producing a CSV file. Here is a an example JSON object:

This is where you're going wrong. If you want CSV data, take the XML and 
generate CSV data from it. Converting and writing it out to JSON and then 
reading it back and converting to CSV involves an extra conversion step 
where errors can creep in.

If you want to convert XML to CSV, go straight from XML to CSV, there is 
little added value in using some arbitrary intermediate format unless 
you're actually going to use the data in the intermediate format for 
something other than converting to the final format.

-- 
Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [next] | [standalone]

#92933

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2015-06-21 06:53 +0000
Message-ID	<mm5n0u$5m8$5@dont-email.me>
In reply to	#92868

On Thu, 18 Jun 2015 18:47:30 -0700, Sahlusar wrote:

> I have a conundrum regarding JSON objects and converting them to CSV:

I think your conundrum is that you've taken on a coding task beyond your 
abilities to comprehend, and as a result not only can you not code it, 
you can't even adequately describe it.

At least, it seems that every time you do try and describe it either the 
data format or the task description changes.

-- 
Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [next] | [standalone]

#92960

From	Sahlusar <sahluwalia@wynyardgroup.com>
Date	2015-06-21 06:58 -0700
Message-ID	<c18d65d8-d575-4711-a660-ef84bc796aad@googlegroups.com>
In reply to	#92933

On Sunday, 21 June 2015 02:54:48 UTC-4, Denis McMahon  wrote:
> On Thu, 18 Jun 2015 18:47:30 -0700, Sahlusar wrote:
> 
> > I have a conundrum regarding JSON objects and converting them to CSV:
> 
> I think your conundrum is that you've taken on a coding task beyond your 
> abilities to comprehend, and as a result not only can you not code it, 
> you can't even adequately describe it.
> 
> At least, it seems that every time you do try and describe it either the 
> data format or the task description changes.
> 
> -- 
> Denis McMahon, denismfmcmahon@gmail.com

It is difficult to explain this to someone asynchronously and without an in person meeting. Moreover, the strict guidelines for disclosing information make it difficult for me to explain the client's requirements and the problems that they face. 

I do agree with you Denis that this is an unconventional approach.

[toc] | [prev] | [next] | [standalone]

#92962

From	Sahlusar <ahlusar.ahluwalia@gmail.com>
Date	2015-06-21 07:38 -0700
Message-ID	<44a2d55f-9312-4b70-8b23-637466b21b11@googlegroups.com>
In reply to	#92933

It is difficult to explain this to someone asynchronously and without an in person meeting. Moreover, the strict guidelines for disclosing information make it difficult for me to explain the client's requirements and the problems that they face. 

I do agree with you Denis that this is an unconventional approach. I was wondering then that perhaps I should add additional functionality at the XML to JSON step? So far, with JSON objects without nested lists (as values) I have been successful with this (the following is rather lengthy):

import xml.etree.cElementTree as ElementTree 
from xml.etree.ElementTree import XMLParser 
import json 
import csv 
import tokenize 
import token 
try: 
    from collections import OrderedDict 
    import json 
except ImportError: 
    from ordereddict import OrderedDict 
    import simplejson as json 
import itertools 
import six 
import string 
from csvkit import CSVKitWriter 


class XmlListConfig(list): 
    def __init__(self, aList): 
        for element in aList: 
            if element: 
                # treat like dict 
                if len(element) == 1 or element[0].tag != element[1].tag: 
                    self.append(XmlDictConfig(element)) 
                # treat like list 
                elif element[0].tag == element[1].tag: 
                    self.append(XmlListConfig(element)) 
            elif element.text: 
                text = element.text.strip() 
                if text: 
                    self.append(text) 


class XmlDictConfig(dict): 
    ''' 
    Example usage: 

    >>> tree = ElementTree.parse('your_file.xml') 
    >>> root = tree.getroot() 
    >>> xmldict = XmlDictConfig(root) 

    Or, if you want to use an XML string: 

    >>> root = ElementTree.XML(xml_string) 
    >>> xmldict = XmlDictConfig(root) 

    And then use xmldict for what it is..a dictionary. 
    ''' 
    def __init__(self, parent_element): 
        if parent_element.items(): 
            self.update(dict(parent_element.items())) 
        for element in parent_element: 
            if element: 
                # treat like dict - we assume that if the first two tags 
                # in a series are different, then they are all different. 
                if len(element) == 1 or element[0].tag != element[1].tag: 
                    aDict = XmlDictConfig(element) 
                # treat like list - we assume that if the first two tags 
                # in a series are the same, then the rest are the same. 
                else: 
                    # here, we put the list in dictionary; the key is the 
                    # tag name the list elements all share in common, and 
                    # the value is the list itself 
                    aDict = {element[0].tag: XmlListConfig(element)} 
                # if the tag has attributes, add those to the dict 
                if element.items(): 
                    aDict.update(dict(element.items())) 
                self.update({element.tag: aDict}) 
            # this assumes that if you've got an attribute in a tag, 
            # you won't be having any text. This may or may not be a 
            # good idea -- time will tell. It works for the way we are 
            # currently doing XML configuration files... 
            elif element.items(): 
                self.update({element.tag: dict(element.items())}) 
            # finally, if there are no child tags and no attributes, extract 
            # the text 
            else: 
                self.update({element.tag: element.text}) 



def main(): 

    #Lines 88-89stantiate the class Elementree 
    #and applies the method to recursively traverse from the root node 
    #XmlDictConfig is instantiated in line 90 

    with open('C:\\Users\\somefile.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    xmldict = XmlDictConfig(root) 
    json_str = json.dumps(xmldict, sort_keys=True, indent=4, separators=(',', ': ')) 
    newly_formatted_data = json.loads(json_str) #encode into JSON 
    with open('data2.json', 'w') as f:  #writing JSON file 
        json.dump(newly_formatted_data, f) 

Peter Otten was very helpful with subsequently converting aJ SON string to a CSV file:


import csv 
import json 
import sys 

def hook(obj): 
    return obj 

def flatten(obj): 
    for k, v in obj: 
        if isinstance(v, list): 
            yield from flatten(v) 
        else: 
            yield k, v 

if __name__ == "__main__": 
    with open("somefileneame.json") as f: 
        data = json.load(f, object_pairs_hook=hook) 

    pairs = list(flatten(data)) 

    writer = csv.writer(sys.stdout) 
    header = writer.writerow([k for k, v in pairs]) 
    row = writer.writerow([v for k, v in pairs]) #writer.writerows for any other iterable object 

However, for the nested keys as dictionaries with values as dictionaries, I have been taking smaller chunks of the JSON objects and developing some functional programs that allow me to blow up lists:

For example for this example JSON snippet, to de-nest by level the nested dictionaries, I use this function:

def flatten(d, parent_key=''):
    items = []
    for k, v in d.items():
        try:
            items.extend(flatten(v, '%s%s_' % (parent_key, k)).items())
        except AttributeError:
            items.append(('%s%s' % (parent_key, k), v))
    return dict(items)
final = (flatten(data2, parent_key =''))


##JSON sample:

data2 = {
    "OTF": "0",
    "F": "False",
    "F": {
        "Int32": ["0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0",
        "0"]
    },
                "D": {
                    "B": ["0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0"]
                },
                "PBDS": {
                    "DateTime": ["1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM",
                    "1/1/0001 12:00:00 AM"]
                },
                "PBDS": {
                    "Double": ["0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0",
                    "0"]
                },
                "SCS": {
                    "String": ["1",
                    "2"]
                }
            }

The result:


{'D_B': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
 'F_Int32': ['0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0'],
 'OTF': '0',
 'PBDS_Double': ['0', '0', '0', '0', '0', '0', '0', '0'],
 'SCS_String': ['1', '2']}

I then have used a decorator to explode the lists (thank you to Gribouillis for feedback):

from postprocess import post_process
@post_process(dict)
def explode_lists(adict):
    for key, value in adict.items():
        if isinstance(value, list):
            if all(isinstance(x, str) for x in value):
                for i, x in enumerate(value, 1):
                    yield ('{}{}'.format(key, i), x)
                continue
        yield key, value


# postprocess.py
def post_process(*filters):
    """Decorator to post process a function's return value through a
    sequence of filters (functions with a single argument).
    Example:
        @post_process(f1, f2, f3)
        def f(*args, **kwd):
            ...
            return value
        then calling f(...) will actually return f3( f2( f1( f(...)))).
        This can also be used to convert a generator to a function
        returning a sequence type:
        @post_process(dict)
        def my_generator():
            ...
            yield key, value
    """
    def decorate(func):
        from functools import wraps
        @wraps(func)
        def wrapper(*args, **kwd):
            rv = func(*args, **kwd)
            for f in filters:
                rv = f(rv)
            return rv
        return wrapper
return decorate

I know that this is alot of sequential steps. I am wondering if I could insert or conditionally pass these functions when originally parsing the XML, so that the JSON is formatted for more recursive reading of the JSON dictionary and then writing to CSV? I welcome constructive feedback for refactoring....

[toc] | [prev] | [next] | [standalone]

#92976

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2015-06-21 21:06 +0000
Message-ID	<mm78vs$n8l$2@dont-email.me>
In reply to	#92962

On Sun, 21 Jun 2015 07:38:13 -0700, Sahlusar wrote:

> It is difficult to explain this to someone asynchronously and without an
> in person meeting. Moreover, the strict guidelines for disclosing
> information make it difficult for me to explain the client's
> requirements and the problems that they face.
> 
> I do agree with you Denis that this is an unconventional approach. I was
> wondering then that perhaps I should add additional functionality at the
> XML to JSON step? So far, with JSON objects without nested lists (as
> values) I have been successful with this (the following is rather
> lengthy):

No, step back and force yourself to answer these questions:

Why use JSON as an intermediate step?
What benefit does using JSON as an intermediate step bring me?

I see no evidence in any of your posts that the use of JSON as an 
intermediate format for the data brings any benefit whatsoever, however I 
have seen evidence that it may be introducing errors and potential data 
loss, and it is certainly adding coding complexity.

None of these are good reasons to do it, and all of them are good reasons 
not to do it.

If your data is in XML and your requirement is for CSV, then you should 
be converting from XML to CSV.

Also stop posting reams of code. No-one is reading it. If you have a 
specific error you need to fix, then post a shortest possible example of 
code that generates the error. This should never be more than about 10 
lines.

-- 
Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [next] | [standalone]

#92979

From	Joonas Liik <liik.joonas@gmail.com>
Date	2015-06-22 00:55 +0300
Message-ID	<mailman.685.1434923721.13271.python-list@python.org>
In reply to	#92962

On 21 June 2015 at 17:38, Sahlusar <ahlusar.ahluwalia@gmail.com> wrote:
>
> [snip]
> I do agree with you Denis that this is an unconventional approach. I was wondering then that perhaps I should add additional functionality at the XML to JSON step? So far, with JSON objects without nested lists (as values) I have been successful with this (the following is rather lengthy):
> [snip]


> ##JSON sample:
>
> data2 = {
>     "OTF": "0",
>     "F": "False",
>     "F": {
>         "Int32": ["0",
>         "0",
>         "0",
>         "0"]
>     },
> [snip]
>                 "PBDS": {
>                     "DateTime": ["1/1/0001 12:00:00 AM",
>                     "1/1/0001 12:00:00 AM",
>                     "1/1/0001 12:00:00 AM",
>                     "1/1/0001 12:00:00 AM"]
>                 },
>                 "PBDS": {
>                     "Double": ["0",
>                     "0",
>                     "0"]
>                 },
>                 "SCS": {
>                     "String": ["1",
>                     "2"]
>                 }
>             }
>
> The result:

and compare those closely now....

>
> {'D_B': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
>  'F_Int32': ['0',
>   '0',
>   '0',
>   '0'],
>  'OTF': '0',
>  'PBDS_Double': ['0', '0', '0', '0', '0', '0', '0', '0'],
>  'SCS_String': ['1', '2']}
>
Notice in the original text you have 2 entries under the name F and
later 2 entiries under the name PBDS. in the result you are missing
the first entry of each.
you say you have succeeded in generating json, unless you meant to
throw away huge swafts of data i would say... nope..



[snip]
>
> I know that this is alot of sequential steps. I am wondering if I could insert or conditionally pass these functions when originally parsing the XML, so that the JSON is formatted for more recursive reading of the JSON dictionary and then writing to CSV? I welcome constructive feedback for refactoring....

theres things you could do to fix up the generated json .. tho really,

stop generating json when you need to generate csv.
you are winning nothing. you are losing.. well pretty much .. a little
of everything .. by doing this

there are fundemental properties of xml and json you fail to grasp,
you are touting code claiming that it works when the output it
produces is horribly deformed :(

In xml for instance this is valid:

<a>
 <b>1</b>
</a>
.. and so is this:
<a>
 <b>1</b>
 <b>2</b>
</a>

a naive translatio n of the first might yield
{"a":
 {"b":1}
}
but this will not work with the second example, it would emit
{"a":
 {"b":1,"b":2}
}
which really means
{"a":
 {"b":2}
}

if you insist on emitting json as an intermediate step you need to
take care of these inconsistencies somehow.
you need to decide which behaviour you want and be explicit about it.
is it desireable that the last entry overrites the previous one? (you
have this now, i doubt this is what you want)
would you like some mergine behaviour? (some config file might work
well with this, or not)
would you like to have every entry be a list? (this is simple, but you
will end up with a lot of junk like {a:[{b:[1]}]}
do you wrap some things in list but not others?

and the conversion from json to CSV has similar issues ofc.

[toc] | [prev] | [next] | [standalone]

#92984

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2015-06-21 23:33 +0000
Message-ID	<mm7hjt$n8l$4@dont-email.me>
In reply to	#92979

On Mon, 22 Jun 2015 00:55:11 +0300, Joonas Liik wrote:

> In xml for instance this is valid:

> <a>
>  <b>1</b>
> </a>
> .. and so is this:
> <a>
>  <b>1</b> <b>2</b>
> </a>

What the OP needs to do is sit down with the XML and work out how it 
needs to be represented in CSV terms, and then code that transformation, 
but it appears that he's not listening.

ie, does your xml:

<a>
  <b>1</b> <b>2</b>
</a>

translate to CSV:

"a","b"   // headers
"b",1     // data row 1
"b",2     // data row 2

or to CSV:

"a", "b", "b"     // headers
"", 1, 2          // data row

or even CSV:

"b"   // headers
1     // data row 1
2     // data row 2

If he can't codify that in a consistent manner across all the XML he 
wishes to process, then he really does need to find someone competent to 
do the job instead of wallowing around in json until the client gives up 
in despair at the lack of progress and finds someone else to do the job.

This should really have been defined by whoever set the task to do the 
conversion. If the job is to convert from some XML DTD to a CSV format, 
then there should be a clear description of what extracts from the XML 
are expected to be in which positions in the CSV.

This is the sort of data conversion code I generally turn out in a day or 
so, it's hardly rocket science as long as you have a clear description of 
what is required. If you don't have a clear description of what is 
required, you have to keep asking questions until you get one.

-- 
Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [next] | [standalone]

#92988

From	Sahlusar <ahlusar.ahluwalia@gmail.com>
Date	2015-06-21 16:56 -0700
Message-ID	<411541bf-7da6-42c1-b332-2761494d29ba@googlegroups.com>
In reply to	#92984

On Sunday, June 21, 2015 at 7:34:47 PM UTC-4, Denis McMahon wrote:
> On Mon, 22 Jun 2015 00:55:11 +0300, Joonas Liik wrote:
> 
> > In xml for instance this is valid:
> 
> > <a>
> >  <b>1</b>
> > </a>
> > .. and so is this:
> > <a>
> >  <b>1</b> <b>2</b>
> > </a>
> 
> What the OP needs to do is sit down with the XML and work out how it 
> needs to be represented in CSV terms, and then code that transformation, 
> but it appears that he's not listening.
> 
> ie, does your xml:
> 
> <a>
>   <b>1</b> <b>2</b>
> </a>
> 
> translate to CSV:
> 
> "a","b"   // headers
> "b",1     // data row 1
> "b",2     // data row 2
> 
> or to CSV:
> 
> "a", "b", "b"     // headers
> "", 1, 2          // data row
> 
> or even CSV:
> 
> "b"   // headers
> 1     // data row 1
> 2     // data row 2
> 
> If he can't codify that in a consistent manner across all the XML he 
> wishes to process, then he really does need to find someone competent to 
> do the job instead of wallowing around in json until the client gives up 
> in despair at the lack of progress and finds someone else to do the job.
> 
> This should really have been defined by whoever set the task to do the 
> conversion. If the job is to convert from some XML DTD to a CSV format, 
> then there should be a clear description of what extracts from the XML 
> are expected to be in which positions in the CSV.
> 
> This is the sort of data conversion code I generally turn out in a day or 
> so, it's hardly rocket science as long as you have a clear description of 
> what is required. If you don't have a clear description of what is 
> required, you have to keep asking questions until you get one.
> 
> -- 
> Denis McMahon, denismfmcmahon@gmail.com



On Sunday, June 21, 2015 at 7:34:47 PM UTC-4, Denis McMahon wrote:
> On Mon, 22 Jun 2015 00:55:11 +0300, Joonas Liik wrote:
> 
> > In xml for instance this is valid:
> 
> > <a>
> >  <b>1</b>
> > </a>
> > .. and so is this:
> > <a>
> >  <b>1</b> <b>2</b>
> > </a>
> 
> What the OP needs to do is sit down with the XML and work out how it 
> needs to be represented in CSV terms, and then code that transformation, 
> but it appears that he's not listening.
> 
> ie, does your xml:
> 
> <a>
>   <b>1</b> <b>2</b>
> </a>
> 
> translate to CSV:
> 
> "a","b"   // headers
> "b",1     // data row 1
> "b",2     // data row 2
> 
> or to CSV:
> 
> "a", "b", "b"     // headers
> "", 1, 2          // data row
> 
> or even CSV:
> 
> "b"   // headers
> 1     // data row 1
> 2     // data row 2
> 
> If he can't codify that in a consistent manner across all the XML he 
> wishes to process, then he really does need to find someone competent to 
> do the job instead of wallowing around in json until the client gives up 
> in despair at the lack of progress and finds someone else to do the job.
> 
> This should really have been defined by whoever set the task to do the 
> conversion. If the job is to convert from some XML DTD to a CSV format, 
> then there should be a clear description of what extracts from the XML 
> are expected to be in which positions in the CSV.
> 
> This is the sort of data conversion code I generally turn out in a day or 
> so, it's hardly rocket science as long as you have a clear description of 
> what is required. If you don't have a clear description of what is 
> required, you have to keep asking questions until you get one.
> 
> -- 
> Denis McMahon, denismfmcmahon@gmail.com



On Sunday, June 21, 2015 at 7:34:47 PM UTC-4, Denis McMahon wrote:
> On Mon, 22 Jun 2015 00:55:11 +0300, Joonas Liik wrote:
> 
> > In xml for instance this is valid:
> 
> > <a>
> >  <b>1</b>
> > </a>
> > .. and so is this:
> > <a>
> >  <b>1</b> <b>2</b>
> > </a>
> 
> What the OP needs to do is sit down with the XML and work out how it 
> needs to be represented in CSV terms, and then code that transformation, 
> but it appears that he's not listening.
> 
> ie, does your xml:
> 
> <a>
>   <b>1</b> <b>2</b>
> </a>
> 
> translate to CSV:
> 
> "a","b"   // headers
> "b",1     // data row 1
> "b",2     // data row 2
> 
> or to CSV:
> 
> "a", "b", "b"     // headers
> "", 1, 2          // data row
> 
> or even CSV:
> 
> "b"   // headers
> 1     // data row 1
> 2     // data row 2
> 
> If he can't codify that in a consistent manner across all the XML he 
> wishes to process, then he really does need to find someone competent to 
> do the job instead of wallowing around in json until the client gives up 
> in despair at the lack of progress and finds someone else to do the job.
> 
> This should really have been defined by whoever set the task to do the 
> conversion. If the job is to convert from some XML DTD to a CSV format, 
> then there should be a clear description of what extracts from the XML 
> are expected to be in which positions in the CSV.
> 
> This is the sort of data conversion code I generally turn out in a day or 
> so, it's hardly rocket science as long as you have a clear description of 
> what is required. If you don't have a clear description of what is 
> required, you have to keep asking questions until you get one.
> 
> -- 
> Denis McMahon, denismfmcmahon@gmail.com


I have gone back to the drawing board and scrapped that idea of an intermediate JSON/dictionary. 


This is the output that I seek:

> "a","b"   // headers
> "b",1     // data row 1
> "b",2     // data row 2
> 

Here is an example XML document that I am working with:

<Response ID="24856-775" RequestType="Moverview">        
        <MonthDayCount>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
            <Int32>0</Int32>
        </MonthDayCount>
        <Warnings />
        <SList />
        <LList />
        <EA>Y</EA>
        <EHA>Y</EHA>
        <EBY>Y</EBY>
        <EOTH>Y</EOTH>
        <EIL>Y</EIL>
        <EM>Y</EM>
        <ED>Y</ED>
        <EQ>Y</EQ>
        <ERS>Y</ERS>
        <ECCS>Y</ECCS>
        <EES>Y</EES>
        <UAS>Y</UAS>
        <PA>False</PA>
        <PL>False</PL>
        <PC>False</PC>
        <PCs>False</PCs>
        <PJ>False</PJ>
        <OITC>0</OITC>
        <MG />
        <R />
        <CCGoods />
    </MO>
</Response>


So far I have gotten this output:

""" my output -->
('Response.RequestType', 'Moverview')
('Response.ID', '24856-775')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.MonthDayCount.Int32.', '0')
('Response.Warnings.', None)
('Response.SList.', None)
('Response.LList.', None)
('Response.EA.', 'Y')
('Response.EHA.', 'Y')
('Response.EBY.', 'Y')
('Response.EOTH.', 'Y')
('Response.EIL.', 'Y')
('Response.EM.', 'Y')
('Response.ED.', 'Y')
('Response.EQ.', 'Y')
('Response.ERS.', 'Y')
('Response.ECCS.', 'Y')
('Response.EES.', 'Y')
('Response.UAS.', 'Y')
('Response.PA.', 'False')
('Response.PL.', 'False')
('Response.PC.', 'False')
('Response.PCs.', 'False')
('Response.PJ.', 'False')
('Response.OITC.', '0')
('Response.MG.', None)
('Response.R.', None)
('Response.CCGoods.', None)
"""

The function that produced this has been redesigned to flatten the XML using a generator:

import xml.etree.cElementTree as ElementTree 
from xml.etree.ElementTree import XMLParser 
import json 
import csv 
import tokenize 
import token 
try: 
    from collections import OrderedDict 
    import json 
except ImportError: 
    from ordereddict import OrderedDict 
    import simplejson as json 
import itertools 
import six 
import string 
#from csvkit import CSVKitWriter 
def flatten_list(aList, prefix=''): 
    for element in aList:
        if element: 
            # treat like dict 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, prefix)
            # treat like list 
            elif element[0].tag == element[1].tag: 
                yield from flatten_list(element, prefix)
        elif element.text: 
            text = element.text.strip() 
            if text: 
                yield prefix, text
def flatten_dict(parent_element, prefix=''):
    prefix = prefix + parent_element.tag + '.'
    if parent_element.items():
        for k, v in parent_element.items():
            yield prefix + k, v
    for element in parent_element:
        eprefix = prefix + element.tag + '.'
        if element:
            # treat like dict - we assume that if the first two tags 
            # in a series are different, then they are all different. 
            if len(element) == 1 or element[0].tag != element[1].tag: 
                yield from flatten_dict(element, prefix=prefix)
            # treat like list - we assume that if the first two tags 
            # in a series are the same, then the rest are the same. 
            else: 
                # here, we put the list in dictionary; the key is the 
                # tag name the list elements all share in common, and 
                # the value is the list itself
                yield from flatten_list(element, prefix=eprefix+element[0].tag+'.')
            # if the tag has attributes, add those to the dict
            if element.items():
                for k, v in element.items():
                    yield eprefix+k, v 
        # this assumes that if you've got an attribute in a tag, 
        # you won't be having any text. This may or may not be a 
        # good idea -- time will tell. It works for the way we are 
        # currently doing XML configuration files... 
        elif element.items(): 
            for k, v in element.items():
                yield eprefix+k, v
        # finally, if there are no child tags and no attributes, extract 
        # the text 
        else:
            yield eprefix, element.text
def main():
    with open('source.xml', 'r', encoding='utf-8') as f: 
        xml_string = f.read() 
    xml_string= xml_string.replace('&#x0;', '') #optional to remove ampersands. 
    root = ElementTree.XML(xml_string) 
    for item in flatten_dict(root):
        print(item)
if __name__ == "__main__":
    main()


I now seek to work on the following. 

1.

(**'Response.RequestType'**, 'Moverview') 
(**'Response.ID', **'24856-775') 

The ** 'sample.text.in_tag ** corresponds with the column headers. The values would be the second item in the tuple.

All of the "Int32" should be stripped away and replaced with an enumeration number corresponding with an index for the header "MonthDayCount" (for example MonthDayCount_1, MonthDayCount_2, MonthDayCount_3, MonthDayCount_4 etc)


The Responses for the rest of the elements (save for the first two above (under item number 1) could be removed). I am still trying to figure how to do that as well...

Again The first and second item in the tuple would be the header and corresponding value, respectively - just like your initial assumption. 

You are welcome to contribute and provide me with feedback. Thank you for your continued feedback and guidance.

[toc] | [prev] | [next] | [standalone]

#93029

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2015-06-23 06:34 +0000
Message-ID	<mmaul7$a7t$1@dont-email.me>
In reply to	#92988

On Sun, 21 Jun 2015 16:56:27 -0700, Sahlusar wrote:

> Here is an example XML document that I am working with:

> You are welcome to contribute and provide me with feedback. Thank you
> for your continued feedback and guidance.

Your XML is invalid! You have a closing MO tag with no opening tag.

> This is the output that I seek:

> > "a","b"   // headers 
> > "b",1     // data row 1 
> > "b",2     // data row 2

I assume from this that where you have a list of identical elements at 
the deepest nesting level of the XML file, you require one CSV record for 
each element in that list, and that you want all higher level element 
values and attributes duplicated in each CSV record.

I assume that you want to use the tag name of each element as the 
identifier for the text content of the element (where there is any), and 
that where an element has no text content, a 0 value is appropriate.

I also assume that as long as the relationship between the headers and 
the data is correct, it does not matter what order the headers are in, ie 
the data and headers can both be sorted by the header.

The output of my code generated once I had corrected your broken XML by 
inserting an 'MO' opening tag between the 'Response' opening tag and the 
'MonthDayCount' opening tag can be seen at:

http://www.sined.co.uk/tmp/xml_to_csv.txt

This was generated from xml file:

http://www.sined.co.uk/tmp/xml_data.txt

If you want the code that produced it, we can discuss fees, it took a few 
hours and for consultancy like this I expect a few 10s of $ per hour.

There may be a generic method to do what you want involving parsing the 
xml to a nested dictionary / list data object, and then flattening that 
object, but I don't see that generating you one line of CSV for each 
Int32 in MonthDayCount.

-- 
Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [next] | [standalone]

#93036

From	Sahlusar <ahlusar.ahluwalia@gmail.com>
Date	2015-06-23 04:32 -0700
Message-ID	<f55f25d7-6643-4163-87e2-029e8877840a@googlegroups.com>
In reply to	#93029

On Tuesday, June 23, 2015 at 2:35:50 AM UTC-4, Denis McMahon wrote:
> On Sun, 21 Jun 2015 16:56:27 -0700, Sahlusar wrote:
> 
> > Here is an example XML document that I am working with:
> 
> > You are welcome to contribute and provide me with feedback. Thank you
> > for your continued feedback and guidance.
> 
> Your XML is invalid! You have a closing MO tag with no opening tag.

Thank you Denis for reiterating this; I am already well aware that my XML document has arbitrarily placed tags.


> 
> > This is the output that I seek:
> 
> > > "a","b"   // headers 
> > > "b",1     // data row 1 
> > > "b",2     // data row 2
> 
> I assume from this that where you have a list of identical elements at 
> the deepest nesting level of the XML file, you require one CSV record for 
> each element in that list, and that you want all higher level element 
> values and attributes duplicated in each CSV record.

Almost, the items in the list should have separate "headers" associated with each tag (and enumerated):

MonthDayCount1, MonthDayCount2, MonthDayCount3, MonthDayCount4 etc....

0                      0              0             0

> 
> I assume that you want to use the tag name of each element as the 
> identifier for the text content of the element (where there is any), and 
> that where an element has no text content, a 0 value is appropriate.

Yes, the tag name is the "key", so to speak. 

> 
> I also assume that as long as the relationship between the headers and 
> the data is correct, it does not matter what order the headers are in, ie 
> the data and headers can both be sorted by the header.
> 
> The output of my code generated once I had corrected your broken XML by 
> inserting an 'MO' opening tag between the 'Response' opening tag and the 
> 'MonthDayCount' opening tag can be seen at:
> 
> http://www.sined.co.uk/tmp/xml_to_csv.txt
> 
> This was generated from xml file:
> 
> http://www.sined.co.uk/tmp/xml_data.txt
> 
> If you want the code that produced it, we can discuss fees, it took a few 
> hours and for consultancy like this I expect a few 10s of $ per hour.


I already have solved this and produced a CSV that meets the clients proposed (yet amorphous requirements). As ChrisA pointed out, programmers should have an intermediate person who could potentially serve as a translator between the client and developer. Unfortunately, this is an ideal that is stifled by labor economics. 

> 
> There may be a generic method to do what you want involving parsing the 
> xml to a nested dictionary / list data object, and then flattening that 
> object, but I don't see that generating you one line of CSV for each 
> Int32 in MonthDayCount.


I have done just that. Thanks again for your help. 

> 
> -- 
> Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [next] | [standalone]

#93022

From	Chris Angelico <rosuav@gmail.com>
Date	2015-06-23 10:43 +1000
Message-ID	<mailman.715.1435020220.13271.python-list@python.org>
In reply to	#92984

On Mon, Jun 22, 2015 at 9:33 AM, Denis McMahon <denismfmcmahon@gmail.com> wrote:
> This is the sort of data conversion code I generally turn out in a day or
> so, it's hardly rocket science as long as you have a clear description of
> what is required. If you don't have a clear description of what is
> required, you have to keep asking questions until you get one.

And this all before you even write a line of code. Getting the specs
down is simply a matter of taking an input file and hand-crafting an
ideal output file. One of my brothers claims he isn't a programmer
(though he's better than he lets on), but when it comes to figuring
iout a coherent spec, he's awesome... if I could put someone like him
between the programmer and the end user, it'd save so much time.

ChrisA

[toc] | [prev] | [next] | [standalone]

#92986

From	Sahlusar <ahlusar.ahluwalia@gmail.com>
Date	2015-06-21 16:42 -0700
Message-ID	<ef21afd1-354b-40ae-b8a4-d5e88c5796e2@googlegroups.com>
In reply to	#92979

On Sunday, June 21, 2015 at 5:56:00 PM UTC-4, Waffle wrote:
> On 21 June 2015 at 17:38, Sahlusar <ahlusar.ahluwalia@gmail.com> wrote:
> >
> > [snip]
> > I do agree with you Denis that this is an unconventional approach. I was wondering then that perhaps I should add additional functionality at the XML to JSON step? So far, with JSON objects without nested lists (as values) I have been successful with this (the following is rather lengthy):
> > [snip]
> 
> 
> > ##JSON sample:
> >
> > data2 = {
> >     "OTF": "0",
> >     "F": "False",
> >     "F": {
> >         "Int32": ["0",
> >         "0",
> >         "0",
> >         "0"]
> >     },
> > [snip]
> >                 "PBDS": {
> >                     "DateTime": ["1/1/0001 12:00:00 AM",
> >                     "1/1/0001 12:00:00 AM",
> >                     "1/1/0001 12:00:00 AM",
> >                     "1/1/0001 12:00:00 AM"]
> >                 },
> >                 "PBDS": {
> >                     "Double": ["0",
> >                     "0",
> >                     "0"]
> >                 },
> >                 "SCS": {
> >                     "String": ["1",
> >                     "2"]
> >                 }
> >             }
> >
> > The result:
> 
> and compare those closely now....
> 
> >
> > {'D_B': ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
> >  'F_Int32': ['0',
> >   '0',
> >   '0',
> >   '0'],
> >  'OTF': '0',
> >  'PBDS_Double': ['0', '0', '0', '0', '0', '0', '0', '0'],
> >  'SCS_String': ['1', '2']}
> >
> Notice in the original text you have 2 entries under the name F and
> later 2 entiries under the name PBDS. in the result you are missing
> the first entry of each.
> you say you have succeeded in generating json, unless you meant to
> throw away huge swafts of data i would say... nope..
> 
> 
> 
> [snip]
> >
> > I know that this is alot of sequential steps. I am wondering if I could insert or conditionally pass these functions when originally parsing the XML, so that the JSON is formatted for more recursive reading of the JSON dictionary and then writing to CSV? I welcome constructive feedback for refactoring....
> 
> theres things you could do to fix up the generated json .. tho really,
> 
> stop generating json when you need to generate csv.
> you are winning nothing. you are losing.. well pretty much .. a little
> of everything .. by doing this
> 
> there are fundemental properties of xml and json you fail to grasp,
> you are touting code claiming that it works when the output it
> produces is horribly deformed :(
> 
> In xml for instance this is valid:
> 
> <a>
>  <b>1</b>
> </a>
> .. and so is this:
> <a>
>  <b>1</b>
>  <b>2</b>
> </a>
> 
> a naive translatio n of the first might yield
> {"a":
>  {"b":1}
> }
> but this will not work with the second example, it would emit
> {"a":
>  {"b":1,"b":2}
> }
> which really means
> {"a":
>  {"b":2}
> }
> 
> if you insist on emitting json as an intermediate step you need to
> take care of these inconsistencies somehow.
> you need to decide which behaviour you want and be explicit about it.
> is it desireable that the last entry overrites the previous one? (you
> have this now, i doubt this is what you want)
> would you like some mergine behaviour? (some config file might work
> well with this, or not)
> would you like to have every entry be a list? (this is simple, but you
> will end up with a lot of junk like {a:[{b:[1]}]}
> do you wrap some things in list but not others?
> 
> and the conversion from json to CSV has similar issues ofc.


I am rethinking my strategy and going back to the drawing board. The data is malformed. I am quite familiar with JSON and XML. I am trying to work within the parameters given to me. I will rethink this. Thank you for giving me a reality check with a splash of water. Sometimes it is hard to get one's head out of the ground....especially when one is new to enterprise and commercial development.

[toc] | [prev] | [standalone]

csiph-web

JSON Object to CSV File Troubleshooting

Contents

#92868 — JSON Object to CSV File Troubleshooting

#92870

#92932

#92933

#92960

#92962

#92976

#92979

#92984

#92988

#93029

#93036

#93022

#92986