
Parsing adhoc structured data

Started by	Paul Moore <p.f.moore@gmail.com>
First post	2015-10-27 07:32 -0700
Last post	2015-10-29 06:58 -0400
Articles	2 — 2 participants


  Parsing adhoc structured data Paul  Moore <p.f.moore@gmail.com> - 2015-10-27 07:32 -0700
    Re: Parsing adhoc structured data "TommyVee" <xxxxxxxx@xxxxxx.xxx> - 2015-10-29 06:58 -0400

#97970 — Parsing adhoc structured data

From	Paul Moore <p.f.moore@gmail.com>
Date	2015-10-27 07:32 -0700
Subject	Parsing adhoc structured data
Message-ID	<6449295d-95d0-4952-9567-c1c43d089aba@googlegroups.com>

I do a lot of data analysis with Python, and most of the time I have to work with structured data, but where the structure is typically undocumented, and/or unclear. So a lot of my time is spent trying to discern a structure from the data, so that I can work with it.

For example, a recent file I needed to parse looked something like this:

# Schema version - will update if the format changes
schema: 1

group: IT
    type: Developer
        installation:
            package: Python
            version: 3.4
        installation:
            package: Vim
        installation:
            package: Java
            version: 8
    type: Manager
        ...

So it's key: value, with nested blocks represented by indentation (but I can't be sure each indent is 4 spaces, at least not without scanning all the data and hoping new files don't break that rule). Some keys can have no value, and/or can be repeated multiple times.

It's got comments and blank lines (easy to handle) and a schema version, which would be nice if I could get a spec for the schema (which, of course, I can't, but at least I can see if my assumptions may no longer be valid...)

All I want to do in the first instance is to (relatively quickly) load the data into some sort of Python data structure, and then start analyzing it. For example, check the "installation" items to see what keys they contain, which keys are always present and which are optional, whether any are numeric, etc. Once I have that information, I can start putting the data into a more structured or manageable form (possibly a database, possibly simply a Python data structure that I'll serialise to disk for later analysis).
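
As a rough illustration of that kind of first-pass check, here is a small sketch. It assumes the "installation" blocks have already been loaded as a list of plain dicts with string values; the `installations` sample and the helper names `is_number` and `profile_keys` are made up for this example.

```python
from collections import Counter

# Made-up records mirroring the "installation" blocks in the example above.
installations = [
    {"package": "Python", "version": "3.4"},
    {"package": "Vim"},
    {"package": "Java", "version": "8"},
]


def is_number(text):
    """Return True if text parses as a float (e.g. '3.4' or '8')."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def profile_keys(records):
    """Per key: how many records have it, and whether every value looks numeric."""
    counts = Counter(key for rec in records for key in rec)
    summary = {}
    for key, count in counts.items():
        values = [rec[key] for rec in records if key in rec]
        summary[key] = {
            "present_in": count,
            "always_present": count == len(records),
            "all_numeric": all(is_number(v) for v in values),
        }
    return summary


summary = profile_keys(installations)
for key, info in summary.items():
    print(key, info)
```

"Looks numeric" is only a hint here: a version such as 3.10 would pass the float test even though it is not really a number, so a result like this still needs a human look.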

Every time I encounter data like this I find myself writing large amounts of either adhoc parser code (which usually takes me a couple of goes to get right :-() or way too much error handling checking for cases that end up never happening in practice.

This feels to me like a natural type of activity to use Python for (it's essentially a "data cleansing" type of activity) but I've had no luck finding libraries or tools that can make it easy.

Does anyone know of any good libraries for this kind of "adhoc parsing"? Or any websites, articles or books discussing this sort of thing? I'm doing this as a supporting activity to my "real" job, which is usually the data analysis (that I can't do till I get the data loaded!), so I'm no expert on the best techniques for something like this.

Thanks,
Paul


#97996

From	"TommyVee" <xxxxxxxx@xxxxxx.xxx>
Date	2015-10-29 06:58 -0400
Message-ID	<pZmYx.18502$I83.12020@fx04.iad>
In reply to	#97970

"Paul Moore"  wrote in message 
news:6449295d-95d0-4952-9567-c1c43d089aba@googlegroups.com...

I do a lot of data analysis with Python, and most of the time I have to work 
with structured data, but where the structure is typically undocumented, 
and/or unclear. So a lot of my time is spent trying to discern a structure 
from the data, so that I can work with it.

For example, a recent file I needed to parse looked something like this:

    # Schema version - will update if the format changes
    schema: 1

    group: IT
        type: Developer
            installation:
                package: Python
                version: 3.4
            installation:
                package: Vim
            installation:
                package: Java
                version: 8
        type: Manager
            ...

So it's key: value, with nested blocks represented by indentation (but I 
can't be sure each indent is 4 spaces, at least not without scanning all the 
data and hoping new files don't break that rule). Some keys can have no 
value, and/or can be repeated multiple times.

It's got comments and blank lines (easy to handle) and a schema version, 
which would be nice if I could get a spec for the schema (which, of course, 
I can't, but at least I can see if my assumptions may no longer be valid...)

All I want to do in the first instance is to (relatively quickly) load the 
data into some sort of Python data structure, and then start analyzing it. 
For example, check the "installation" items to see what keys they contain, 
which keys are always present and which are optional, whether any are 
numeric, etc. Once I have that information, I can start putting the data 
into a more structured or manageable form (possibly a database, possibly 
simply a Python data structure that I'll serialise to disk for later 
analysis).

Every time I encounter data like this I find myself writing large amounts of 
either adhoc parser code (which usually takes me a couple of goes to get 
right :-() or way too much error handling checking for cases that end up 
never happening in practice.

This feels to me like a natural type of activity to use Python for (it's 
essentially a "data cleansing" type of activity) but I've had no luck 
finding libraries or tools that can make it easy.

Does anyone know of any good libraries for this type of "adhoc parsing" type 
of activity? Or any websites, articles or books discussing this sort of 
thing? I'm doing this as a sort of supporting activity to my "real" job, 
which is usually the data analysis (that I can't do till I get the data 
loaded!) so I'm not a particular expert on the best techniques for something 
like this.

Thanks,
Paul

I'm not aware of any package that does this, but at first blush I would use 
a recursive algorithm to build a dictionary (of nested dictionaries).  It 
looks like each section is a "label: value" line followed by an embedded 
list of subsections, something like this:

        section := label ":" [value] {section}

When you see a greater indent, recurse downward; when you see a lesser 
indent, pop back up.
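
A minimal sketch of that idea in Python, with a few assumptions that are not from the thread: it uses an explicit stack of (indent, node) pairs rather than literal recursion, it keeps children in a list instead of a plain dictionary because keys such as "installation" repeat, and the names `parse_indented` and `SAMPLE` are made up for illustration. It only compares indent depths, so it does not assume 4 spaces per level.

```python
def parse_indented(text):
    """Parse 'key: value' lines nested by indentation into a tree.

    Each node is a dict: {"key": str, "value": str or None, "children": [nodes]}.
    """
    root = {"key": None, "value": None, "children": []}
    # Stack of (indent, node); the root sits at indent -1 so every line nests under it.
    stack = [(-1, root)]
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.expandtabs()
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        indent = len(line) - len(line.lstrip())
        key, sep, value = stripped.partition(":")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'key: value', got {raw!r}")
        node = {"key": key.strip(), "value": value.strip() or None, "children": []}
        # Equal or lesser indent: pop back up to the nearest shallower ancestor.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1]["children"].append(node)
        # Greater indent on the next line will nest under this node.
        stack.append((indent, node))
    return root["children"]


SAMPLE = """\
# Schema version - will update if the format changes
schema: 1

group: IT
    type: Developer
        installation:
            package: Python
            version: 3.4
        installation:
            package: Vim
        installation:
            package: Java
            version: 8
    type: Manager
"""

tree = parse_indented(SAMPLE)
developer = tree[1]["children"][0]
for inst in developer["children"]:
    print({child["key"]: child["value"] for child in inst["children"]})
```

A dedent to a depth that was never seen before is silently attached to the nearest shallower ancestor; a stricter version could raise an error there instead. The trailing "..." from the original sample is left out because it is not a "key: value" line.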

