Help parsing a text file

Started by	William Gill <noreply@domain.invalid>
First post	2011-08-29 14:21 -0400
Last post	2011-09-01 14:38 -0400
Articles	8 — 7 participants

  Help parsing a text file William Gill <noreply@domain.invalid> - 2011-08-29 14:21 -0400
    Re: Help parsing a text file Philip Semanchuk <philip@semanchuk.com> - 2011-08-29 14:31 -0400
      Re: Help parsing a text file William Gill <nospam@domain.invalid> - 2011-08-29 14:56 -0400
    Re: Help parsing a text file Thomas Jollans <t@jollybox.de> - 2011-08-29 23:05 +0200
      Re: Help parsing a text file "Waldek M." <wm@localhost.localdomain> - 2011-08-30 13:50 +0200
    Re: Help parsing a text file Tim Roberts <timr@probo.com> - 2011-08-30 22:37 -0700
    Re: Help parsing a text file JT <james.thornton@gmail.com> - 2011-09-01 10:58 -0700
      Re: Help parsing a text file William Gill <nospam@domain.invalid> - 2011-09-01 14:38 -0400

#12414 — Help parsing a text file

From	William Gill <noreply@domain.invalid>
Date	2011-08-29 14:21 -0400
Subject	Help parsing a text file
Message-ID	<j3glai$1mu$1@dont-email.me>

I haven't done much with Python for a couple of years, bouncing around 
between other languages and scripts as needs suggest, so I have some 
minor difficulty keeping Python functionality in my 
head, but I can overcome that as the cobwebs clear.  Though I do seem to 
keep tripping over the same Py2 -> Py3 syntax changes (old habits die hard).

I have a text file with XML-like records that I need to parse.  By 
XML-like I mean records have proper opening and closing tags, but fields 
don't have closing tags (they rely on line ends).  Not all fields appear 
in all records, but they do adhere to a defined sequence.

My initial passes into Python have been very unfocused (a scatter gun of 
too many possible directions, yielding very messy results), so I'm 
asking for some suggestions, or algorithms (possibly even examples) that 
may help me focus.

I'm not asking anyone to write my code, just to nudge me toward a more 
disciplined approach to a common task, and I promise to put in the 
effort to understand the underlying fundamentals.

#12416

From	Philip Semanchuk <philip@semanchuk.com>
Date	2011-08-29 14:31 -0400
Message-ID	<mailman.552.1314642679.27778.python-list@python.org>
In reply to	#12414

On Aug 29, 2011, at 2:21 PM, William Gill wrote:

> I haven't done much with Python for a couple years, bouncing around between other languages and scripts as needs suggest, so I have some minor difficulty keeping Python functionality Python functionality in my head, but I can overcome that as the cobwebs clear.  Though I do seem to keep tripping over the same Py2 -> Py3 syntax changes (old habits die hard).
> 
> I have a text file with XML like records that I need to parse.  By XML like I mean records have proper opening and closing tags. but fields don't have closing tags (they rely on line ends).  Not all fields appear in all records, but they do adhere to a defined sequence.
> 
> My initial passes into Python have been very unfocused (a scatter gun of too many possible directions, yielding very messy results), so I'm asking for some suggestions, or algorithms (possibly even examples)that may help me focus.
> 
> I'm not asking anyone to write my code, just to nudge me toward a more disciplined approach to a common task, and I promise to put in the effort to understand the underlying fundamentals.

If the syntax really is close to XML, would it be all that difficult to convert it to proper XML? Then you have nice libraries like ElementTree to use for parsing.


Cheers
Philip

#12418

From	William Gill <nospam@domain.invalid>
Date	2011-08-29 14:56 -0400
Message-ID	<j3gnc0$5t1$1@dont-email.me>
In reply to	#12416

On 8/29/2011 2:31 PM, Philip Semanchuk wrote:
>
> If the syntax really is close to XML, would it be all that difficult to convert it to proper XML? Then you have nice libraries like ElementTree to use for parsing.
>

Possibly, but I would still need the same search algorithms to find the 
opening tag for the field, then find and replace the next line end with 
a matching closing tag.  So it seems to me that the starting point is 
the same, and then it's my choice to either process the substrings 
myself or employ something like ElementTree.
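
A minimal sketch of the conversion discussed in this exchange, in case it helps to see how small it can be. The thread never shows the actual file, so the layout here is an assumption: one field per line in the form `<tag>value`, wrapped in `<record>` ... `</record>` lines. The tag names and the helper names `to_xml` and `read_records` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed layout (not shown in the thread): one "<tag>value" field per line.
FIELD_LINE = re.compile(r"^\s*<(\w+)>(.*)$")

SAMPLE = """\
<record>
<name>Alice
<age>30
</record>
<record>
<name>Bob & Co
</record>
"""


def to_xml(text, record_tag="record"):
    """Close every line-terminated field and wrap the result in one root element."""
    out = ["<records>"]
    for line in text.splitlines():
        m = FIELD_LINE.match(line)
        if m and m.group(1) != record_tag:
            name, value = m.groups()
            # Escape &, < and > so the field value is legal XML text.
            out.append("<{0}>{1}</{0}>".format(name, escape(value.strip())))
        else:
            # Record open/close tags and blank lines pass through unchanged.
            out.append(line)
    out.append("</records>")
    return "\n".join(out)


def read_records(text, record_tag="record"):
    """Parse the converted text with ElementTree; return one dict per record."""
    root = ET.fromstring(to_xml(text, record_tag))
    return [
        {child.tag: (child.text or "") for child in rec}
        for rec in root.findall(record_tag)
    ]
```

Calling `read_records(SAMPLE)` gives one dictionary per record, with only the fields that were present. If fields can span several lines, or tags carry attributes, the regular expression above would need to change.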

#12420

From	Thomas Jollans <t@jollybox.de>
Date	2011-08-29 23:05 +0200
Message-ID	<mailman.556.1314651909.27778.python-list@python.org>
In reply to	#12414

On 29/08/11 20:21, William Gill wrote:
> I haven't done much with Python for a couple years, bouncing around
> between other languages and scripts as needs suggest, so I have some
> minor difficulty keeping Python functionality Python functionality in my
> head, but I can overcome that as the cobwebs clear.  Though I do seem to
> keep tripping over the same Py2 -> Py3 syntax changes (old habits die
> hard).
> 
> I have a text file with XML like records that I need to parse.  By XML
> like I mean records have proper opening and closing tags. but fields
> don't have closing tags (they rely on line ends).  Not all fields appear
> in all records, but they do adhere to a defined sequence.
> 
> My initial passes into Python have been very unfocused (a scatter gun of
> too many possible directions, yielding very messy results), so I'm
> asking for some suggestions, or algorithms (possibly even examples)that
> may help me focus.
> 
> I'm not asking anyone to write my code, just to nudge me toward a more
> disciplined approach to a common task, and I promise to put in the
> effort to understand the underlying fundamentals.

A name that is often thrown around on this list for this kind of
question is pyparsing. Now, I don't know anything about it myself, but
it may be worth looking into.

Otherwise, if you say it's similar to XML, you might want to take a cue
from XML processing when it comes to dealing with the file. You could
emulate the stream-based approach taken by SAX or eXpat - have methods
that handle the different events that can occur - for XML this is "start
tag", "end tag", "text node", "processing instruction", etc., in your
case, it might be "start/end record", "field data", etc. That way, you
could separate the code that keeps track of the current record, and how
the data fits together to make an object structure, and the parsing
code, that knows how to convert a line of data into something meaningful.

Thomas
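
The event-based split described above could look roughly like the following sketch. It is only an illustration under assumptions: the file format is not shown in the thread, so the one-field-per-line layout, the `record` tag, and the names `events`, `RecordBuilder` and `parse` are all hypothetical. The generator knows how to read a line; the handler class knows how events combine into records.

```python
import re

START = re.compile(r"^<(\w+)>(.*)$")   # "<tag>" optionally followed by a value
END = re.compile(r"^</(\w+)>$")        # "</tag>" on a line of its own

SAMPLE = """\
<record>
<name>Alice
<age>30
</record>
<record>
<name>Bob
</record>
"""


def events(lines, record_tag="record"):
    """Parsing side: turn each non-blank line into an (event, name, value) tuple."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = END.match(line)
        if m:
            yield ("end_record", m.group(1), None)
            continue
        m = START.match(line)
        if not m:
            raise ValueError("unrecognised line: %r" % line)
        name, value = m.groups()
        if name == record_tag:
            yield ("start_record", name, None)
        else:
            yield ("field", name, value.strip())


class RecordBuilder:
    """Structure side: one method per event, as a SAX handler would have."""

    def __init__(self):
        self.records = []
        self._current = None

    def start_record(self, name, value):
        self._current = {}

    def field(self, name, value):
        self._current[name] = value

    def end_record(self, name, value):
        self.records.append(self._current)
        self._current = None


def parse(lines, handler):
    """Dispatch each event to the handler method of the same name."""
    for event, name, value in events(lines):
        getattr(handler, event)(name, value)
    return handler
```

A call such as `parse(SAMPLE.splitlines(), RecordBuilder()).records` returns the list of record dictionaries. Swapping in a different handler (for example, one that writes rows to a database) leaves the line-reading code untouched, which is the point of the separation.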

#12436

From	"Waldek M." <wm@localhost.localdomain>
Date	2011-08-30 13:50 +0200
Message-ID	<1mpbiyq718zub.dlg@localhost.localdomain>
In reply to	#12420

On Mon, 29 Aug 2011 23:05:23 +0200, Thomas Jollans wrote:
> A name that is often thrown around on this list for this kind of
> question is pyparsing. Now, I don't know anything about it myself, but
> it may be worth looking into.

Definitely. I have used it, and even though it's not perfect, it's very
useful indeed. Due to its nature it is not a demon of speed when parsing
complex and big structures, so you might want to keep that in mind.
But I whole-heartedly recommend it.

Br.
Waldek

#12461

From	Tim Roberts <timr@probo.com>
Date	2011-08-30 22:37 -0700
Message-ID	<a0ir57hq08nfegtsbdma5rnom0pg6mignv@4ax.com>
In reply to	#12414

William Gill <noreply@domain.invalid> wrote:
>
>My initial passes into Python have been very unfocused (a scatter gun of 
>too many possible directions, yielding very messy results), so I'm 
>asking for some suggestions, or algorithms (possibly even examples)that 
>may help me focus.
>
>I'm not asking anyone to write my code, just to nudge me toward a more 
>disciplined approach to a common task, and I promise to put in the 
>effort to understand the underlying fundamentals.

Python 2 includes "sgmllib", which was designed to parse SGML-based files,
including both neat XML and slimy HTML, and "htmllib", which derives from
it.  I have used "htmllib" to parse HTML files where the tags were not
properly closed.  Perhaps you could start from "htmllib" and modify it to
handle the quirks in your particular format.
-- 
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
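
Note that "sgmllib" and "htmllib" exist only in Python 2; both were removed in Python 3, where the nearest standard-library counterpart is `html.parser.HTMLParser`. The sketch below shows the same idea with that class: it tolerates unclosed tags because it only reports tags and text as it meets them. The record layout, the `record` tag, and the names `RecordParser` and `parse_records` are assumptions made for illustration, since the thread does not show the real format.

```python
from html.parser import HTMLParser

SAMPLE = """\
<record>
<name>Alice
<age>30
</record>
<record>
<name>Bob
</record>
"""


class RecordParser(HTMLParser):
    """Collect records whose fields have opening tags but no closing tags."""

    def __init__(self, record_tag="record"):
        super().__init__()
        self.record_tag = record_tag   # hypothetical record tag name
        self.records = []
        self._current = None           # dict for the record being built
        self._field = None             # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag == self.record_tag:
            self._current = {}
            self._field = None
        elif self._current is not None:
            self._field = tag
            self._current[tag] = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._current is not None and self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag == self.record_tag and self._current is not None:
            self.records.append(
                {name: value.strip() for name, value in self._current.items()}
            )
            self._current = None
            self._field = None


def parse_records(text):
    parser = RecordParser()
    parser.feed(text)
    parser.close()   # flush any buffered text
    return parser.records
```

`parse_records(SAMPLE)` yields one dictionary per record. Field names that collide with HTML elements given special treatment by the parser (such as `script` or `style`) would not behave this way, so this approach only fits if the real tag names avoid them.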

#12574

From	JT <james.thornton@gmail.com>
Date	2011-09-01 10:58 -0700
Message-ID	<015c1b3a-947f-4e11-8b1b-6f4ae52c31fd@glegroupsg2000goo.googlegroups.com>
In reply to	#12414

On Monday, August 29, 2011 1:21:48 PM UTC-5, William Gill wrote:
> 
> I have a text file with XML like records that I need to parse.  By XML 
> like I mean records have proper opening and closing tags. but fields 
> don't have closing tags (they rely on line ends).  Not all fields appear 
> in all records, but they do adhere to a defined sequence.

lxml can parse XML and broken HTML (see http://lxml.de/parsing.html).

- James

-- 
Bulbflow: A Python framework for graph databases (http://bulbflow.com)

#12575

From	William Gill <nospam@domain.invalid>
Date	2011-09-01 14:38 -0400
Message-ID	<j3ojeg$164$1@dont-email.me>
In reply to	#12574

On 9/1/2011 1:58 PM, JT wrote:
> On Monday, August 29, 2011 1:21:48 PM UTC-5, William Gill wrote:
>>
>> I have a text file with XML like records that I need to parse.  By XML
>> like I mean records have proper opening and closing tags. but fields
>> don't have closing tags (they rely on line ends).  Not all fields appear
>> in all records, but they do adhere to a defined sequence.
>
> lxml can parse XML and broken HTML (see http://lxml.de/parsing.html).
>
> - James
>
Thanks to everyone.

Though I didn't get what I expected, it made me think more about the 
reason I need to parse these files to begin with.  So I'm going to do 
some more homework on the overall business application and work backward 
from there. Once I know how the data fits in the scheme of things, I 
will create an appropriate abstraction layer, either from scratch, or 
using one of the existing parsers mentioned, but I won't really know 
that until I have finished modeling.
