| Started by | William Gill <noreply@domain.invalid> |
|---|---|
| First post | 2011-08-29 14:21 -0400 |
| Last post | 2011-09-01 14:38 -0400 |
| Articles | 8 — 7 participants |
Help parsing a text file William Gill <noreply@domain.invalid> - 2011-08-29 14:21 -0400
Re: Help parsing a text file Philip Semanchuk <philip@semanchuk.com> - 2011-08-29 14:31 -0400
Re: Help parsing a text file William Gill <nospam@domain.invalid> - 2011-08-29 14:56 -0400
Re: Help parsing a text file Thomas Jollans <t@jollybox.de> - 2011-08-29 23:05 +0200
Re: Help parsing a text file "Waldek M." <wm@localhost.localdomain> - 2011-08-30 13:50 +0200
Re: Help parsing a text file Tim Roberts <timr@probo.com> - 2011-08-30 22:37 -0700
Re: Help parsing a text file JT <james.thornton@gmail.com> - 2011-09-01 10:58 -0700
Re: Help parsing a text file William Gill <nospam@domain.invalid> - 2011-09-01 14:38 -0400
| From | William Gill <noreply@domain.invalid> |
|---|---|
| Date | 2011-08-29 14:21 -0400 |
| Subject | Help parsing a text file |
| Message-ID | <j3glai$1mu$1@dont-email.me> |
I haven't done much with Python for a couple years, bouncing around between other languages and scripts as needs suggest, so I have some minor difficulty keeping Python functionality in my head, but I can overcome that as the cobwebs clear. Though I do seem to keep tripping over the same Py2 -> Py3 syntax changes (old habits die hard).

I have a text file with XML-like records that I need to parse. By XML-like I mean records have proper opening and closing tags, but fields don't have closing tags (they rely on line ends). Not all fields appear in all records, but they do adhere to a defined sequence.

My initial passes into Python have been very unfocused (a scatter gun of too many possible directions, yielding very messy results), so I'm asking for some suggestions, or algorithms (possibly even examples) that may help me focus.

I'm not asking anyone to write my code, just to nudge me toward a more disciplined approach to a common task, and I promise to put in the effort to understand the underlying fundamentals.
| From | Philip Semanchuk <philip@semanchuk.com> |
|---|---|
| Date | 2011-08-29 14:31 -0400 |
| Message-ID | <mailman.552.1314642679.27778.python-list@python.org> |
| In reply to | #12414 |
On Aug 29, 2011, at 2:21 PM, William Gill wrote:

> I haven't done much with Python for a couple years, bouncing around between other languages and scripts as needs suggest, so I have some minor difficulty keeping Python functionality in my head, but I can overcome that as the cobwebs clear. Though I do seem to keep tripping over the same Py2 -> Py3 syntax changes (old habits die hard).
>
> I have a text file with XML-like records that I need to parse. By XML-like I mean records have proper opening and closing tags, but fields don't have closing tags (they rely on line ends). Not all fields appear in all records, but they do adhere to a defined sequence.
>
> My initial passes into Python have been very unfocused (a scatter gun of too many possible directions, yielding very messy results), so I'm asking for some suggestions, or algorithms (possibly even examples) that may help me focus.
>
> I'm not asking anyone to write my code, just to nudge me toward a more disciplined approach to a common task, and I promise to put in the effort to understand the underlying fundamentals.

If the syntax really is close to XML, would it be all that difficult to convert it to proper XML? Then you have nice libraries like ElementTree to use for parsing.

Cheers
Philip
| From | William Gill <nospam@domain.invalid> |
|---|---|
| Date | 2011-08-29 14:56 -0400 |
| Message-ID | <j3gnc0$5t1$1@dont-email.me> |
| In reply to | #12416 |
On 8/29/2011 2:31 PM, Philip Semanchuk wrote:

> If the syntax really is close to XML, would it be all that difficult to convert it to proper XML? Then you have nice libraries like ElementTree to use for parsing.

Possibly, but I would still need the same search algorithms to find the opening tag for the field, then find and replace the next line end with a matching closing tag. So it seems to me that the starting point is the same, and then it's my choice to either process the substrings myself or employ something like ElementTree.
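The two steps discussed here (close each field tag at the line end, then hand the result to ElementTree) can be sketched roughly as follows. The thread never shows the real file, so the sample data, the `record` tag name, and the helper `to_xml` are all assumptions made up for illustration; the sketch also assumes every field sits on a single line and uses tag names that are plain words.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample: tag names and layout are assumptions, not the
# poster's actual data.
SAMPLE = """\
<record>
<name>Alice
<city>Boston
</record>
<record>
<name>Bob
</record>
"""

# A field line: an opening tag followed by the value up to the line end.
FIELD = re.compile(r"^<(\w+)>(.*)$")


def to_xml(text, record_tag="record"):
    """Add the missing closing tags and wrap all records in one root."""
    out = ["<root>"]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = FIELD.match(line)
        if m and m.group(1) != record_tag:
            tag, value = m.groups()
            # Escape &, < and > so the value is legal XML text.
            out.append("<{0}>{1}</{0}>".format(tag, escape(value)))
        else:
            # Record open/close tags are already well formed.
            out.append(line)
    out.append("</root>")
    return "\n".join(out)


root = ET.fromstring(to_xml(SAMPLE))
# Each child of <root> is a record; each child of a record is a field.
records = [{field.tag: field.text for field in rec} for rec in root]
print(records)
```

The conversion is a single pass over the lines, so the "search" part reduces to one regular expression per line; everything after that is ordinary ElementTree usage.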
| From | Thomas Jollans <t@jollybox.de> |
|---|---|
| Date | 2011-08-29 23:05 +0200 |
| Message-ID | <mailman.556.1314651909.27778.python-list@python.org> |
| In reply to | #12414 |
On 29/08/11 20:21, William Gill wrote:

> I haven't done much with Python for a couple years, bouncing around
> between other languages and scripts as needs suggest, so I have some
> minor difficulty keeping Python functionality in my
> head, but I can overcome that as the cobwebs clear. Though I do seem to
> keep tripping over the same Py2 -> Py3 syntax changes (old habits die
> hard).
>
> I have a text file with XML-like records that I need to parse. By XML-
> like I mean records have proper opening and closing tags, but fields
> don't have closing tags (they rely on line ends). Not all fields appear
> in all records, but they do adhere to a defined sequence.
>
> My initial passes into Python have been very unfocused (a scatter gun of
> too many possible directions, yielding very messy results), so I'm
> asking for some suggestions, or algorithms (possibly even examples) that
> may help me focus.
>
> I'm not asking anyone to write my code, just to nudge me toward a more
> disciplined approach to a common task, and I promise to put in the
> effort to understand the underlying fundamentals.

A name that is often thrown around on this list for this kind of question is pyparsing. Now, I don't know anything about it myself, but it may be worth looking into.

Otherwise, if you say it's similar to XML, you might want to take a cue from XML processing when it comes to dealing with the file. You could emulate the stream-based approach taken by SAX or eXpat: have methods that handle the different events that can occur. For XML these are "start tag", "end tag", "text node", "processing instruction", etc.; in your case, they might be "start/end record", "field data", etc.

That way, you could separate the code that keeps track of the current record and how the data fits together to make an object structure from the parsing code, which knows how to convert a line of data into something meaningful.

Thomas
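A minimal sketch of that event-based split might look like the following. Nothing here comes from the thread itself: the `record` tag name, the sample lines, the `RecordCollector` class and the `parse` function are hypothetical names chosen for illustration, and the sketch assumes one field per line.

```python
import re

# A field line: opening tag, then the value up to the end of the line.
FIELD = re.compile(r"^<(\w+)>(.*)$")


class RecordCollector:
    """Handler side: knows how the events fit together into records."""

    def __init__(self):
        self.records = []
        self._current = None

    def start_record(self):
        self._current = {}

    def field(self, name, value):
        if self._current is None:
            raise ValueError("field %r outside of a record" % name)
        self._current[name] = value

    def end_record(self):
        if self._current is None:
            raise ValueError("end of record without a start")
        self.records.append(self._current)
        self._current = None


def parse(lines, handler, record_tag="record"):
    """Parser side: turns each line into one event and nothing more."""
    start = "<%s>" % record_tag
    end = "</%s>" % record_tag
    for number, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line:
            continue
        if line == start:
            handler.start_record()
        elif line == end:
            handler.end_record()
        else:
            m = FIELD.match(line)
            if m is None:
                raise ValueError("line %d not understood: %r" % (number, line))
            handler.field(m.group(1), m.group(2))


# Hypothetical sample data.
sample = """\
<record>
<name>Alice
<city>Boston
</record>
<record>
<name>Bob
</record>
"""

collector = RecordCollector()
parse(sample.splitlines(), collector)
print(collector.records)
```

Because `parse` only needs an object with `start_record`, `field` and `end_record` methods, a different handler (one that validates the defined field sequence, say, or writes rows to a database) could be swapped in without touching the line-level code.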
| From | "Waldek M." <wm@localhost.localdomain> |
|---|---|
| Date | 2011-08-30 13:50 +0200 |
| Message-ID | <1mpbiyq718zub.dlg@localhost.localdomain> |
| In reply to | #12420 |
On Mon, 29 Aug 2011 23:05:23 +0200, Thomas Jollans wrote:

> A name that is often thrown around on this list for this kind of
> question is pyparsing. Now, I don't know anything about it myself, but
> it may be worth looking into.

Definitely. I did use it, and even though it's not perfect, it's very useful indeed. Due to its nature it is not a demon of speed when parsing complex and big structures, so you might want to keep that in mind. But I whole-heartedly recommend it.

Br.
Waldek
| From | Tim Roberts <timr@probo.com> |
|---|---|
| Date | 2011-08-30 22:37 -0700 |
| Message-ID | <a0ir57hq08nfegtsbdma5rnom0pg6mignv@4ax.com> |
| In reply to | #12414 |
William Gill <noreply@domain.invalid> wrote:

>My initial passes into Python have been very unfocused (a scatter gun of
>too many possible directions, yielding very messy results), so I'm
>asking for some suggestions, or algorithms (possibly even examples) that
>may help me focus.
>
>I'm not asking anyone to write my code, just to nudge me toward a more
>disciplined approach to a common task, and I promise to put in the
>effort to understand the underlying fundamentals.

Python includes "sgmllib", which was designed to parse SGML-based files, including both neat XML and slimy HTML, and "htmllib", which derives from it. I have used "htmllib" to parse HTML files where the tags were not properly closed. Perhaps you could start from "htmllib" and modify it to handle the quirks in your particular format.

-- 
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
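Since the original question mentions Python 3, it may help to know that "sgmllib" and "htmllib" exist only in Python 2; the closest standard-library relative in Python 3 is `html.parser.HTMLParser`, which likewise tolerates tags that are never closed. The sketch below is one possible way to lean on it, not a tested recipe for the poster's file: the `record` tag, the sample data and the `LooseRecordParser` class are assumptions for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample: tag names and layout are assumptions.
SAMPLE = """\
<record>
<name>Alice
<city>Boston
</record>
<record>
<name>Bob
</record>
"""


class LooseRecordParser(HTMLParser):
    """Collect records whose field tags are never closed."""

    def __init__(self, record_tag="record"):
        super().__init__()
        self.record_tag = record_tag
        self.records = []
        self._record = None   # dict for the record being read, if any
        self._field = None    # name of the field being read, if any
        self._text = []       # text chunks seen since the field tag

    def _flush_field(self):
        # A field ends when the next tag of any kind shows up.
        if self._record is not None and self._field is not None:
            self._record[self._field] = "".join(self._text).strip()
        self._field = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        self._flush_field()
        if tag == self.record_tag:
            self._record = {}
        elif self._record is not None:
            self._field = tag

    def handle_data(self, data):
        if self._field is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        self._flush_field()
        if tag == self.record_tag and self._record is not None:
            self.records.append(self._record)
            self._record = None


parser = LooseRecordParser()
parser.feed(SAMPLE)
parser.close()
print(parser.records)
```

Two caveats with this approach: `HTMLParser` lowercases tag names, and it gives special treatment to a few real HTML element names such as `script` and `style`, so field names that collide with those would need a different strategy.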
| From | JT <james.thornton@gmail.com> |
|---|---|
| Date | 2011-09-01 10:58 -0700 |
| Message-ID | <015c1b3a-947f-4e11-8b1b-6f4ae52c31fd@glegroupsg2000goo.googlegroups.com> |
| In reply to | #12414 |
On Monday, August 29, 2011 1:21:48 PM UTC-5, William Gill wrote:

> I have a text file with XML-like records that I need to parse. By XML-
> like I mean records have proper opening and closing tags, but fields
> don't have closing tags (they rely on line ends). Not all fields appear
> in all records, but they do adhere to a defined sequence.

lxml can parse XML and broken HTML (see http://lxml.de/parsing.html).

- James

-- 
Bulbflow: A Python framework for graph databases (http://bulbflow.com)
| From | William Gill <nospam@domain.invalid> |
|---|---|
| Date | 2011-09-01 14:38 -0400 |
| Message-ID | <j3ojeg$164$1@dont-email.me> |
| In reply to | #12574 |
On 9/1/2011 1:58 PM, JT wrote:

> On Monday, August 29, 2011 1:21:48 PM UTC-5, William Gill wrote:
>>
>> I have a text file with XML-like records that I need to parse. By XML-
>> like I mean records have proper opening and closing tags, but fields
>> don't have closing tags (they rely on line ends). Not all fields appear
>> in all records, but they do adhere to a defined sequence.
>
> lxml can parse XML and broken HTML (see http://lxml.de/parsing.html).
>
> - James

Thanks to everyone. Though I didn't get what I expected, it made me think more about the reason I need to parse these files to begin with. So I'm going to do some more homework on the overall business application and work backward from there.

Once I know how the data fits in the scheme of things, I will create an appropriate abstraction layer, either from scratch or using one of the existing parsers mentioned, but I won't really know which until I have finished modeling.