
Parse bug text file

Started by: CM <cmpython@gmail.com>
First post: 2014-07-27 11:08 -0700
Last post: 2014-07-29 00:48 -0700
Articles: 6 (4 participants)



Contents

  Parse bug text file CM <cmpython@gmail.com> - 2014-07-27 11:08 -0700
    Re: Parse bug text file Chris Angelico <rosuav@gmail.com> - 2014-07-28 04:17 +1000
    Re: Parse bug text file Terry Reedy <tjreedy@udel.edu> - 2014-07-27 15:15 -0400
    Re: Parse bug text file wxjmfauth@gmail.com - 2014-07-27 13:55 -0700
    Re: Parse bug text file CM <cmpython@gmail.com> - 2014-07-28 11:37 -0700
      Re: Parse bug text file wxjmfauth@gmail.com - 2014-07-29 00:48 -0700

#75276 — Parse bug text file

From: CM <cmpython@gmail.com>
Date: 2014-07-27 11:08 -0700
Subject: Parse bug text file
Message-ID: <de64e370-d83a-4919-863b-e744ad20b62a@googlegroups.com>
I have a big text file of bugs that I want to use Python to parse such that the bugs can be neatly filed into a database. I can bumble toward a solution with looping but feel this is a classic example of reinventing the wheel, and yet I'm finding it hard to Google for.

Basically the file is structured like this (silly examples, of course); let's call each of these three a "bug block":


- BUG 2.13.14  When you wear a purple hat, the application locks up.  If you sing the theme to "The Love Boat", the application becomes available again.

- ISSUE 2.13.14  During thunderstorms, the application runs backwards.

- BUG/OPTIMIZE 11.12.12:  Sometimes the application is really slow.  That's too bad. 


Generally, every bug block starts with a "-" as the first character, then some words in all caps, a date in that format, and then the descriptive text. There is always a blank line in between bug blocks, but sometimes there may be a blank line within the bug description as well.

The goal is to grab each bug block, clean up that text (there are CRs in it, etc., but I can do that), and dump it into a database record (the db stuff I can do).  Grabbing the date along the way would be wonderful as well.

I can go through it by opening the text file and reading in the lines; if the first character is a "-", I count that as the start of a bug block. But I am not sure how to find the last line of a bug block. It would be the line before the first line of the next bug block, but I'm not sure of the best way to go about it.

There must be a rather standard way to do something like this in Python, and I'm requesting pointers toward that standard way (or what this type of task is usually called).  Thanks.



#75280

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-07-28 04:17 +1000
Message-ID: <mailman.12364.1406485035.18130.python-list@python.org>
In reply to: #75276
On Mon, Jul 28, 2014 at 4:08 AM, CM <cmpython@gmail.com> wrote:
> I can go through it with opening the text file and reading in the lines, and if the first character is a "-" then count that as the start of a bug block, but I am not sure how to find the last line of a bug block...it would be the line before the first line of the next bug block, but not sure the best way to go about it.
>
> There must be a rather standard way to do something like this in Python, and I'm requesting pointers toward that standard way (or what this type of task is usually called).  Thanks.

This is a fairly standard sort of job, but there's not really a
ready-to-go bit of code. This is just straightforward text
processing.

What I'd do is a stateful parser. Something like this:

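block = None  # accumulator; None until the first "- " line is seen
with open("bugs.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("- "):
            # start of a new block: emit the previous one, then reset
            if block:
                save_to_database(block)  # placeholder for your own DB code
            block = line
        elif block is not None:
            # continuation line; it already ends with "\n", so just append
            block += line
if block:
    save_to_database(block)  # don't forget to grab that last one!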

This is extremely simple, and you might want to use a regex to look
for the upper-case word and date as well (this would falsely notice
any description line that happens to begin with a hyphen and a space).
But the basic idea is: initialize an accumulator to a null state;
whenever you find the beginning of something, emit the previous and
reset the accumulator; otherwise, add to the accumulator. At the end,
emit any current block.
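
A minimal sketch of the regex idea mentioned above, under stated assumptions: the header is taken to be "- ", an upper-case tag (possibly with a slash), a dotted date, and an optional colon. The names HEADER and iter_blocks are made up for illustration, and the pattern is a guess based on the three sample lines, so it would need adjusting against the real file.

```python
import re

# Assumed header shape: "- ", an upper-case tag such as BUG or BUG/OPTIMIZE,
# a dotted date such as 2.13.14, and an optional colon.
HEADER = re.compile(
    r"^- (?P<tag>[A-Z][A-Z/]*) +(?P<date>\d{1,2}\.\d{1,2}\.\d{2,4}):?\s*"
)


def iter_blocks(lines):
    """Yield (tag, date, text) for each bug block in an iterable of lines."""
    current = None  # accumulator: [tag, date, list of text lines], or None
    for line in lines:
        match = HEADER.match(line)
        if match:
            # beginning of a new block: emit the previous one, then reset
            if current is not None:
                yield current[0], current[1], "".join(current[2]).strip()
            current = [match.group("tag"), match.group("date"),
                       [line[match.end():]]]
        elif current is not None:
            # anything else, including "- something" that is not a header
            current[2].append(line)
    if current is not None:
        yield current[0], current[1], "".join(current[2]).strip()


sample = (
    "- BUG 2.13.14  When you wear a purple hat, the application locks up.\n"
    "- not a header, just a description line\n"
    "\n"
    "- ISSUE 2.13.14  During thunderstorms, the application runs backwards.\n"
)
blocks = list(iter_blocks(sample.splitlines(keepends=True)))
```

Here the second sample line starts with a hyphen and a space but stays inside the first block, because it does not match the assumed header pattern.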

ChrisA



#75286

From: Terry Reedy <tjreedy@udel.edu>
Date: 2014-07-27 15:15 -0400
Message-ID: <mailman.12368.1406488571.18130.python-list@python.org>
In reply to: #75276
On 7/27/2014 2:08 PM, CM wrote:
> I have a big text file of bugs that I want to use Python to parse
> such that the bugs can be neatly filed into a database. I can bumble
> toward a solution with looping but feel this is a classic example of
> reinventing the wheel, and yet I'm finding it hard to Google for.
>
> Basically the file is structured like this (silly examples, of
> course), with each of these three lets call a "bug block":
>
>
> - BUG 2.13.14  When you wear a purple hat, the application locks up.
> If you sing the theme to "The Love Boat", the application becomes
> available again.
>
> - ISSUE 2.13.14  During thunderstorms, the application runs
> backwards.
>
> - BUG/OPTIMIZE 11.12.12:  Sometimes the application is really slow.
> That's too bad.
>
>
> Generally, every bug block starts with a "-" as the first character,

I will assume 'always'

> then some words in all caps, a date in that format, and then the
> descriptive text. There is always a blank line in between bug blocks,
> but sometimes there may be a blank line within the bug description as
> well.
>
> The goal is to grab each bug block, clean up that text (there are CRs
> in it, etc., but I can do that), and dump it into a database record
> (the db stuff I can do).  Grabbing the date along the way would be
> wonderful as well.
>
> I can go through it with opening the text file and reading in the
> lines, and if the first character is a "-" then count that as the
> start of a bug block, but I am not sure how to find the last line of
> a bug block...it would be the line before the first line of the next
> bug block, but not sure the best way to go about it.
>
> There must be a rather standard way to do something like this in
> Python, and I'm requesting pointers toward that standard way (or what
> this type of task is usually called).  Thanks.

Split the processing into two phases: generating individual bugs and 
processing each bug. Here is a prototype.

with open(bugfile) as f:
     for bug in bugs(f):
         process(bug)

Here are two examples of the first phase. Use the second for a big file.
(If individual bugs are more than a few lines, I would collect lines
in the generator in a list and use ''.join(<list>).)

bugtext = '''\
- BUG 2.13.14  When you wear a purple hat, the application locks up.
If you sing the theme to "The Love Boat",
the application becomes available again.

- ISSUE 2.13.14  During thunderstorms, the application runs backwards.

- BUG/OPTIMIZE 11.12.12:  Sometimes the application is really slow.
That's too bad
'''

buglist1 = [bug.strip().replace('\n', '') for bug in 
bugtext[1:].split('\n-')]
for bug in buglist1: print(bug)

def bugs(lines):
     lines = iter(lines)
     bug = next(lines)[1:]
     for line in lines:
         if line[:1] != '-':
             bug += line
         else:
             yield bug.strip()
             bug = line[1:]
     yield bug.strip()


buglist2 = [bug for bug in bugs(bugtext.splitlines())]
for bug in buglist2: print(bug)
print(buglist1 == buglist2)

 >>>
BUG 2.13.14  When you wear a purple hat, the application locks up.If you sing the theme to "The Love Boat",the application becomes available again.
ISSUE 2.13.14  During thunderstorms, the application runs backwards.
BUG/OPTIMIZE 11.12.12:  Sometimes the application is really slow.That's too bad
BUG 2.13.14  When you wear a purple hat, the application locks up.If you sing the theme to "The Love Boat",the application becomes available again.
ISSUE 2.13.14  During thunderstorms, the application runs backwards.
BUG/OPTIMIZE 11.12.12:  Sometimes the application is really slow.That's too bad
True
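
In the output above, continuation lines are glued on without a separator ("locks up.If you"). A possible variant of the generator, sketched here under the assumption that a single space between lines is wanted, collects the lines of each block in a list and joins them once. The name bugs_joined is made up for this sketch; it also skips blank lines and tolerates empty input.

```python
def bugs_joined(lines):
    """Yield each bug block as one string, its lines joined by single spaces."""
    parts = []  # stripped lines of the block currently being collected
    for line in lines:
        if line.startswith('-'):
            # a new block begins: emit the one collected so far, if any
            if parts:
                yield ' '.join(parts)
            parts = [line[1:].strip()]
        elif parts and line.strip():
            # non-blank continuation line of the current block
            parts.append(line.strip())
    if parts:
        yield ' '.join(parts)


sample = '''\
- BUG 2.13.14  When you wear a purple hat, the application locks up.
If you sing the theme to "The Love Boat",
the application becomes available again.

- ISSUE 2.13.14  During thunderstorms, the application runs backwards.
'''
result = list(bugs_joined(sample.splitlines()))
```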

Now write process(bug)
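
One way process(bug) might start, as a rough sketch only: it returns a dict instead of writing to a database, and it assumes the dates are month.day.two-digit-year in the 2000s. That reading fits 2.13.14 but is a guess for 11.12.12, so check it against the real data. BUG_RE is a name invented for this sketch.

```python
import re
from datetime import date

# Assumed layout of a bug string as produced by bugs(): TAG, date, optional
# colon, then the description. The date order (month.day.year) is an assumption.
BUG_RE = re.compile(
    r"(?P<tag>[A-Z/]+)\s+(?P<m>\d{1,2})\.(?P<d>\d{1,2})\.(?P<y>\d{2}):?\s*"
    r"(?P<text>.*)",
    re.DOTALL,
)


def process(bug):
    """Split one bug string into tag, date, and text; None if it does not match."""
    match = BUG_RE.match(bug)
    if match is None:
        return None
    # date() raises ValueError for impossible dates, e.g. month 13
    when = date(2000 + int(match.group("y")),
                int(match.group("m")),
                int(match.group("d")))
    return {"tag": match.group("tag"), "date": when,
            "text": match.group("text").strip()}


record = process("BUG 2.13.14  When you wear a purple hat, the application locks up.")
```

A string that does not begin with an upper-case tag and a date yields None, which lets the caller log or inspect the odd blocks rather than lose them.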

-- 
Terry Jan Reedy



#75290

From: wxjmfauth@gmail.com
Date: 2014-07-27 13:55 -0700
Message-ID: <1e1d7380-9bfb-4980-a3ae-dafbe0ef1b94@googlegroups.com>
In reply to: #75276
On Sunday, 27 July 2014 20:08:06 UTC+2, CM wrote:
> I have a big text file of bugs that I want to use Python to parse such that the bugs can be neatly filed into a database. I can bumble toward a solution with looping but feel this is a classic example of reinventing the wheel, and yet I'm finding it hard to Google for.
>
> Basically the file is structured like this (silly examples, of course), with each of these three lets call a "bug block":
>
> - BUG 2.13.14  When you wear a purple hat, the application locks up.  If you sing the theme to "The Love Boat", the application becomes available again.
>
> - ISSUE 2.13.14  During thunderstorms, the application runs backwards.
>
> - BUG/OPTIMIZE 11.12.12:  Sometimes the application is really slow.  That's too bad.
>
> Generally, every bug block starts with a "-" as the first character, then some words in all caps, a date in that format, and then the descriptive text. There is always a blank line in between bug blocks, but sometimes there may be a blank line within the bug description as well.
>
> The goal is to grab each bug block, clean up that text (there are CRs in it, etc., but I can do that), and dump it into a database record (the db stuff I can do).  Grabbing the date along the way would be wonderful as well.
>
> I can go through it with opening the text file and reading in the lines, and if the first character is a "-" then count that as the start of a bug block, but I am not sure how to find the last line of a bug block...it would be the line before the first line of the next bug block, but not sure the best way to go about it.
>
> There must be a rather standard way to do something like this in Python, and I'm requesting pointers toward that standard way (or what this type of task is usually called).  Thanks.

The real question: how to open and close a block given a
delimiter?

>>> s = """\
... - BUG 2.13.14  When you wear
... available again.
... 
...    - ISSUE 2.13.14  Duringthunderstorms
... 
... - BUG/OPTIMIZE 11.12.12:  Sometimes
... 
... - aaa -bbb
... 
... -
... """
>>> def z(s):
...     r = []
...     inblock = False
...     t = ''
...     i = 0
...     while i < len(s):
...         if s[i] == '-':
...             if inblock:
...                 r.append(t)
...                 t = s[i]
...             else:
...                 t = t + s[i]
...                 inblock = not inblock
...         else:
...             t = t + s[i]
...         i = i + 1
...     r.append(t)
...     return r
...     
>>> r = z(s)
>>> for e in r:
...     print(e)
...     
- BUG 2.13.14  When you wear
available again.

   
- ISSUE 2.13.14  Duringthunderstorms


- BUG/OPTIMIZE 11.12.12:  Sometimes


- aaa 
-bbb


-

>>> ''.join(r) == s
True
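
The function above opens a new block at every hyphen, including the one in the middle of "- aaa -bbb". If only hyphens in the first column should count, as the original question describes, a regular expression is one alternative. This is a sketch with made-up names (BLOCK, split_blocks), and it treats an indented "   - ISSUE" line as a continuation rather than a new block.

```python
import re

# A block runs from a "-" at the start of a line up to (not including)
# the next "-" at the start of a line, or to the end of the string.
BLOCK = re.compile(r"^-.*?(?=^-|\Z)", re.MULTILINE | re.DOTALL)


def split_blocks(text):
    """Return the list of blocks that begin with '-' in the first column."""
    return BLOCK.findall(text)


text = "- BUG one\nmore\n\n- aaa -bbb\n\n-\n"
parts = split_blocks(text)
```

As with the character loop, joining the pieces gives back the input, provided the text starts with a hyphen.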

jmf



#75320

From: CM <cmpython@gmail.com>
Date: 2014-07-28 11:37 -0700
Message-ID: <4a5222f3-78a8-4812-897f-3864da0733e6@googlegroups.com>
In reply to: #75276
Thank you, Chris, Terry, and jmf, for these pointers.  Very helpful.

-CM



#75344

From: wxjmfauth@gmail.com
Date: 2014-07-29 00:48 -0700
Message-ID: <35126ce2-0976-44b5-9111-86454b0501b8@googlegroups.com>
In reply to: #75320
On Monday, 28 July 2014 20:37:50 UTC+2, CM wrote:
> Thank you, Chris, Terry, and jmf, for these pointers.  Very helpful.
>
> -CM

I'm wondering what "big text file" means.

>>> with open('UnicodeData.txt', 'r', encoding='ascii') as f:
...     r = f.read()
...     
>>> len(r)
1366791
>>> a = r.split('\n')    # mimicking a '-'
>>> len(a)
24430
>>> na = [e for e in a]
>>> 

jmf


