Groups > comp.lang.python > #75276 > unrolled thread
| Started by | CM <cmpython@gmail.com> |
|---|---|
| First post | 2014-07-27 11:08 -0700 |
| Last post | 2014-07-29 00:48 -0700 |
| Articles | 6 — 4 participants |
Parse bug text file CM <cmpython@gmail.com> - 2014-07-27 11:08 -0700
Re: Parse bug text file Chris Angelico <rosuav@gmail.com> - 2014-07-28 04:17 +1000
Re: Parse bug text file Terry Reedy <tjreedy@udel.edu> - 2014-07-27 15:15 -0400
Re: Parse bug text file wxjmfauth@gmail.com - 2014-07-27 13:55 -0700
Re: Parse bug text file CM <cmpython@gmail.com> - 2014-07-28 11:37 -0700
Re: Parse bug text file wxjmfauth@gmail.com - 2014-07-29 00:48 -0700
| From | CM <cmpython@gmail.com> |
|---|---|
| Date | 2014-07-27 11:08 -0700 |
| Subject | Parse bug text file |
| Message-ID | <de64e370-d83a-4919-863b-e744ad20b62a@googlegroups.com> |
I have a big text file of bugs that I want to use Python to parse so that the bugs can be neatly filed into a database. I can bumble toward a solution with looping, but this feels like a classic example of reinventing the wheel, and yet I'm finding it hard to Google for.

Basically the file is structured like this (silly examples, of course), with each of these three being what I'll call a "bug block":

- BUG 2.13.14 When you wear a purple hat, the application locks up. If you sing the theme to "The Love Boat", the application becomes available again.

- ISSUE 2.13.14 During thunderstorms, the application runs backwards.

- BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow. That's too bad.

Generally, every bug block starts with a "-" as the first character, then some words in all caps, a date in that format, and then the descriptive text. There is always a blank line between bug blocks, but sometimes there may be a blank line within the bug description as well.

The goal is to grab each bug block, clean up that text (there are CRs in it, etc., but I can do that), and dump it into a database record (the db stuff I can do). Grabbing the date along the way would be wonderful as well.

I can go through it by opening the text file and reading in the lines, and if the first character is a "-" then count that as the start of a bug block, but I am not sure how to find the last line of a bug block. It would be the line before the first line of the next bug block, but I'm not sure of the best way to go about it.

There must be a rather standard way to do something like this in Python, and I'm requesting pointers toward that standard way (or what this type of task is usually called). Thanks.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-07-28 04:17 +1000 |
| Message-ID | <mailman.12364.1406485035.18130.python-list@python.org> |
| In reply to | #75276 |
On Mon, Jul 28, 2014 at 4:08 AM, CM <cmpython@gmail.com> wrote:
> I can go through it with opening the text file and reading in the lines, and if the first character is a "-" then count that as the start of a bug block, but I am not sure how to find the last line of a bug block...it would be the line before the first line of the next bug block, but not sure the best way to go about it.
>
> There must be a rather standard way to do something like this in Python, and I'm requesting pointers toward that standard way (or what this type of task is usually called). Thanks.
This is a fairly standard sort of job, but there's not really a
ready-to-go bit of code. This is just straightforward text
processing.
What I'd do is a stateful parser. Something like this:
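# save_to_database() stands for your own database code.
block = None
with open("bugs.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("- "):
            # A new bug starts here: emit the previous one, if any.
            if block: save_to_database(block)
            block = line
        elif block is not None:
            # Lines read from a file keep their "\n", so plain
            # concatenation is enough. Text before the first "- " is skipped.
            block += line
if block: save_to_database(block) # don't forget to grab that last one!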
This is extremely simple, and you might want to use a regex to look
for the upper-case word and date as well (this would falsely notice
any description line that happens to begin with a hyphen and a space).
But the basic idea is: initialize an accumulator to a null state;
whenever you find the beginning of something, emit the previous and
reset the accumulator; otherwise, add to the accumulator. At the end,
emit any current block.
ChrisA
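The regex refinement mentioned above could look roughly like the following. This is only a sketch: the `HEADER` pattern and the `iter_blocks` name are illustrative assumptions, and the pattern assumes every header is "- ", one or more upper-case labels (optionally joined by "/"), then a date such as 2.13.14. A description line that merely starts with "- " is then kept inside the current block.

```python
import re

# Assumed header shape: "- ", upper-case label(s) such as BUG or BUG/OPTIMIZE,
# then a dotted date like 2.13.14. Adjust the pattern to the real file.
HEADER = re.compile(r"^- [A-Z][A-Z/ ]*\d{1,2}\.\d{1,2}\.\d{2,4}")

def iter_blocks(lines):
    """Yield each bug block as one string, using the accumulator idea."""
    block = []                      # an empty list is the "null state"
    for line in lines:
        if HEADER.match(line):
            if block:               # emit the previous block, if any
                yield "".join(block).strip()
            block = [line]          # reset the accumulator
        elif block:                 # ignore anything before the first header
            block.append(line)
    if block:                       # emit the last block
        yield "".join(block).strip()

sample = """\
- BUG 2.13.14 When you wear a purple hat, the application locks up.

- it also happens with a red hat.

- ISSUE 2.13.14 During thunderstorms, the application runs backwards.
"""
blocks = list(iter_blocks(sample.splitlines(keepends=True)))
```

Because it accepts any iterable of lines, the same generator can be fed an open file object directly, so the whole file never has to be held in memory.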
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-07-27 15:15 -0400 |
| Message-ID | <mailman.12368.1406488571.18130.python-list@python.org> |
| In reply to | #75276 |
On 7/27/2014 2:08 PM, CM wrote:
> I have a big text file of bugs that I want to use Python to parse
> such that the bugs can be neatly filed into a database. I can bumble
> toward a solution with looping but feel this is a classic example of
> reinventing the wheel, and yet I'm finding it hard to Google for.
>
> Basically the file is structured like this (silly examples, of
> course), with each of these three lets call a "bug block":
>
>
> - BUG 2.13.14 When you wear a purple hat, the application locks up.
> If you sing the theme to "The Love Boat", the application becomes
> available again.
>
> - ISSUE 2.13.14 During thunderstorms, the application runs
> backwards.
>
> - BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow.
> That's too bad.
>
>
> Generally, every bug block starts with a "-" as the first character,
I will assume 'always'
> then some words in all caps, a date in that format, and then the
> descriptive text. There is always a blank line in between bug blocks,
> but sometimes there may be a blank line within the bug description as
> well.
>
> The goal is to grab each bug block, clean up that text (there are CRs
> in it, etc., but I can do that), and dump it into a database record
> (the db stuff I can do). Grabbing the date along the way would be
> wonderful as well.
>
> I can go through it with opening the text file and reading in the
> lines, and if the first character is a "-" then count that as the
> start of a bug block, but I am not sure how to find the last line of
> a bug block...it would be the line before the first line of the next
> bug block, but not sure the best way to go about it.
>
> There must be a rather standard way to do something like this in
> Python, and I'm requesting pointers toward that standard way (or what
> this type of task is usually called). Thanks.
Split the processing into two phases: generating individual bugs and
processing each bug. Here is a prototype.
with open(bugfile) as f:
    for bug in bugs(f):
        process(bug)
Here are two examples of the first phase. Use the second for a big file.
(If individual bugs are more than a few lines, I would collect lines
in the generator in a list and use ''.join(<list>)).
bugtext = '''\
- BUG 2.13.14 When you wear a purple hat, the application locks up.
If you sing the theme to "The Love Boat",
the application becomes available again.
- ISSUE 2.13.14 During thunderstorms, the application runs backwards.
- BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow.
That's too bad
'''

buglist1 = [bug.strip().replace('\n', '') for bug in
            bugtext[1:].split('\n-')]
for bug in buglist1: print(bug)

def bugs(lines):
    lines = iter(lines)
    bug = next(lines)[1:]
    for line in lines:
        if line[:1] != '-':
            bug += line
        else:
            yield bug.strip()
            bug = line[1:]
    yield bug.strip()

buglist2 = [bug for bug in bugs(bugtext.splitlines())]
for bug in buglist2: print(bug)
print(buglist1 == buglist2)
>>>
BUG 2.13.14 When you wear a purple hat, the application locks up.If you sing the theme to "The Love Boat",the application becomes available again.
ISSUE 2.13.14 During thunderstorms, the application runs backwards.
BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow.That's too bad
BUG 2.13.14 When you wear a purple hat, the application locks up.If you sing the theme to "The Love Boat",the application becomes available again.
ISSUE 2.13.14 During thunderstorms, the application runs backwards.
BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow.That's too bad
True
Now write process(bug)
--
Terry Jan Reedy
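One possible starting point for process(bug), offered as a sketch rather than the intended implementation. It assumes a block with the leading "-" already stripped looks like "LABEL m.d.yy[:] description", and that the date is month.day.two-digit-year (the sample 2.13.14 cannot be day.month, but this is still a guess about the real file). The `BUG_RE` name and the returned tuple layout are illustrative choices.

```python
import re
from datetime import datetime

# Assumption: "LABEL m.d.yy[:] description", e.g.
# "BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow."
BUG_RE = re.compile(
    r"(?P<label>[A-Z][A-Z/ ]*?)\s+"
    r"(?P<date>\d{1,2}\.\d{1,2}\.\d{2}):?\s*"
    r"(?P<text>.*)",
    re.DOTALL,                      # let the description span several lines
)

def process(bug):
    """Return (label, date, description), or None if the header does not match."""
    m = BUG_RE.match(bug)
    if m is None:
        return None
    # Assumed field order: month.day.year with a two-digit year.
    date = datetime.strptime(m.group("date"), "%m.%d.%y").date()
    # Collapse CRs, newlines and runs of spaces into single spaces.
    description = " ".join(m.group("text").split())
    return m.group("label"), date, description

record = process(
    "BUG/OPTIMIZE 11.12.12: Sometimes the application\r\n"
    "is really slow.  That's too bad."
)
```

Returning None for a non-matching block lets the caller log or collect malformed entries instead of stopping on the first one.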
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-07-27 13:55 -0700 |
| Message-ID | <1e1d7380-9bfb-4980-a3ae-dafbe0ef1b94@googlegroups.com> |
| In reply to | #75276 |
On Sunday, 27 July 2014 at 20:08:06 UTC+2, CM wrote:
> I have a big text file of bugs that I want to use Python to parse such that the bugs can be neatly filed into a database.
> [...]

The real question: how to open and close a block given a delimiter?

>>> s = """\
... - BUG 2.13.14 When you wear
... available again.
...
... - ISSUE 2.13.14 During thunderstorms
...
... - BUG/OPTIMIZE 11.12.12: Sometimes
...
... - aaa -bbb
...
... -
... """
>>> def z(s):
...     r = []
...     inblock = False
...     t = ''
...     i = 0
...     while i < len(s):
...         if s[i] == '-':
...             if inblock:
...                 r.append(t)
...                 t = s[i]
...             else:
...                 t = t + s[i]
...                 inblock = not inblock
...         else:
...             t = t + s[i]
...         i = i + 1
...     r.append(t)
...     return r
...
>>> r = z(s)
>>> for e in r:
...     print(e)
...
- BUG 2.13.14 When you wear
available again.


- ISSUE 2.13.14 During thunderstorms


- BUG/OPTIMIZE 11.12.12: Sometimes


- aaa
-bbb


-

>>> ''.join(r) == s
True

jmf
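When the whole file fits in memory, splitting on the delimiter can also be done in one step with re.split. The following is a sketch, assuming a block header is a line that starts with "- " followed by an upper-case letter. The lookahead keeps the header text attached to the block that follows, so only the newline in front of it is consumed.

```python
import re

text = """\
- BUG 2.13.14 When you wear
available again.

- ISSUE 2.13.14 During thunderstorms

- BUG/OPTIMIZE 11.12.12: Sometimes
"""

# Split at each newline that is immediately followed by "- " and an
# upper-case letter (assumed header shape), then drop surrounding whitespace.
blocks = [b.strip() for b in re.split(r"\n(?=- [A-Z])", text) if b.strip()]
```

Unlike a split on every "-" character, a hyphen in the middle of a description does not start a new block here.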
| From | CM <cmpython@gmail.com> |
|---|---|
| Date | 2014-07-28 11:37 -0700 |
| Message-ID | <4a5222f3-78a8-4812-897f-3864da0733e6@googlegroups.com> |
| In reply to | #75276 |
Thank you, Chris, Terry, and jmf, for these pointers. Very helpful. -CM
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-07-29 00:48 -0700 |
| Message-ID | <35126ce2-0976-44b5-9111-86454b0501b8@googlegroups.com> |
| In reply to | #75320 |
On Monday, 28 July 2014 at 20:37:50 UTC+2, CM wrote:
> Thank you, Chris, Terry, and jmf, for these pointers. Very helpful.
>
> -CM
I'm wondering what "big text file" means.
>>> with open('UnicodeData.txt', 'r', encoding='ascii') as f:
...     r = f.read()
...
>>> len(r)
1366791
>>> a = r.split('\n') # mimicking a '-'
>>> len(a)
24430
>>> na = [e for e in a]
>>>
jmf