From: Terry Reedy
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: Parse bug text file
Date: Sun, 27 Jul 2014 15:15:56 -0400

On 7/27/2014 2:08 PM, CM wrote:
> I have a big text file of bugs that I want to use Python to parse
> such that the bugs can be neatly filed into a database. I can bumble
> toward a solution with looping but feel this is a classic example of
> reinventing the wheel, and yet I'm finding it hard to Google for.
>
> Basically the file is structured like this (silly examples, of
> course), with each of these three lets call a "bug block":
>
>
> - BUG 2.13.14 When you wear a purple hat, the application locks up.
> If you sing the theme to "The Love Boat", the application becomes
> available again.
>
> - ISSUE 2.13.14 During thunderstorms, the application runs
> backwards.
>
> - BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow.
> That's too bad.
>
>
> Generally, every bug block starts with a "-" as the first character,

I will assume 'always'.

> then some words in all caps, a date in that format, and then the
> descriptive text. There is always a blank line in between bug blocks,
> but sometimes there may be a blank line within the bug description as
> well.
>
> The goal is to grab each bug block, clean up that text (there are CRs
> in it, etc., but I can do that), and dump it into a database record
> (the db stuff I can do). Grabbing the date along the way would be
> wonderful as well.
>
> I can go through it with opening the text file and reading in the
> lines, and if the first character is a "-" then count that as the
> start of a bug block, but I am not sure how to find the last line of
> a bug block...it would be the line before the first line of the next
> bug block, but not sure the best way to go about it.
>
> There must be a rather standard way to do something like this in
> Python, and I'm requesting pointers toward that standard way (or what
> this type of task is usually called). Thanks.

Split the processing into two phases: generating individual bugs and
processing each bug. Here is a prototype.

with open(bugfile) as f:
    for bug in bugs(f):
        process(bug)

Here are two examples of the first phase. Use the second for a big
file, since it is a generator and never needs the whole file in memory.
(If individual bugs are more than a few lines, I would collect lines in
the generator in a list and use ''.join()).

bugtext = '''\
- BUG 2.13.14 When you wear a purple hat, the application locks up.
If you sing the theme to "The Love Boat",
the application becomes available again.

- ISSUE 2.13.14 During thunderstorms, the application runs backwards.

- BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow.
That's too bad
'''

# Version 1: split the whole text at each newline followed by '-'.
buglist1 = [bug.strip().replace('\n', '') for bug in
            bugtext[1:].split('\n-')]
for bug in buglist1:
    print(bug)

# Version 2: a generator that builds up one bug at a time.
# Lines read from a file keep their '\n'; lines from splitlines() do not.
def bugs(lines):
    lines = iter(lines)
    try:
        bug = next(lines)[1:]   # drop the leading '-' of the first bug
    except StopIteration:
        return                  # empty input: no bugs
    for line in lines:
        if line[:1] != '-':
            bug += line         # continuation (or blank) line
        else:
            yield bug.strip()
            bug = line[1:]      # start the next bug
    yield bug.strip()           # the last bug has no '-' after it

buglist2 = list(bugs(bugtext.splitlines()))
for bug in buglist2:
    print(bug)

print(buglist1 == buglist2)

>>>
BUG 2.13.14 When you wear a purple hat, the application locks up.If you sing the theme to "The Love Boat",the application becomes available again.
ISSUE 2.13.14 During thunderstorms, the application runs backwards.
BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow.That's too bad
BUG 2.13.14 When you wear a purple hat, the application locks up.If you sing the theme to "The Love Boat",the application becomes available again.
ISSUE 2.13.14 During thunderstorms, the application runs backwards.
BUG/OPTIMIZE 11.12.12: Sometimes the application is really slow.That's too bad
True

Both versions join the lines of a bug with no separator, hence
"up.If" in the output; join with ' ' instead if you want word breaks.

Now write process(bug)

--
Terry Jan Reedy
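
A minimal sketch of what process(bug) might look like, covering the "grab the date along the way" part of the question. Everything in it is an assumption drawn from the three sample blocks: the header is taken to be an all-caps kind (with '/' allowed), a date read as month.day.two-digit-year in the 2000s, and an optional colon. The name BUG_RE and the returned tuple are hypothetical; a real version would write to the database instead of returning.

```python
import re
from datetime import date

# Assumed header layout, guessed from the sample blocks (not a spec):
# KIND  M.D.YY  optional ':'  free text
BUG_RE = re.compile(
    r"(?P<kind>[A-Z][A-Z/]*)\s+"
    r"(?P<month>\d{1,2})\.(?P<day>\d{1,2})\.(?P<year>\d{2})\b"
    r":?\s*(?P<text>.*)",
    re.DOTALL,  # let the description span several lines
)

def process(bug):
    """Split one bug string into (kind, date, description).

    Hypothetical helper: raises ValueError when the header does not
    match the assumed layout, so odd blocks are noticed, not dropped.
    """
    m = BUG_RE.match(bug)
    if m is None:
        raise ValueError("unrecognized bug header: %r" % bug[:40])
    # Assumption: dates are month.day.year with years in the 2000s.
    when = date(2000 + int(m.group("year")),
                int(m.group("month")),
                int(m.group("day")))
    # Collapse CRs, newlines and runs of spaces into single spaces.
    text = " ".join(m.group("text").split())
    return m.group("kind"), when, text

if __name__ == "__main__":
    sample = ("BUG/OPTIMIZE 11.12.12: Sometimes the application "
              "is really slow.\r\nThat's too bad")
    print(process(sample))
```

Raising on a mismatch rather than skipping is a design choice: with a hand-written bug file, a block whose header is malformed is probably worth looking at before it goes into the database.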