From: Peter Otten <__peter__@web.de>
Subject: Re: Discussion on some Code Issues
Date: Thu, 05 Jul 2012 09:30:12 +0200
Newsgroups: comp.lang.python
References: <34484d3d-d4c2-463b-8f83-dba57ce0511d@googlegroups.com>

subhabangalore@gmail.com wrote:

> On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
>> Dear Group,
>>
>> I am Sri Subhabrata Banerjee trying to write from Gurgaon, India to
>> discuss some coding issues. If any one of this learned room can shower
>> some light I would be helpful enough.
>>
>> I got to code a bunch of documents which are combined together.
>> Like,
>>
>> 1)A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing. 2) The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection. 3) A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>>
>> The task is to separate the documents on the fly and to parse each of the
>> documents with a definite set of rules.
>>
>> Now, the way I am processing is:
>> I am clubbing all the documents together, as,
>>
>> A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing.The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection. A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>>
>> But they are separated by a tag set, like,
>> A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing.$ The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection.$ A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>>
>> To detect the document boundaries, I am splitting them into a bag of
>> words and using a simple for loop as, for i in range(len(bag_words)):
>>     if bag_words[i]=="$":
>>         print (bag_words[i],i)
>>
>> There is no issue. I am segmenting it nicely. I am using annotated corpus
>> so applying parse rules.
>>
>> The confusion comes next,
>>
>> As per my problem statement the size of the file (of documents combined
>> together) won’t increase on the fly. So, just to support all kinds of
>> combinations I am appending in a list the “I” values, taking its length,
>> and using slice. Works perfect. Question is, is there a smarter way to
>> achieve this, and a curious question if the documents are on the fly with
>> no preprocessed tag set like “$” how may I do it? From a bunch without
>> EOF isn’t it a classification problem?
>>
>> There is no question on parsing it seems I am achieving it independent of
>> length of the document.
>>
>> If any one in the group can suggest how I am dealing with the problem and
>> which portions should be improved and how?
>>
>> Thanking You in Advance,
>>
>> Best Regards,
>> Subhabrata Banerjee.
>
>
> Hi Steven, It is nice to see your post. They are nice and I learnt so many
> things from you. "I" is for index of the loop. Now my clarification I
> thought to do "import os" and process files in a loop but that is not my
> problem statement. I have to make a big lump of text and detect one chunk.
> Looping over the line number of file I am not using because I may not be
> able to take the slices-this I need. I thought to give re.findall a try
> but that is not giving me the slices. Slice spreads here. The power issue
> of string! I would definitely give it a try. Happy Day Ahead Regards,
> Subhabrata Banerjee.

Then use re.finditer():

start = 0
for match in re.finditer(r"\$", data):
    end = match.start()
    print(start, end)
    print(data[start:end])
    start = match.end()

This will omit the last text.
The simplest fix is to put another "$" separator at the end of your data.
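
If you would rather not touch the data, another option is to emit whatever is left after the last match yourself. Here is a minimal sketch of that idea; the function name split_with_offsets and the sample string are made up for illustration, and it assumes the separator is a literal "$" as in your example:

```python
import re

def split_with_offsets(data, sep="$"):
    """Yield (start, end, text) for each chunk of data between separators."""
    start = 0
    # re.escape() is needed because "$" is a regex metacharacter.
    for match in re.finditer(re.escape(sep), data):
        end = match.start()
        yield start, end, data[start:end]
        start = match.end()
    # Emit the text after the last separator, so no trailing "$" is required.
    if start < len(data):
        yield start, len(data), data[start:]

# Hypothetical sample data, standing in for the combined documents.
sample = "first doc.$ second doc.$ third doc."
for start, end, text in split_with_offsets(sample):
    print(start, end, repr(text))
```

Since each chunk comes with its start and end offsets, you can still slice the original string with data[start:end] later on. If you do not need the offsets at all, data.split("$") gives you the chunks directly.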