To: python-list@python.org
From: Peter Otten <__peter__@web.de>
Subject: Re: Discussion on some Code Issues
Date: Thu, 05 Jul 2012 09:30:12 +0200
Organization: None
References: <a4f0e2a9-cc3b-4081-beb9-82f229e95ba1@googlegroups.com> <34484d3d-d4c2-463b-8f83-dba57ce0511d@googlegroups.com>
Newsgroups: comp.lang.python
Message-ID: <mailman.1814.1341473418.4697.python-list@python.org>

subhabangalore@gmail.com wrote:

> On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
>> Dear Group,
>> 
>> I am Sri Subhabrata Banerjee trying to write from Gurgaon, India to
>> discuss some coding issues. If any one of this learned room can shower
>> some light I would be helpful enough.
>> 
>> I got to code a bunch of documents  which are combined together.
>> Like,
>> 
>> 1)A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing. 2) The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection. 3) A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>> 
>> The task is to separate the documents on the fly and to parse each of the
>> documents with a definite set of rules.
>> 
>> Now, the way I am processing is:
>> I am clubbing all the documents together, as,
>> 
>> A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing.The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection. A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>> 
>> But they are separated by a tag set, like,
>> A Mumbai-bound aircraft with 99 passengers on board was struck by
>> lightning on Tuesday evening that led to complete communication failure
>> in mid-air and forced the pilot to make an emergency landing.$ The
>> discovery of a new sub-atomic particle that is key to understanding how
>> the universe is built has an intrinsic Indian connection.$ A bomb
>> explosion outside a shopping mall here on Tuesday left no one injured,
>> but Nigerian authorities put security agencies on high alert fearing more
>> such attacks in the city.
>> 
>> To detect the document boundaries, I am splitting them into a bag of
>> words and using a simple for loop as,
>>
>>     for i in range(len(bag_words)):
>>         if bag_words[i] == "$":
>>             print(bag_words[i], i)
>> 
>> There is no issue. I am segmenting it nicely. I am using annotated corpus
>> so applying parse rules.
>> 
>> The confusion comes next,
>> 
>> As per my problem statement the size of the file (of documents combined
>> together) won’t increase on the fly. So, just to support all kinds of
>> combinations I am appending in a list the “I” values, taking its length,
>> and using slice. Works perfect. Question is, is there a smarter way to
>> achieve this, and a curious question if the documents are on the fly with
>> no preprocessed tag set like “$” how may I do it? From a bunch without
>> EOF isn’t it a classification problem?
>> 
>> There is no question on parsing it seems I am achieving it independent of
>> length of the document.
>> 
>> If any one in the group can suggest how I am dealing with the problem and
>> which portions should be improved and how?
>> 
>> Thanking You in Advance,
>> 
>> Best Regards,
>> Subhabrata Banerjee.
> 
> 
> Hi Steven, It is nice to see your post. They are nice and I learnt so many
> things from you. "I" is for index of the loop. Now my clarification I
> thought to do "import os" and process files in a loop but that is not my
> problem statement. I have to make a big lump of text and detect one chunk.
> Looping over the line number of file I am not using because I may not be
> able to take the slices-this I need. I thought to give re.findall a try
> but that is not giving me the slices. Slice spreads here. The power issue
> of string! I would definitely give it a try. Happy Day Ahead Regards,
> Subhabrata Banerjee.

Then use re.finditer():

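import re

# data is the combined text, with "$" marking each document boundary
start = 0
for match in re.finditer(r"\$", data):
    end = match.start()         # index of the "$" that closes this chunk
    print(start, end)
    print(data[start:end])      # the chunk itself, as a slice
    start = match.end()         # the next chunk begins just past the "$"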

This will omit the text after the last "$", because a chunk is only printed
when the separator that closes it is found. The simplest fix is to append
another "$" separator to the end of your data.
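
If you would rather not modify the data, one alternative is to emit the
trailing chunk explicitly once the loop is done. The following is only a
sketch of that idea: the function name iter_segments and the sample string
are made up for illustration, and it assumes the separator is a literal
string such as "$".

```python
import re

def iter_segments(data, sep="$"):
    """Yield (start, end, text) for each chunk of data between separators.

    iter_segments is a hypothetical helper, not part of the re module.
    """
    start = 0
    # re.escape() makes sure "$" is matched literally, not as end-of-string.
    for match in re.finditer(re.escape(sep), data):
        end = match.start()
        yield start, end, data[start:end]
        start = match.end()
    # Emit whatever follows the last separator, if there is anything left.
    if start < len(data):
        yield start, len(data), data[start:]

# Made-up sample data with "$" between documents and none at the end.
data = "first doc.$ second doc.$ third doc."
for start, end, text in iter_segments(data):
    print(start, end, repr(text))
```

Because the offsets are yielded along with the text, you can collect them in
a list and slice the original string later, which seems to be what you need.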