Newsgroups: comp.lang.python
Date: Thu, 5 Jul 2012 07:33:27 -0700 (PDT)
Subject: Re: Discussion on some Code Issues
From: subhabangalore@gmail.com

Dear Peter,
That is a nice one. I am thinking I could write "for line in f" sort of code,
which is easy, but then how would I find the slices? By the way, do you know
whether I can convert an index position in the file to a position in the list,
provided I am building the list from the same file we are reading?
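
One possible way to do such a mapping, as a rough sketch only. It assumes the
list is built by splitting the text on whitespace and that "$" is the
separator. The helper names separator_offsets, word_spans and
offset_to_word_index are made up for illustration, and the sample text is
invented.

```python
import io
import re
from bisect import bisect_right

def separator_offsets(lines, sep="$"):
    """Yield the character offset of every separator, reading line by line."""
    offset = 0  # characters consumed before the current line
    for line in lines:
        for match in re.finditer(re.escape(sep), line):
            yield offset + match.start()
        offset += len(line)  # the line keeps its trailing newline

def word_spans(text):
    """Return (word, start, end) for each whitespace-delimited token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def offset_to_word_index(spans, offset):
    """Index of the first token whose end lies beyond the character offset."""
    ends = [end for _, _, end in spans]
    return bisect_right(ends, offset)

# Invented sample text; io.StringIO stands in for an open text file.
text = "first doc here $ second\ndoc $ third\n"
sep_offsets = list(separator_offsets(io.StringIO(text)))
spans = word_spans(text)
words = [word for word, _, _ in spans]
sep_word_idx = [offset_to_word_index(spans, off) for off in sep_offsets]

print(sep_offsets)              # character positions of "$" in the text
print(sep_word_idx)             # positions of "$" in the word list
print(words[:sep_word_idx[0]])  # words of the first document
```

The offsets here are character positions in the decoded text, which is not
necessarily the same thing as the byte position reported by f.tell().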

Best Regards,
Subhabrata.

On Thursday, July 5, 2012 1:00:12 PM UTC+5:30, Peter Otten wrote:
> subhabangalore@gmail.com wrote:
>
> > On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
> >> Dear Group,
> >>
> >> I am Sri Subhabrata Banerjee trying to write from Gurgaon, India to
> >> discuss some coding issues. If any one of this learned room can shower
> >> some light I would be helpful enough.
> >>
> >> I got to code a bunch of documents which are combined together.
> >> Like,
> >>
> >> 1) A Mumbai-bound aircraft with 99 passengers on board was struck by
> >> lightning on Tuesday evening that led to complete communication failure
> >> in mid-air and forced the pilot to make an emergency landing. 2) The
> >> discovery of a new sub-atomic particle that is key to understanding how
> >> the universe is built has an intrinsic Indian connection. 3) A bomb
> >> explosion outside a shopping mall here on Tuesday left no one injured,
> >> but Nigerian authorities put security agencies on high alert fearing more
> >> such attacks in the city.
> >>
> >> The task is to separate the documents on the fly and to parse each of the
> >> documents with a definite set of rules.
> >>
> >> Now, the way I am processing is:
> >> I am clubbing all the documents together, as,
> >>
> >> A Mumbai-bound aircraft with 99 passengers on board was struck by
> >> lightning on Tuesday evening that led to complete communication failure
> >> in mid-air and forced the pilot to make an emergency landing.The
> >> discovery of a new sub-atomic particle that is key to understanding how
> >> the universe is built has an intrinsic Indian connection. A bomb
> >> explosion outside a shopping mall here on Tuesday left no one injured,
> >> but Nigerian authorities put security agencies on high alert fearing more
> >> such attacks in the city.
> >>
> >> But they are separated by a tag set, like,
> >> A Mumbai-bound aircraft with 99 passengers on board was struck by
> >> lightning on Tuesday evening that led to complete communication failure
> >> in mid-air and forced the pilot to make an emergency landing.$ The
> >> discovery of a new sub-atomic particle that is key to understanding how
> >> the universe is built has an intrinsic Indian connection.$ A bomb
> >> explosion outside a shopping mall here on Tuesday left no one injured,
> >> but Nigerian authorities put security agencies on high alert fearing more
> >> such attacks in the city.
> >>
> >> To detect the document boundaries, I am splitting them into a bag of
> >> words and using a simple for loop as,
> >>     for i in range(len(bag_words)):
> >>         if bag_words[i] == "$":
> >>             print(bag_words[i], i)
> >>
> >> There is no issue. I am segmenting it nicely. I am using annotated corpus
> >> so applying parse rules.
> >>
> >> The confusion comes next,
> >>=20
> >> As per my problem statement the size of the file (of documents combined
> >> together) won’t increase on the fly. So, just to support all kinds of
> >> combinations I am appending in a list the “I” values, taking its length,
> >> and using slice. Works perfect. Question is, is there a smarter way to
> >> achieve this, and a curious question if the documents are on the fly with
> >> no preprocessed tag set like “$” how may I do it? From a bunch without
> >> EOF isn’t it a classification problem?
> >>
> >> There is no question on parsing it seems I am achieving it independent of
> >> length of the document.
> >>
> >> If any one in the group can suggest how I am dealing with the problem and
> >> which portions should be improved and how?
> >>
> >> Thanking You in Advance,
> >>
> >> Best Regards,
> >> Subhabrata Banerjee.
> >
> >
> > Hi Steven, It is nice to see your post. They are nice and I learnt so many
> > things from you. "I" is for index of the loop. Now my clarification I
> > thought to do "import os" and process files in a loop but that is not my
> > problem statement. I have to make a big lump of text and detect one chunk.
> > Looping over the line number of file I am not using because I may not be
> > able to take the slices-this I need. I thought to give re.findall a try
> > but that is not giving me the slices. Slice spreads here. The power issue
> > of string! I would definitely give it a try. Happy Day Ahead Regards,
> > Subhabrata Banerjee.
>
> Then use re.finditer():
>
> import re
>
> start = 0
> for match in re.finditer(r"\$", data):
>     end = match.start()
>     print(start, end)
>     print(data[start:end])
>     start = match.end()
>
> This will omit the last text. The simplest fix is to put another "$"
> separator at the end of your data.
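
An alternative to appending a trailing "$", as a rough sketch only: yield
whatever text remains after the last separator once the loop has finished.
The function name split_documents and the sample data are invented for
illustration; the loop itself is the same re.finditer() loop as above.

```python
import re

def split_documents(data, sep=r"\$"):
    """Yield (start, end, text) for each chunk between separators."""
    start = 0
    for match in re.finditer(sep, data):
        end = match.start()
        yield start, end, data[start:end]
        start = match.end()
    # Text after the last separator, which the plain loop would skip.
    if start < len(data):
        yield start, len(data), data[start:]

# Invented sample data with two separators and no trailing "$".
data = "first doc.$ second doc.$ third doc."
docs = list(split_documents(data))
for start, end, chunk in docs:
    print(start, end, repr(chunk))
```

Each tuple carries the slice bounds, so data[start:end] gives the chunk back
and the bounds can be stored in a list for later slicing.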