| Newsgroups | comp.lang.python |
|---|---|
| Date | 2012-07-05 07:33 -0700 |
| References | <a4f0e2a9-cc3b-4081-beb9-82f229e95ba1@googlegroups.com> <34484d3d-d4c2-463b-8f83-dba57ce0511d@googlegroups.com> <mailman.1814.1341473418.4697.python-list@python.org> |
| Subject | Re: Discussion on some Code Issues |
| From | subhabangalore@gmail.com |
| Message-ID | <mailman.1828.1341498810.4697.python-list@python.org> (permalink) |
Dear Peter,

That is a nice one. I am thinking if I can write "for lines in f" sort of
code that is easy but then how to find out the slices then, btw do you know
in any case may I convert the index position of file to the list position
provided I am writing the list for the same file we are reading.

Best Regards,
Subhabrata.

On Thursday, July 5, 2012 1:00:12 PM UTC+5:30, Peter Otten wrote:
> subhabangalore@gmail.com wrote:
>
> > On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
> >> Dear Group,
> >>
> >> I am Sri Subhabrata Banerjee trying to write from Gurgaon, India to
> >> discuss some coding issues. If any one of this learned room can shower
> >> some light I would be helpful enough.
> >>
> >> I got to code a bunch of documents which are combined together.
> >> Like,
> >>
> >> 1)A Mumbai-bound aircraft with 99 passengers on board was struck by
> >> lightning on Tuesday evening that led to complete communication failure
> >> in mid-air and forced the pilot to make an emergency landing. 2) The
> >> discovery of a new sub-atomic particle that is key to understanding how
> >> the universe is built has an intrinsic Indian connection. 3) A bomb
> >> explosion outside a shopping mall here on Tuesday left no one injured,
> >> but Nigerian authorities put security agencies on high alert fearing more
> >> such attacks in the city.
> >>
> >> The task is to separate the documents on the fly and to parse each of the
> >> documents with a definite set of rules.
> >>
> >> Now, the way I am processing is:
> >> I am clubbing all the documents together, as,
> >>
> >> A Mumbai-bound aircraft with 99 passengers on board was struck by
> >> lightning on Tuesday evening that led to complete communication failure
> >> in mid-air and forced the pilot to make an emergency landing.The
> >> discovery of a new sub-atomic particle that is key to understanding how
> >> the universe is built has an intrinsic Indian connection. A bomb
> >> explosion outside a shopping mall here on Tuesday left no one injured,
> >> but Nigerian authorities put security agencies on high alert fearing more
> >> such attacks in the city.
> >>
> >> But they are separated by a tag set, like,
> >> A Mumbai-bound aircraft with 99 passengers on board was struck by
> >> lightning on Tuesday evening that led to complete communication failure
> >> in mid-air and forced the pilot to make an emergency landing.$ The
> >> discovery of a new sub-atomic particle that is key to understanding how
> >> the universe is built has an intrinsic Indian connection.$ A bomb
> >> explosion outside a shopping mall here on Tuesday left no one injured,
> >> but Nigerian authorities put security agencies on high alert fearing more
> >> such attacks in the city.
> >>
> >> To detect the document boundaries, I am splitting them into a bag of
> >> words and using a simple for loop as,
> >> for i in range(len(bag_words)):
> >>     if bag_words[i]=="$":
> >>         print (bag_words[i],i)
> >>
> >> There is no issue. I am segmenting it nicely. I am using annotated corpus
> >> so applying parse rules.
> >>
> >> The confusion comes next,
> >>
> >> As per my problem statement the size of the file (of documents combined
> >> together) won’t increase on the fly. So, just to support all kinds of
> >> combinations I am appending in a list the “I” values, taking its length,
> >> and using slice. Works perfect. Question is, is there a smarter way to
> >> achieve this, and a curious question if the documents are on the fly with
> >> no preprocessed tag set like “$” how may I do it? From a bunch without
> >> EOF isn’t it a classification problem?
> >>
> >> There is no question on parsing it seems I am achieving it independent of
> >> length of the document.
> >>
> >> If any one in the group can suggest how I am dealing with the problem and
> >> which portions should be improved and how?
> >>
> >> Thanking You in Advance,
> >>
> >> Best Regards,
> >> Subhabrata Banerjee.
> >
> >
> > Hi Steven, It is nice to see your post. They are nice and I learnt so many
> > things from you. "I" is for index of the loop. Now my clarification I
> > thought to do "import os" and process files in a loop but that is not my
> > problem statement. I have to make a big lump of text and detect one chunk.
> > Looping over the line number of file I am not using because I may not be
> > able to take the slices-this I need. I thought to give re.findall a try
> > but that is not giving me the slices. Slice spreads here. The power issue
> > of string! I would definitely give it a try. Happy Day Ahead Regards,
> > Subhabrata Banerjee.
>
> Then use re.finditer():
>
> start = 0
> for match in re.finditer(r"\$", data):
>     end = match.start()
>     print(start, end)
>     print(data[start:end])
>     start = match.end()
>
> This will omit the last text. The simplest fix is to put another "$"
> separator at the end of your data.
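
For readers following along, here is a minimal sketch of the re.finditer() approach quoted above. It emits the trailing text instead of requiring an extra "$" at the end, and adds one possible reading of the "index position to list position" question: mapping a character offset back to the chunk that contains it. The names split_documents and chunk_index and the sample string are hypothetical, chosen for illustration only; they do not appear in the thread.

```python
import re
from bisect import bisect_right


def split_documents(data, sep="$"):
    """Yield (start, end, text) for each chunk of data delimited by sep.

    Hypothetical helper: same loop as the quoted re.finditer() example,
    plus a final step for the text after the last separator.
    """
    start = 0
    for match in re.finditer(re.escape(sep), data):
        end = match.start()
        yield start, end, data[start:end]
        start = match.end()
    # The bare loop drops whatever follows the last separator; emit it here.
    if start < len(data):
        yield start, len(data), data[start:]


def chunk_index(starts, position):
    """Map a character offset to the index of the chunk it falls in.

    Assumes starts is the sorted list of chunk start offsets. An offset
    that points at a separator is attributed to the preceding chunk.
    """
    return bisect_right(starts, position) - 1


# Made-up sample data standing in for the combined documents.
data = "first doc.$ second doc.$ third doc."
chunks = list(split_documents(data))
starts = [start for start, _end, _text in chunks]

for start, end, text in chunks:
    print(start, end, repr(text))

# Offset 12 lies inside the second chunk, so this prints 1.
print(chunk_index(starts, 12))
```

Keeping the (start, end) pairs alongside the text means the slices stay available after the split, and the sorted start offsets make the reverse lookup a binary search rather than a scan.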