| Started by | subhabangalore@gmail.com |
|---|---|
| First post | 2012-07-04 16:21 -0700 |
| Last post | 2012-07-07 22:42 -0700 |
| Articles | 20 on this page of 27 — 8 participants |
Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-04 16:21 -0700
Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-05 00:02 +0000
Re: Discussion on some Code Issues Rick Johnson <rantingrickjohnson@gmail.com> - 2012-07-04 17:08 -0700
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-04 20:25 -0700
Re: Discussion on some Code Issues Peter Otten <__peter__@web.de> - 2012-07-05 09:30 +0200
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-05 07:33 -0700
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-05 07:33 -0700
Re: Discussion on some Code Issues Peter Otten <__peter__@web.de> - 2012-07-06 09:35 +0200
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 12:54 -0700
Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-07 16:51 -0400
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 22:42 -0700
Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-08 18:03 +1000
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-08 10:05 -0700
Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 03:17 +1000
Re: Discussion on some Code Issues Roy Smith <roy@panix.com> - 2012-07-08 14:17 -0400
Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 07:54 +1000
Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-09 00:57 +0000
Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 18:41 +1000
Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-09 12:24 +0000
Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-10 00:47 +1000
Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-09 12:49 -0400
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-16 07:17 -0700
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-16 07:17 -0700
Re: Discussion on some Code Issues MRAB <python@mrabarnett.plus.com> - 2012-07-08 19:27 +0100
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-08 10:05 -0700
Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-08 15:07 -0400
Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 22:42 -0700
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2012-07-04 16:21 -0700 |
| Subject | Discussion on some Code Issues |
| Message-ID | <a4f0e2a9-cc3b-4081-beb9-82f229e95ba1@googlegroups.com> |
Dear Group,
I am Sri Subhabrata Banerjee trying to write from Gurgaon, India to discuss some coding issues. If any one of this learned room can shower some light I would be helpful enough.
I got to code a bunch of documents which are combined together.
Like,
1)A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.
2) The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.
3) A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.
The task is to separate the documents on the fly and to parse each of the documents with a definite set of rules.
Now, the way I am processing is:
I am clubbing all the documents together, as,
A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection. A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.
But they are separated by a tag set, like,
A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.$
The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.$
A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.
To detect the document boundaries, I am splitting them into a bag of words and using a simple for loop as,
    for i in range(len(bag_words)):
        if bag_words[i]=="$":
            print (bag_words[i],i)
There is no issue. I am segmenting it nicely. I am using annotated corpus so applying parse rules.
The confusion comes next,
As per my problem statement the size of the file (of documents combined together) won’t increase on the fly. So, just to support all kinds of combinations I am appending in a list the “I” values, taking its length, and using slice. Works perfect. Question is, is there a smarter way to achieve this, and a curious question if the documents are on the fly with no preprocessed tag set like “$” how may I do it? From a bunch without EOF isn’t it a classification problem?
There is no question on parsing it seems I am achieving it independent of length of the document.
If any one in the group can suggest how I am dealing with the problem and which portions should be improved and how?
Thanking You in Advance,
Best Regards,
Subhabrata Banerjee.
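The index-and-slice approach described in this post can be sketched roughly as follows. This is only an illustration: `bag_words` is the name used in the post, while `split_on_marker` and the sample words are made up here, and the sketch assumes the "$" marker always appears as a separate token.

```python
def split_on_marker(bag_words, marker="$"):
    """Return the sub-lists of bag_words found between marker tokens."""
    # Positions of every marker token, as collected by the loop in the post.
    boundaries = [i for i, word in enumerate(bag_words) if word == marker]

    documents = []
    start = 0
    # Treat len(bag_words) as a final boundary so the last document is kept.
    for end in boundaries + [len(bag_words)]:
        documents.append(bag_words[start:end])
        start = end + 1  # skip the marker itself
    return documents


# Made-up sample data, not the real corpus.
bag_words = "aircraft struck by lightning $ new particle found $ bomb explosion".split()
for doc in split_on_marker(bag_words):
    print(doc)
```

If the marker is glued to the preceding word (as in "landing.$"), a plain `split()` will not produce a separate "$" token, so the tokenisation step would have to handle that first.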
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-07-05 00:02 +0000 |
| Message-ID | <4ff4d9a3$0$29864$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #24882 |
On Wed, 04 Jul 2012 16:21:46 -0700, subhabangalore wrote:
[...]
> I got to code a bunch of documents which are combined together.
[...]
> The task is to separate the documents on the fly and to parse each of
> the documents with a definite set of rules.
>
> Now, the way I am processing is:
> I am clubbing all the documents together, as,
[...]
> But they are separated by a tag set
[...]
> To detect the document boundaries,
Let me see if I understand your problem.
You have a bunch of documents. You stick them all together into one
enormous lump. And then you try to detect the boundaries between one file
and the next within the enormous lump.
Why not just process each file separately? A simple for loop over the
list of files, before consolidating them into one giant file, will avoid
all the difficulty of trying to detect boundaries within files.
Instead of:
    merge(output_filename, list_of_files)
    for word in parse(output_filename):
        if boundary_detected: do_something()
        process(word)
Do this instead:
    for filename in list_of_files:
        do_something()
        for word in parse(filename):
            process(word)
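A runnable sketch of the per-file pattern, under stated assumptions: `parse` here simply splits on whitespace, and `process_files` is a hypothetical helper that collects the words per file. Both stand in for whatever real parsing rules are applied and are not taken from the thread.

```python
def parse(filename):
    """Yield whitespace-separated words from one file (placeholder parser)."""
    with open(filename, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                yield word


def process_files(list_of_files):
    """Return {filename: [words]}, handling each document on its own."""
    results = {}
    for filename in list_of_files:
        # A new file is a new document, so no boundary detection is needed.
        results[filename] = list(parse(filename))
    return results
```

Calling `process_files(["doc1.txt", "doc2.txt"])` (hypothetical file names) would return one word list per document, with the file boundary itself acting as the separator.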
> I am splitting them into a bag of
> words and using a simple for loop as,
> for i in range(len(bag_words)):
>     if bag_words[i]=="$":
>         print (bag_words[i],i)
What happens if a file already has a $ in it?
> There is no issue. I am segmenting it nicely. I am using annotated
> corpus so applying parse rules.
>
> The confusion comes next,
>
> As per my problem statement the size of the file (of documents combined
> together) won’t increase on the fly. So, just to support all kinds of
> combinations I am appending in a list the “I” values, taking its length,
> and using slice. Works perfect.
I don't understand this. What sort of combinations do you think you need
to support? What are "I" values, and why are they important?
--
Steven
| From | Rick Johnson <rantingrickjohnson@gmail.com> |
|---|---|
| Date | 2012-07-04 17:08 -0700 |
| Message-ID | <59b3e7c3-83d8-43a7-9d24-1df682f0a353@j10g2000yqd.googlegroups.com> |
| In reply to | #24882 |
On Jul 4, 6:21 pm, subhabangal...@gmail.com wrote:
> [...]
> To detect the document boundaries, I am splitting them into a bag
> of words and using a simple for loop as,
>
> for i in range(len(bag_words)):
>     if bag_words[i]=="$":
>         print (bag_words[i],i)
Ignoring that you are attacking the problem incorrectly: that is a very poor method of splitting a string, especially since the Python gods have given you *power* over string objects. But you are going to have an even greater problem if the string contains a "$" char that you DID NOT insert :-O. You'd be wise to use a sep that is not likely to be in the file data. For example: "<SEP>" or "<SPLIT-HERE>".
But even that approach is naive! Why not streamline the entire process and pass a list of file paths to a custom parser object instead?
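As a rough illustration of the separator advice, the sketch below joins and splits documents with `str.join` and `str.split`. The helper names `join_documents` and `split_documents` and the sample strings are invented for this example; the only assumption carried over from the post is a separator such as "<SEP>" that should not occur in the data, which the sketch checks before joining.

```python
SEP = "<SEP>"  # hypothetical separator, assumed never to occur in the documents


def join_documents(documents, sep=SEP):
    """Join documents with sep, refusing input that already contains it."""
    for doc in documents:
        if sep in doc:
            raise ValueError("separator %r occurs inside a document" % sep)
    return sep.join(documents)


def split_documents(data, sep=SEP):
    """Recover the original documents from the combined string."""
    return data.split(sep)


# Made-up sample data; note the "$" inside the first document is harmless.
documents = ["The fare was $99.", "A second, made-up document."]
combined = join_documents(documents)
print(split_documents(combined))
```

Checking for the separator at join time turns a silent mis-split into an immediate error, which is usually easier to debug.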
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2012-07-04 20:25 -0700 |
| Message-ID | <34484d3d-d4c2-463b-8f83-dba57ce0511d@googlegroups.com> |
| In reply to | #24882 |
On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
> Dear Group,
> [...]
Hi Steven,
It is nice to see your post. They are nice and I learnt so many things from you.
"I" is for index of the loop.
Now my clarification I thought to do "import os" and process files in a loop but that is not my problem statement. I have to make a big lump of text and detect one chunk. Looping over the line number of file I am not using because I may not be able to take the slices-this I need. I thought to give re.findall a try but that is not giving me the slices. Slice spreads here. The power issue of string! I would definitely give it a try.
Happy Day Ahead
Regards,
Subhabrata Banerjee.
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-07-05 09:30 +0200 |
| Message-ID | <mailman.1814.1341473418.4697.python-list@python.org> |
| In reply to | #24893 |
subhabangalore@gmail.com wrote:
> On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
>> Dear Group,
>> [...]
>
>
> Hi Steven, It is nice to see your post. They are nice and I learnt so many
> things from you. "I" is for index of the loop. Now my clarification I
> thought to do "import os" and process files in a loop but that is not my
> problem statement. I have to make a big lump of text and detect one chunk.
> Looping over the line number of file I am not using because I may not be
> able to take the slices-this I need. I thought to give re.findall a try
> but that is not giving me the slices. Slice spreads here. The power issue
> of string! I would definitely give it a try. Happy Day Ahead Regards,
> Subhabrata Banerjee.
Then use re.finditer():
    import re

    start = 0
    for match in re.finditer(r"\$", data):
        end = match.start()
        print(start, end)
        print(data[start:end])
        start = match.end()
This will omit the last text. The simplest fix is to put another "$"
separator at the end of your data.
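A self-contained variant of the same idea, sketched under assumptions: instead of appending an extra "$" to the data, it emits whatever follows the last separator as a final text. The function name `split_with_offsets` and the sample string are invented for this example.

```python
import re


def split_with_offsets(data, sep="$"):
    """Yield (start, end, text) for every text between separators."""
    start = 0
    for match in re.finditer(re.escape(sep), data):
        yield start, match.start(), data[start:match.start()]
        start = match.end()
    # Text after the last separator; nothing is yielded if data ends with one.
    if start < len(data):
        yield start, len(data), data[start:]


# Made-up sample data with no trailing separator.
data = "first text.$second text.$third text."
for start, end, text in split_with_offsets(data):
    print(start, end, text)
```

Because each result carries its start and end offsets, `data[start:end]` always reproduces the text, which is the slicing behaviour asked for.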
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2012-07-05 07:33 -0700 |
| Message-ID | <mailman.1828.1341498810.4697.python-list@python.org> |
| In reply to | #24899 |
Dear Peter,
That is a nice one. I am thinking if I can write "for lines in f" sort of code that is easy but then how to find out the slices then, btw do you know in any case may I convert the index position of file to the list position provided I am writing the list for the same file we are reading.
Best Regards,
Subhabrata.
On Thursday, July 5, 2012 1:00:12 PM UTC+5:30, Peter Otten wrote:
> [...]
> Then use re.finditer():
>
> start = 0
> for match in re.finditer(r"\$", data):
>     end = match.start()
>     print(start, end)
>     print(data[start:end])
>     start = match.end()
>
> This will omit the last text. The simplest fix is to put another "$"
> separator at the end of your data.
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-07-06 09:35 +0200 |
| Message-ID | <mailman.1854.1341560131.4697.python-list@python.org> |
| In reply to | #24922 |
subhabangalore@gmail.com wrote:
[Please don't top-post]
>> start = 0
>> for match in re.finditer(r"\$", data):
>>     end = match.start()
>>     print(start, end)
>>     print(data[start:end])
>>     start = match.end()
> That is a nice one. I am thinking if I can write "for lines in f" sort of
> code that is easy but then how to find out the slices then,
You have to keep track both of the offset of the line and the offset within
the line:
    import re

    def offsets(lines, pos=0):
        for line in lines:
            yield pos, line
            pos += len(line)

    start = 0
    for line_start, line in offsets(lines):
        for pos, part in offsets(re.split(r"(\$)", line), line_start):
            if part == "$":
                print(start, pos)
                start = pos + 1
(untested code, I'm assuming that the file ends with a $)
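The line-by-line bookkeeping can be tried out with a small self-contained sketch. Here `io.StringIO` stands in for the open file, `text_spans` is a hypothetical wrapper that collects the spans instead of printing them, and the sample text is made up. The offsets are character offsets into the text as read, so they match file positions only when reading does not change the length of the data (for example through newline translation or a multi-byte encoding).

```python
import io
import re


def offsets(parts, pos=0):
    """Yield (offset, part) pairs, starting the count at pos."""
    for part in parts:
        yield pos, part
        pos += len(part)


def text_spans(lines, sep="$"):
    """Return (start, end) offsets of each text terminated by sep."""
    spans = []
    start = 0
    # The capturing group makes re.split keep the separators in its result.
    pattern = "(" + re.escape(sep) + ")"
    for line_start, line in offsets(lines):
        for pos, part in offsets(re.split(pattern, line), line_start):
            if part == sep:
                spans.append((start, pos))
                start = pos + len(sep)
    return spans


# Made-up sample: the second text spans two lines.
sample = "first text.$\nsecond\ntext.$\n"
spans = text_spans(io.StringIO(sample))
print(spans)
for start, end in spans:
    print(repr(sample[start:end]))
```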
> btw do you
> know in any case may I convert the index position of file to the list
> position provided I am writing the list for the same file we are reading.
Use a lookup list with the end positions of the texts and then find the
relevant text with bisect.
>>> import bisect
>>> ends = [10, 20, 50]
>>> filepos = 15
>>> bisect.bisect(ends, filepos)
1 # position 15 belongs to the second text
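The lookup can be wrapped in a small function to make the boundary behaviour explicit. This is a sketch: `text_index` is a hypothetical name, and treating a position equal to an end offset as belonging to the following text is an assumption of this example that simply follows from `bisect.bisect` being `bisect_right`.

```python
import bisect


def text_index(ends, filepos):
    """Return the index of the text that contains position filepos.

    ends must be a sorted list of end offsets, one per text.
    """
    if not 0 <= filepos < ends[-1]:
        raise ValueError("position %d is outside the data" % filepos)
    # bisect_right: a position equal to an end offset maps to the next text.
    return bisect.bisect(ends, filepos)


ends = [10, 20, 50]  # same end offsets as in the interactive example
for filepos in (0, 9, 10, 15, 49):
    print(filepos, "-> text", text_index(ends, filepos))
```

Each lookup is a binary search, so it stays fast even with many texts in one file.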
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2012-07-07 12:54 -0700 |
| Message-ID | <3c4e2ef9-bf7e-4fbc-bf12-6780fdc3e5d4@googlegroups.com> |
| In reply to | #24882 |
On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
> Dear Group,
> [...]
Thanks Peter, but I feel your earlier one was better. I got an interesting one:
[i - 1 for i in range(len(f1)) if f1.startswith('$', i - 1)]
But I am a bit intrigued by another question.
Suppose I say:
file_open=open("/python32/doc1.txt","r")
file=a1.read().lower()
for line in file:
    line_word=line.split()
This works fine. But if I print, everything is printed continuously.
I would like to store the lines in some variable, so that I may print a line of my choice and manipulate them as I choose.
Is there any way out of this problem?
Regards,
Subhabrata Banerjee
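A minimal sketch of the "$"-delimited splitting described in the quoted post, assuming the combined text is already held in one string and that "$" never occurs inside a document. The names `find_boundaries`, `split_documents`, and `combined` are hypothetical and do not come from the thread.

```python
def find_boundaries(text, marker="$"):
    """Return the index of every occurrence of a one-character marker."""
    return [i for i, ch in enumerate(text) if ch == marker]


def split_documents(text, marker="$"):
    """Split text on marker and drop empty or whitespace-only pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]


# Shortened stand-in for the three news items (assumed sample data).
combined = "First document.$\nSecond document.$\nThird document."

boundaries = find_boundaries(combined)   # positions of the "$" markers
documents = split_documents(combined)    # one string per document

for number, document in enumerate(documents, start=1):
    print(number, document)
```

Splitting on the marker avoids collecting indices and slicing by hand; the index list is kept only in case the positions themselves are needed.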
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2012-07-07 16:51 -0400 |
| Message-ID | <mailman.1901.1341694286.4697.python-list@python.org> |
| In reply to | #25031 |
On Sat, 7 Jul 2012 12:54:16 -0700 (PDT), subhabangalore@gmail.com
declaimed the following in gmane.comp.python.general:
> But I am bit intrigued with another question,
>
> suppose I say:
> file_open=open("/python32/doc1.txt","r")
> file=a1.read().lower()
> for line in file:
> line_word=line.split()
>
> This works fine. But if I print it would be printed continuously.
"This works fine" -- Really?
1) Why are you storing data files in the install directory of your
Python interpreter?
2) "a1" is undefined -- you should get an exception on that line which
makes the following irrelevant; replacing "a1" with "file_open" leads
to...
3) "file" is a) a predefined function in Python, which you have just
shadowed and b) a poor name for a string containing the contents of a
file
4) "for line in file", since "file" is a string, will iterate over EACH
CHARACTER, meaning (since there is nothing to split) that "line_word" is
also just a single character.
for line in file.split("\n"):
will split the STRING into logical lines (assuming a new-line character
splits the lines) and permit the subsequent split to pull out wordS
("line_word" is misleading, as it will contain a LIST of words from the
line).
> I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
> Is there any way out to this problem?
>
>
> Regards,
> Subhabrata Banerjee
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
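The corrections above could be put together roughly as follows. This is only a sketch: `load_word_lists` and the sample file contents are hypothetical, and a temporary file stands in for doc1.txt. It keeps one list of words per line, so any line can be printed or manipulated by index. (In Python 3 there is no longer a built-in called `file`, so the shadowing concern is mainly a Python 2 issue, but the naming point still holds.)

```python
import os
import tempfile


def load_word_lists(path):
    """Return one list of lowercased words per line of the file at path."""
    with open(path, "r") as source:
        text = source.read().lower()
    # splitlines() breaks the string into lines; iterating over the
    # string directly would yield single characters instead.
    return [line.split() for line in text.splitlines()]


# Demonstration on a throwaway file (assumed sample contents).
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "doc1.txt")
    with open(sample_path, "w") as sample:
        sample.write("A Mumbai-bound aircraft\nThe discovery of a particle\n")
    word_lists = load_word_lists(sample_path)

print(word_lists[1])  # words of the second line only
```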
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2012-07-07 22:42 -0700 |
| Message-ID | <09adb3cf-f3f2-4acc-b561-a36dcf15ecc7@googlegroups.com> |
| In reply to | #25032 |
On Sunday, July 8, 2012 2:21:14 AM UTC+5:30, Dennis Lee Bieber wrote:
> On Sat, 7 Jul 2012 12:54:16 -0700 (PDT), subhabangalore@gmail.com
> declaimed the following in gmane.comp.python.general:
>
> > But I am bit intrigued with another question,
> >
> > suppose I say:
> > file_open=open("/python32/doc1.txt","r")
> > file=a1.read().lower()
> > for line in file:
> > line_word=line.split()
> >
> > This works fine. But if I print it would be printed continuously.
>
> "This works fine" -- Really?
>
> 1) Why are you storing data files in the install directory of your
> Python interpreter?
>
> 2) "a1" is undefined -- you should get an exception on that line which
> makes the following irrelevant; replacing "a1" with "file_open" leads
> to...
>
> 3) "file" is a) a predefined function in Python, which you have just
> shadowed and b) a poor name for a string containing the contents of a
> file
>
> 4) "for line in file", since "file" is a string, will iterate over EACH
> CHARACTER, meaning (since there is nothing to split) that "line_word" is
> also just a single character.
>
> for line in file.split("\n"):
>
> will split the STRING into logical lines (assuming a new-line character
> splits the lines) and permit the subsequent split to pull out wordS
> ("line_word" is misleading, as to will contain a LIST of words from the
> line).
>
> > I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
> > Is there any way out to this problem?
> >
> >
> > Regards,
> > Subhabrata Banerjee
> --
> Wulfraed Dennis Lee Bieber AF6VN
> wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it:
file_open=open("/python32/doc1.txt","r")
for line in file_open:
    line_word=line.split()
    print (line_word)
To store them, the best way seems to be to create an empty list and append to it. But is there an alternative method? For huge data this becomes tough, as the list becomes huge. Is there any way the lines may be assigned to variables instead?
Regards,
Subhabrata Banerjee.
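One possible answer to the memory concern, offered as a sketch and not as the only approach: a generator can hand out the words of one line at a time, so the whole file never has to sit in a list. The names `iter_word_lists` and `words_on_line` are hypothetical, and a temporary file stands in for the real data.

```python
import os
import tempfile


def iter_word_lists(path):
    """Yield the words of each line in turn, reading the file lazily."""
    with open(path, "r") as source:
        for line in source:
            yield line.split()


def words_on_line(path, wanted):
    """Return the words of line number wanted (0-based), or None if absent."""
    for number, words in enumerate(iter_word_lists(path)):
        if number == wanted:
            return words
    return None


# Demonstration on a throwaway file (assumed sample contents).
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "doc1.txt")
    with open(sample_path, "w") as sample:
        sample.write("one two\nthree four five\nsix\n")
    third = words_on_line(sample_path, 2)
    missing = words_on_line(sample_path, 10)

print(third, missing)
```

The trade-off is that fetching a single line means rereading the file from the start, so a plain list is still the simpler choice when the data fits in memory.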
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-07-08 18:03 +1000 |
| Message-ID | <mailman.1910.1341734607.4697.python-list@python.org> |
| In reply to | #25035 |
On Sun, Jul 8, 2012 at 3:42 PM, <subhabangalore@gmail.com> wrote:
> Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
>
> file_open=open("/python32/doc1.txt","r")
> for line in file_open:
> line_word=line.split()
> print (line_word)
Yep. I'd be inclined to rename file_open to something that says what
the file _is_, and you may want to look into the 'with' statement to
guarantee timely closure of the file, but that's a way to do it.
Also, as has already been mentioned: keeping your data files in the
Python binaries directory isn't usually a good idea. More common to
keep them in the same directory as your script, which would mean that
you don't need a path on it at all.
ChrisA
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2012-07-08 10:05 -0700 |
| Message-ID | <11832de7-a064-494e-b3e8-32a2f15a6902@googlegroups.com> |
| In reply to | #25038 |
On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
> On Sun, Jul 8, 2012 at 3:42 PM, <subhabangalore@gmail.com> wrote:
> > Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
> >
> > file_open=open("/python32/doc1.txt","r")
> > for line in file_open:
> > line_word=line.split()
> > print (line_word)
>
> Yep. I'd be inclined to rename file_open to something that says what
> the file _is_, and you may want to look into the 'with' statement to
> guarantee timely closure of the file, but that's a way to do it.
>
> Also, as has already been mentioned: keeping your data files in the
> Python binaries directory isn't usually a good idea. More common to
> keep them in the same directory as your script, which would mean that
> you don't need a path on it at all.
>
> ChrisA
Dear Chris,
No file path! Amazing. I did not know that. I would like to see one small example, please.
Btw, some earlier post said that what line.split() does, converting a line into a bag of words, can be done with power(), but I did not find it. Can anyone help? I do close files, do not worry. I will try the new style.
Regards,
Subha
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-07-09 03:17 +1000 |
| Message-ID | <mailman.1922.1341767824.4697.python-list@python.org> |
| In reply to | #25045 |
On Mon, Jul 9, 2012 at 3:05 AM, <subhabangalore@gmail.com> wrote:
> On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
>> On Sun, Jul 8, 2012 at 3:42 PM, <subhabangalore@gmail.com> wrote:
>> > file_open=open("/python32/doc1.txt","r")
>> Also, as has already been mentioned: keeping your data files in the
>> Python binaries directory isn't usually a good idea. More common to
>> keep them in the same directory as your script, which would mean that
>> you don't need a path on it at all.
> No file path! Amazing. I do not know I like to know one small example please.
open("doc1.txt","r")
Python will look for a file called doc1.txt in the directory you run
the script from (which is often going to be the same directory as your
.py program).
> Btw, some earlier post said, line.split() to convert line into bag of words can be done with power(), but I did not find it, if any one can help. I do close files do not worry. New style I'd try.
I don't know what power() function you're talking about, and can't
find it in the previous posts; the nearest I can find is a post from
Ranting Rick which says a lot of guff that you can ignore. (Rick is a
professional troll. Occasionally he says something useful and
courteous; more often it's one or the other, or neither.)
As to the closing of files: There are a few narrow issues that make it
worth using the 'with' statement, such as exceptions; mostly, it's
just a good habit to get into. If you ignore it, your file will
*usually* be closed fairly soon after you stop referencing it, but
there's no guarantee. (Someone else will doubtless correct me if I'm
wrong, but I'm pretty sure Python guarantees to properly flush and
close on exit, but not necessarily before.)
ChrisA
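A small sketch of the exception case mentioned above, assuming a throwaway file in place of doc1.txt and a deliberately raised error standing in for a real processing failure. The variable names are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "doc1.txt")
    with open(sample_path, "w") as sample:
        sample.write("some words here\n")

    try:
        with open(sample_path, "r") as source:
            first_words = source.readline().split()
            # Simulated failure partway through processing.
            raise ValueError("simulated failure while processing")
    except ValueError:
        pass

    # The with block closed the file even though an exception escaped it.
    closed_after_error = source.closed

print(first_words, closed_after_error)
```

Without the with block, the same failure would leave the file open until the object happened to be cleaned up.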
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-07-08 14:17 -0400 |
| Message-ID | <roy-249FE5.14174108072012@news.panix.com> |
| In reply to | #25047 |
In article <mailman.1922.1341767824.4697.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> open("doc1.txt","r")
>
> Python will look for a file called doc1.txt in the directory you run
> the script from (which is often going to be the same directory as your
> .py program).
Well, to pick a nit, the file will be looked for in the current working
directory. This may or may not be the directory you ran your script
from. Your script could have executed chdir() between the time you
started it and you tried to open the file.
To pick another nit, it's misleading to say, "Python will look for...".
This implies that Python somehow gets involved in pathname resolution,
when it doesn't. Python just passes paths to the operating system as
opaque strings, and the OS does all the magic of figuring out what that
string means.
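The first nit can be demonstrated with a short sketch. It assumes two temporary directories that each hold a file named doc1.txt with different contents; the same relative name then opens a different file depending on the current working directory.

```python
import os
import tempfile

original_cwd = os.getcwd()

with tempfile.TemporaryDirectory() as first, \
        tempfile.TemporaryDirectory() as second:
    # Create a doc1.txt in each directory, with distinguishable contents.
    for folder, label in ((first, "first"), (second, "second")):
        with open(os.path.join(folder, "doc1.txt"), "w") as sample:
            sample.write(label)
    try:
        os.chdir(first)
        with open("doc1.txt") as source:
            seen_first = source.read()
        os.chdir(second)
        with open("doc1.txt") as source:   # same name, different file
            seen_second = source.read()
    finally:
        # Restore the working directory before the temp dirs are removed.
        os.chdir(original_cwd)

print(seen_first, seen_second)
```

When the location matters, building an absolute path with os.path.join() removes the dependence on the working directory.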
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-07-09 07:54 +1000 |
| Message-ID | <mailman.1930.1341784495.4697.python-list@python.org> |
| In reply to | #25048 |
On Mon, Jul 9, 2012 at 4:17 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.1922.1341767824.4697.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
>> open("doc1.txt","r")
>>
>> Python will look for a file called doc1.txt in the directory you run
>> the script from (which is often going to be the same directory as your
>> .py program).
>
> Well, to pick a nit, the file will be looked for in the current working
> directory. This may or may not be the directory you ran your script
> from. Your script could have executed chdir() between the time you
> started it and you tried to open the file.
>
> To pick another nit, it's misleading to say, "Python will look for...".
> This implies that Python somehow gets involved in pathname resolution,
> when it doesn't. Python just passes paths to the operating system as
> opaque strings, and the OS does all the magic of figuring out what that
> string means.
Two perfectly accurate nitpicks. And of course, there's a million and
one other things that could happen in between, too, including
possibilities of the current directory not even existing and so on. I
merely oversimplified in the hopes of giving a one-paragraph
explanation of what it means to not put a path name in your open()
call :) It's like the difference between reminder text on a Magic: The
Gathering card and the actual entries in the Comprehensive Rules.
Perfect example is the "Madness" ability - the reminder text explains
the ability, but uses language that actually is quite incorrect. It's
a better explanation, though.
Am I overanalyzing this? Yeah, probably...
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-07-09 00:57 +0000 |
| Message-ID | <4ffa2c7b$0$29988$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #25056 |
On Mon, 09 Jul 2012 07:54:47 +1000, Chris Angelico wrote:
> It's like
> the difference between reminder text on a Magic: The Gathering card and
> the actual entries in the Comprehensive Rules. Perfect example is the
> "Madness" ability - the reminder text explains the ability, but uses
> language that actually is quite incorrect. It's a better explanation,
> though.
Hang on, you say that an explanation which is "quite incorrect" is
*better* than one which is correct?
I can see why they call the card "Madness".
:-P
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-07-09 18:41 +1000 |
| Message-ID | <mailman.1936.1341823291.4697.python-list@python.org> |
| In reply to | #25057 |
On Mon, Jul 9, 2012 at 10:57 AM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Mon, 09 Jul 2012 07:54:47 +1000, Chris Angelico wrote:
>
>> It's like
>> the difference between reminder text on a Magic: The Gathering card and
>> the actual entries in the Comprehensive Rules. Perfect example is the
>> "Madness" ability - the reminder text explains the ability, but uses
>> language that actually is quite incorrect. It's a better explanation,
>> though.
>
> Hang on, you say that an explanation which is "quite incorrect" is
> *better* than one which is correct?
>
> I can see why they call the card "Madness".
>
> :-P
Agreed about the ability name :)
The fact is, though, that when you're explaining something, it's often
better to have a one-sentence explanation that's not quite technically
accurate than two paragraphs that explain it in multiple steps and are
opaque to anyone who doesn't have the rules-lawyer mind. (I happen to
have such a mind. It's not always a good thing, but it makes me a
better debugger.)
Does it really hurt to anthropomorphize and say that "Python looks for
modules in the directories in sys.path" instead of "Module lookup
consists of iterating over the elements in sys.path [and that's
leaving out the worst-case DFS where you explain THAT in detail],
calling combine_path [or whatever it is] with the element and the
module name, and attempting to stat/open the result"? While your
listener's getting bogged down in unnecessary detail, s/he isn't
grokking the overall purpose of what you're saying.
One option is more accurate. The other is far more helpful.
ChrisA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-07-09 12:24 +0000 |
| Message-ID | <4ffacd61$0$29988$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #25072 |
On Mon, 09 Jul 2012 18:41:28 +1000, Chris Angelico wrote:
> Does it really hurt to anthropomorphize
Don't anthropomorphise computers. They don't like it when you do.
> and say that "Python looks for
> modules in the directories in sys.path" instead of "Module lookup
> consists of iterating blah blah blah yadda watermelon yadda blah".
I don't think so, I often talk about Python looking for files myself.
The intentional stance is an incredibly powerful technique for
understanding behaviour of all sorts of entities, sentient or not, from
DNA to computers to corporations, and even people.
But it does depend on context. Sometimes you need more detail than just
"Python looks". You need to know precisely *how* Python looks, and how it
decides whether it has found or not.
And note that I'm still using the intentional stance.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-07-10 00:47 +1000 |
| Message-ID | <mailman.1949.1341845269.4697.python-list@python.org> |
| In reply to | #25079 |
On Mon, Jul 9, 2012 at 10:24 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> But it does depend on context. Sometimes you need more detail than just
> "Python looks". You need to know precisely *how* Python looks, and how it
> decides whether it has found or not.
Agreed. So, looking back at the original context: A question was posed
that isn't really about Python at all, but more about file systems. I
gave a simple one-sentence answer that omitted heaps of details. It
didn't seem likely that someone confused by path names would be
changing the current directory inside the script, nor that the
distinction of who evaluates a path would be significant (how often
does _anyone_ care whether your path is parsed by Python, by the OS, or
by the underlying file system?).
ChrisA