

Groups > comp.lang.python > #24882 > unrolled thread

Discussion on some Code Issues

Started by: subhabangalore@gmail.com
First post: 2012-07-04 16:21 -0700
Last post: 2012-07-07 22:42 -0700
Articles: 20 on this page of 27, 8 participants



Contents

  Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-04 16:21 -0700
    Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-05 00:02 +0000
    Re: Discussion on some Code Issues Rick Johnson <rantingrickjohnson@gmail.com> - 2012-07-04 17:08 -0700
    Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-04 20:25 -0700
      Re: Discussion on some Code Issues Peter Otten <__peter__@web.de> - 2012-07-05 09:30 +0200
        Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-05 07:33 -0700
        Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-05 07:33 -0700
          Re: Discussion on some Code Issues Peter Otten <__peter__@web.de> - 2012-07-06 09:35 +0200
    Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 12:54 -0700
      Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-07 16:51 -0400
        Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 22:42 -0700
          Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-08 18:03 +1000
            Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-08 10:05 -0700
              Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 03:17 +1000
                Re: Discussion on some Code Issues Roy Smith <roy@panix.com> - 2012-07-08 14:17 -0400
                  Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 07:54 +1000
                    Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-09 00:57 +0000
                      Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 18:41 +1000
                        Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-09 12:24 +0000
                          Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-10 00:47 +1000
                      Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-09 12:49 -0400
                Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-16 07:17 -0700
                Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-16 07:17 -0700
              Re: Discussion on some Code Issues MRAB <python@mrabarnett.plus.com> - 2012-07-08 19:27 +0100
            Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-08 10:05 -0700
          Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-08 15:07 -0400
        Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 22:42 -0700



#24882 — Discussion on some Code Issues

From: subhabangalore@gmail.com
Date: 2012-07-04 16:21 -0700
Subject: Discussion on some Code Issues
Message-ID: <a4f0e2a9-cc3b-4081-beb9-82f229e95ba1@googlegroups.com>
Dear Group,

I am Sri Subhabrata Banerjee, writing from Gurgaon, India, to discuss some coding issues. If anyone in this learned room can shed some light, it would be very helpful.

I have to code a bunch of documents which are combined together.
For example:

1) A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.
2) The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.
3) A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

The task is to separate the documents on the fly and to parse each of the documents with a definite set of rules. 

Now, the way I am processing is: 
I am clubbing all the documents together, as,

A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection. A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

But they are separated by a tag set, like, 
A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.$
The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.$
A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

To detect the document boundaries, I am splitting them into a bag of words and using a simple for loop:
for i in range(len(bag_words)):
    if bag_words[i] == "$":
        print(bag_words[i], i)

There is no issue here; I am segmenting it nicely. I am using an annotated corpus, so I am applying parse rules.

The confusion comes next.

As per my problem statement, the size of the file (of documents combined together) won't increase on the fly. So, just to support all kinds of combinations, I am appending the “I” values to a list, taking its length, and using slices. It works perfectly. My question is: is there a smarter way to achieve this? And a curious question: if the documents arrive on the fly with no preprocessed tag set like “$”, how may I do it? With a bunch of documents and no EOF markers, isn't it a classification problem?
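
The index-and-slice approach described above might look roughly like the following. This is only a sketch under assumptions: `split_on_marker` is a hypothetical helper name, `bag_words` is taken to be a list of whitespace-separated tokens, and the "$" marker is assumed to stand as a token of its own (in the sample text it is glued to the preceding word, which a plain split() would not separate).

```python
def split_on_marker(bag_words, marker="$"):
    """Return one list of words per document, using marker tokens as boundaries."""
    # Positions of every marker token, i.e. the index values collected in a list.
    boundaries = [i for i, word in enumerate(bag_words) if word == marker]
    documents = []
    start = 0
    # Adding len(bag_words) as a final boundary keeps the last document,
    # which has no marker after it.
    for end in boundaries + [len(bag_words)]:
        documents.append(bag_words[start:end])
        start = end + 1  # skip the marker itself
    return documents


bag_words = "first document here $ second one $ third".split()
documents = split_on_marker(bag_words)
```

Each entry of `documents` is then an ordinary list that can be handed to the parse rules on its own.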

There is no question about parsing; it seems I am achieving it independently of the length of the document.

Can anyone in the group suggest how well I am dealing with the problem, and which portions should be improved and how?

Thanking You in Advance,

Best Regards,
Subhabrata Banerjee. 



#24883

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-07-05 00:02 +0000
Message-ID: <4ff4d9a3$0$29864$c3e8da3$5496439d@news.astraweb.com>
In reply to: #24882
On Wed, 04 Jul 2012 16:21:46 -0700, subhabangalore wrote:

[...]
> I got to code a bunch of documents  which are combined together.
[...]
> The task is to separate the documents on the fly and to parse each of
> the documents with a definite set of rules.
> 
> Now, the way I am processing is:
> I am clubbing all the documents together, as,
[...]
> But they are separated by a tag set
[...] 
> To detect the document boundaries,

Let me see if I understand your problem.

You have a bunch of documents. You stick them all together into one 
enormous lump. And then you try to detect the boundaries between one file 
and the next within the enormous lump.

Why not just process each file separately? A simple for loop over the 
list of files, before consolidating them into one giant file, will avoid 
all the difficulty of trying to detect boundaries within files.

Instead of:

merge(output_filename,  list_of_files)
for word in parse(output_filename):
    if boundary_detected: do_something()
    process(word)

Do this instead:

for filename in  list_of_files:
    do_something()
    for word in parse(filename):
        process(word)
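
A runnable version of the second loop could look like the sketch below. The names `parse` and `process_files` are hypothetical stand-ins, the parser simply splits on whitespace, and the two temporary files exist only so the example can run on its own.

```python
import os
import tempfile


def parse(filename):
    """Yield the whitespace-separated words of one file (hypothetical parser)."""
    with open(filename, encoding="utf-8") as handle:
        for line in handle:
            for word in line.split():
                yield word


def process_files(list_of_files):
    """Return {filename: [words]}; each file is its own document."""
    results = {}
    for filename in list_of_files:
        # The file boundary is the document boundary, so no separator is needed.
        results[filename] = list(parse(filename))
    return results


# Demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, text in [("a.txt", "price is $ 5"), ("b.txt", "second document")]:
        path = os.path.join(folder, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    words_by_file = process_files(paths)
    word_lists = [words_by_file[p] for p in paths]
```

Note that the "$" inside the first file is just another word here and cannot be mistaken for a boundary.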


> I am splitting them into a bag of
> words and using a simple for loop as, 
> for i in range(len(bag_words)):
>         if bag_words[i]=="$":
>             print (bag_words[i],i)


What happens if a file already has a $ in it?


> There is no issue. I am segmenting it nicely. I am using annotated
> corpus so applying parse rules.
> 
> The confusion comes next,
> 
> As per my problem statement the size of the file (of documents combined
> together) won’t increase on the fly. So, just to support all kinds of
> combinations I am appending in a list the “I” values, taking its length,
> and using slice. Works perfect.

I don't understand this. What sort of combinations do you think you need 
to support? What are "I" values, and why are they important?



-- 
Steven



#24884

From: Rick Johnson <rantingrickjohnson@gmail.com>
Date: 2012-07-04 17:08 -0700
Message-ID: <59b3e7c3-83d8-43a7-9d24-1df682f0a353@j10g2000yqd.googlegroups.com>
In reply to: #24882
On Jul 4, 6:21 pm, subhabangal...@gmail.com wrote:
> [...]
> To detect the document boundaries, I am splitting them into a bag
> of words and using a simple for loop as,
>
> for i in range(len(bag_words)):
>         if bag_words[i]=="$":
>             print (bag_words[i],i)

Ignoring that you are attacking the problem incorrectly: that is a very
poor method of splitting a string, especially since the Python gods
have given you *power* over string objects. But you are going to have
an even greater problem if the string contains a "$" char that you DID
NOT insert :-O. You'd be wise to use a sep that is not likely to be in
the file data. For example: "<SEP>" or "<SPLIT-HERE>". But even that
approach is naive! Why not streamline the entire process and pass a
list of file paths to a custom parser object instead?
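
As a rough illustration of the separator idea, str.join() and str.split() already do the work. The sample texts below are invented for the example, and "<SEP>" is only assumed to be absent from the data, which the sketch checks before joining.

```python
SEPARATOR = "<SEP>"  # hypothetical marker, assumed never to occur in the texts

texts = [
    "The ticket costs $5.",
    "A second document.",
    "A third document.",
]

# Refuse to join if a text already contains the marker.
for text in texts:
    if SEPARATOR in text:
        raise ValueError("separator occurs in the data: %r" % text)

lump = SEPARATOR.join(texts)
recovered = lump.split(SEPARATOR)
```

The "$" in the first text survives the round trip because it is no longer the boundary marker.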



#24893

From: subhabangalore@gmail.com
Date: 2012-07-04 20:25 -0700
Message-ID: <34484d3d-d4c2-463b-8f83-dba57ce0511d@googlegroups.com>
In reply to: #24882
Hi Steven,

It is nice to see your post. Your posts are good and I have learnt many things from them. "I" is the index of the loop.

To clarify: I thought of doing "import os" and processing the files in a loop, but that is not my problem statement. I have to make one big lump of text and detect each chunk within it. I am not looping over the line numbers of the file because then I may not be able to take the slices, which I need. I thought of giving re.findall a try, but that does not give me the slices. Slice spreads here. As for the power of strings, I will definitely give it a try.

Happy day ahead.

Regards,
Subhabrata Banerjee.



#24899

From: Peter Otten <__peter__@web.de>
Date: 2012-07-05 09:30 +0200
Message-ID: <mailman.1814.1341473418.4697.python-list@python.org>
In reply to: #24893
subhabangalore@gmail.com wrote:

> Looping over the line number of file I am not using because I may not be
> able to take the slices-this I need. I thought to give re.findall a try
> but that is not giving me the slices. Slice spreads here. The power issue
> of string! I would definitely give it a try. Happy Day Ahead Regards,
> Subhabrata Banerjee.

Then use re.finditer():

import re

start = 0
for match in re.finditer(r"\$", data):
    end = match.start()      # offset of the "$" separator
    print(start, end)
    print(data[start:end])
    start = match.end()      # the next text begins after the separator

This will omit the last text. The simplest fix is to put another "$" 
separator at the end of your data.
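
If appending a trailing "$" is not convenient, another option is to emit whatever follows the last separator as the final text. A minimal sketch, where `iter_slices` is a hypothetical helper name and the sample string is made up:

```python
import re


def iter_slices(data, separator="$"):
    """Yield (start, end) offsets of each text between separators."""
    start = 0
    for match in re.finditer(re.escape(separator), data):
        yield start, match.start()
        start = match.end()
    # Whatever follows the last separator is the final text.
    if start < len(data):
        yield start, len(data)


data = "first text$second text$third text"
slices = list(iter_slices(data))
texts = [data[s:e] for s, e in slices]
```

Here re.escape() keeps the "$" from being read as the end-of-string anchor, which is what the backslash does in the pattern above.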



#24922

From: subhabangalore@gmail.com
Date: 2012-07-05 07:33 -0700
Message-ID: <996c1d6a-f297-401e-94a4-99be159a0801@googlegroups.com>
In reply to: #24899
Dear Peter,
That is a nice one. I am wondering whether I can write "for lines in f" sort of code, which is easy, but then how would I find the slices? By the way, do you know whether I can convert an index position in the file to a list position, provided I am building the list from the same file we are reading?

Best Regards,
Subhabrata. 



#24960

From: Peter Otten <__peter__@web.de>
Date: 2012-07-06 09:35 +0200
Message-ID: <mailman.1854.1341560131.4697.python-list@python.org>
In reply to: #24922
subhabangalore@gmail.com wrote:

[Please don't top-post]

>> start = 0
>> for match in re.finditer(r"\$", data):
>>     end = match.start()
>>     print(start, end)
>>     print(data[start:end])
>>     start = match.end()

> That is a nice one. I am thinking if I can write "for lines in f" sort of
> code that is easy but then how to find out the slices then, 

You have to keep track both of the offset of the line and the offset within 
the line:

def offsets(lines, pos=0):
    for line in lines:
        yield pos, line
        pos += len(line)

start = 0
for line_start, line in offsets(lines):
    for pos, part in offsets(re.split(r"(\$)", line), line_start):
        if part == "$":
            print(start, pos)
            start = pos + 1

(untested code, I'm assuming that the file ends with a $)
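
For anyone who wants to try the idea, here is a self-contained variant of the same approach. It collects the offsets in a list instead of printing them; `find_slices` is a hypothetical wrapper name, and io.StringIO stands in for an open text file. The offsets are character offsets into the text, which is an assumption that matters if the file is later accessed by byte position.

```python
import io
import re


def offsets(items, pos=0):
    """Yield (offset, item) pairs, where offset counts characters from pos."""
    for item in items:
        yield pos, item
        pos += len(item)


def find_slices(lines):
    """Return (start, end) character offsets of the texts that end with '$'."""
    slices = []
    start = 0
    for line_start, line in offsets(lines):
        # The capturing group makes re.split() keep the "$" parts in the result.
        for pos, part in offsets(re.split(r"(\$)", line), line_start):
            if part == "$":
                slices.append((start, pos))
                start = pos + 1
    return slices


data = "first\ntext$second$\nthird$\n"
slices = find_slices(io.StringIO(data))
texts = [data[s:e] for s, e in slices]
```

A text may span several lines, as the first one does here, and the newline after a separator becomes part of the following text.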

> btw do you
> know in any case may I convert the index position of file to the list
> position provided I am writing the list for the same file we are reading.

Use a lookup list with the end positions of the texts and then find the 
relevant text with bisect.

>>> import bisect
>>> ends = [10, 20, 50]
>>> filepos = 15
>>> bisect.bisect(ends, filepos)
1 # position 15 belongs to the second text
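
Put together with the "$" separators, the lookup list could be built from the data itself. This is a sketch with made-up sample data; `text_index` is a hypothetical helper, and it uses bisect.bisect_left rather than bisect.bisect so that the position of a "$" is attributed to the text it closes.

```python
import bisect

data = "first text$second text$third text$"
# End offset of each text, i.e. the position of its closing "$".
ends = [i for i, char in enumerate(data) if char == "$"]


def text_index(position):
    """Return the 0-based index of the text containing the given offset."""
    # bisect_left, so that a separator's own position maps to the text it closes.
    return bisect.bisect_left(ends, position)


lookups = [text_index(p) for p in (0, 9, 10, 11, 30)]
```

The returned index can be used directly on a list of texts built from the same data.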



#25031

From: subhabangalore@gmail.com
Date: 2012-07-07 12:54 -0700
Message-ID: <3c4e2ef9-bf7e-4fbc-bf12-6780fdc3e5d4@googlegroups.com>
In reply to: #24882

Thanks Peter, but I feel your earlier one was better. I also found an interesting one:
[i - 1 for i in range(len(f1)) if f1.startswith('$', i - 1)]
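
A minimal sketch of another way to locate the separators, assuming the combined text is held in a string (the sample `text` below is made up) and that `$` never occurs inside a document. One thing to note about the comprehension above: a `$` in the very last position is reported as -1, because `startswith('$', -1)` inspects the final character. Using `enumerate` avoids that quirk.

```python
# Hypothetical sample; the real data would come from the combined file.
text = "first document.$second document.$third document."

# Index of every '$' separator, counted from the start of the string.
separator_positions = [i for i, ch in enumerate(text) if ch == "$"]

# If only the documents themselves are needed, split() does the slicing.
documents = text.split("$")

print(separator_positions)
print(documents)
```

If the positions are only wanted for slicing, `split("$")` alone may be enough.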

But I am a bit intrigued by another question.

suppose I say:
  file_open=open("/python32/doc1.txt","r")
  file=a1.read().lower()
  for line in file:
       line_word=line.split()

This works fine. But if I print, everything is printed continuously.
I would like to store the lines in some variable, so that I may print the line of my choice and manipulate them as I choose.
Is there any way to solve this problem?


Regards,
Subhabrata Banerjee



#25032

From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date: 2012-07-07 16:51 -0400
Message-ID: <mailman.1901.1341694286.4697.python-list@python.org>
In reply to: #25031
On Sat, 7 Jul 2012 12:54:16 -0700 (PDT), subhabangalore@gmail.com
declaimed the following in gmane.comp.python.general:

> But I am bit intrigued with another question,
> 
> suppose I say:
>   file_open=open("/python32/doc1.txt","r")
>   file=a1.read().lower()
>   for line in file:
>        line_word=line.split()
> 
> This works fine. But if I print it would be printed continuously.

	"This works fine" -- Really?

1)	Why are you storing data files in the install directory of your
Python interpreter?

2)	"a1" is undefined -- you should get an exception on that line which
makes the following irrelevant; replacing "a1" with "file_open" leads
to...

3)	"file" is a) a predefined function in Python, which you have just
shadowed and b) a poor name for a string containing the contents of a
file

4) 	"for line in file", since "file" is a string, will iterate over EACH
CHARACTER, meaning (since there is nothing to split) that "line_word" is
also just a single character.

	for line in file.split("\n"):

will split the STRING into logical lines (assuming a new-line character
splits the lines) and permit the subsequent split to pull out wordS
("line_word" is misleading, as it will contain a LIST of words from the
line).
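
A small sketch of the difference described in point 4, using a made-up string `contents` in place of the real file data. It uses `splitlines()` rather than `split("\n")`, which is a substitution on my part: it avoids an empty final entry when the text ends with a newline.

```python
# Hypothetical file contents standing in for the result of read().lower().
contents = "first line here\nsecond line\n"

# Iterating over a string yields one character at a time.
chars = [ch for ch in contents][:5]

# splitlines() yields logical lines; split() then yields the words of each.
words_per_line = [line.split() for line in contents.splitlines()]

print(chars)
print(words_per_line)
```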

> I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
> Is there any way out to this problem?
> 
> 
> Regards,
> Subhabrata Banerjee
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/



#25035

From: subhabangalore@gmail.com
Date: 2012-07-07 22:42 -0700
Message-ID: <09adb3cf-f3f2-4acc-b561-a36dcf15ecc7@googlegroups.com>
In reply to: #25032
On Sunday, July 8, 2012 2:21:14 AM UTC+5:30, Dennis Lee Bieber wrote:
> On Sat, 7 Jul 2012 12:54:16 -0700 (PDT), subhabangalore@gmail.com
> declaimed the following in gmane.comp.python.general:
> 
> > But I am bit intrigued with another question,
> > 
> > suppose I say:
> >   file_open=open("/python32/doc1.txt","r")
> >   file=a1.read().lower()
> >   for line in file:
> >        line_word=line.split()
> > 
> > This works fine. But if I print it would be printed continuously.
> 
> 	"This works fine" -- Really?
> 
> 1)	Why are you storing data files in the install directory of your
> Python interpreter?
> 
> 2)	"a1" is undefined -- you should get an exception on that line which
> makes the following irrelevant; replacing "a1" with "file_open" leads
> to...
> 
> 3)	"file" is a) a predefined function in Python, which you have just
> shadowed and b) a poor name for a string containing the contents of a
> file
> 
> 4) 	"for line in file", since "file" is a string, will iterate over EACH
> CHARACTER, meaning (since there is nothing to split) that "line_word" is
> also just a single character.
> 
> 	for line in file.split("\n"):
> 
> will split the STRING into logical lines (assuming a new-line character
> splits the lines) and permit the subsequent split to pull out wordS
> ("line_word" is misleading, as to will contain a LIST of words from the
> line).
> 
> > I like to store in some variable,so that I may print line of my choice and manipulate them at my choice.
> > Is there any way out to this problem?
> > 
> > 
> > Regards,
> > Subhabrata Banerjee
> -- 
> 	Wulfraed                 Dennis Lee Bieber         AF6VN
>         wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,

file_open=open("/python32/doc1.txt","r")
for line in file_open:
         line_word=line.split()
         print (line_word)

To store them, the best way seems to be to create an empty list and append to it. But is there
an alternative method? For huge data this becomes tough, as the list itself becomes huge. Is there any way the lines may be assigned to variables instead?
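
One possible approach to the question above, offered as a sketch rather than a definitive answer: read the file lazily with a generator and keep only the lines that are actually wanted. The helper name `iter_word_lists`, the sample file, and the set `wanted` are all hypothetical.

```python
import os
import tempfile

def iter_word_lists(path):
    """Yield (line_number, words) pairs one line at a time."""
    with open(path, "r") as handle:
        for number, line in enumerate(handle, start=1):
            yield number, line.split()

# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha beta\ngamma\ndelta epsilon zeta\n")
    sample_path = tmp.name

# Keep only the lines of interest instead of the whole file.
wanted = {1, 3}
selected = {n: words for n, words in iter_word_lists(sample_path) if n in wanted}
os.remove(sample_path)
print(selected)
```

Only the selected lines stay in memory; the rest are discarded as soon as they have been read.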

Regards,
Subhabrata Banerjee.



#25038

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-07-08 18:03 +1000
Message-ID: <mailman.1910.1341734607.4697.python-list@python.org>
In reply to: #25035
On Sun, Jul 8, 2012 at 3:42 PM,  <subhabangalore@gmail.com> wrote:
> Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
>
> file_open=open("/python32/doc1.txt","r")
> for line in file_open:
>          line_word=line.split()
>          print (line_word)

Yep. I'd be inclined to rename file_open to something that says what
the file _is_, and you may want to look into the 'with' statement to
guarantee timely closure of the file, but that's a way to do it.

Also, as has already been mentioned: keeping your data files in the
Python binaries directory isn't usually a good idea. More common to
keep them in the same directory as your script, which would mean that
you don't need a path on it at all.

ChrisA



#25045

From: subhabangalore@gmail.com
Date: 2012-07-08 10:05 -0700
Message-ID: <11832de7-a064-494e-b3e8-32a2f15a6902@googlegroups.com>
In reply to: #25038
On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
> On Sun, Jul 8, 2012 at 3:42 PM,  <subhabangalore@gmail.com> wrote:
> > Thanks for pointing out the mistakes. Your points are right. So I am trying to revise it,
> >
> > file_open=open("/python32/doc1.txt","r")
> > for line in file_open:
> >          line_word=line.split()
> >          print (line_word)
> 
> Yep. I'd be inclined to rename file_open to something that says what
> the file _is_, and you may want to look into the 'with' statement to
> guarantee timely closure of the file, but that's a way to do it.
> 
> Also, as has already been mentioned: keeping your data files in the
> Python binaries directory isn't usually a good idea. More common to
> keep them in the same directory as your script, which would mean that
> you don't need a path on it at all.
> 
> ChrisA

Dear Chris,
No file path! Amazing. I did not know that; I would like to see one small example, please.
Btw, some earlier post said that what line.split() does (converting a line into a bag of words) can also be done with power(), but I could not find it. Can anyone help? I do close my files, do not worry. I will try the new style.

Regards,
Subha



#25047

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-07-09 03:17 +1000
Message-ID: <mailman.1922.1341767824.4697.python-list@python.org>
In reply to: #25045
On Mon, Jul 9, 2012 at 3:05 AM,  <subhabangalore@gmail.com> wrote:
> On Sunday, July 8, 2012 1:33:25 PM UTC+5:30, Chris Angelico wrote:
>> On Sun, Jul 8, 2012 at 3:42 PM,  <subhabangalore@gmail.com> wrote:
>> > file_open=open("/python32/doc1.txt","r")
>> Also, as has already been mentioned: keeping your data files in the
>> Python binaries directory isn't usually a good idea. More common to
>> keep them in the same directory as your script, which would mean that
>> you don't need a path on it at all.
> No file path! Amazing. I do not know I like to know one small example please.

open("doc1.txt","r")

Python will look for a file called doc1.txt in the directory you run
the script from (which is often going to be the same directory as your
.py program).

> Btw, some earlier post said, line.split() to convert line into bag of words can be done with power(), but I did not find it, if any one can help. I do close files do not worry. New style I'd try.

I don't know what power() function you're talking about, and can't
find it in the previous posts; the nearest I can find is a post from
Ranting Rick which says a lot of guff that you can ignore. (Rick is a
professional troll. Occasionally he says something useful and
courteous; more often it's one or the other, or neither.)

As to the closing of files: There are a few narrow issues that make it
worth using the 'with' statement, such as exceptions; mostly, it's
just a good habit to get into. If you ignore it, your file will
*usually* be closed fairly soon after you stop referencing it, but
there's no guarantee. (Someone else will doubtless correct me if I'm
wrong, but I'm pretty sure Python guarantees to properly flush and
close on exit, but not necessarily before.)
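
A short sketch of the exception case mentioned above, assuming a throwaway temporary file and a deliberately raised `ValueError` (both invented for illustration). It shows that the file object is closed once the `with` block is left, even when the block exits through an exception.

```python
import os
import tempfile

# Throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("some words here\n")
    sample_path = tmp.name

try:
    with open(sample_path, "r") as doc:
        first_line = doc.readline()
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

# The with block closed the file even though an exception escaped it.
print(doc.closed)
os.remove(sample_path)
```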

ChrisA



#25048

From: Roy Smith <roy@panix.com>
Date: 2012-07-08 14:17 -0400
Message-ID: <roy-249FE5.14174108072012@news.panix.com>
In reply to: #25047
In article <mailman.1922.1341767824.4697.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> open("doc1.txt","r")
> 
> Python will look for a file called doc1.txt in the directory you run
> the script from (which is often going to be the same directory as your
> .py program).

Well, to pick a nit, the file will be looked for in the current working 
directory.  This may or may not be the directory you ran your script 
from.  Your script could have executed chdir() between the time you 
started it and you tried to open the file.
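
To make the working-directory point concrete, here is a minimal sketch. The temporary directory and the file contents are invented for the demonstration; only `os.getcwd()`, `os.chdir()` and a bare file name passed to `open()` matter.

```python
import os
import tempfile

original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as workdir:
    # Create doc1.txt inside the temporary directory using a full path.
    with open(os.path.join(workdir, "doc1.txt"), "w") as out:
        out.write("hello\n")

    # Before chdir(), the bare name is looked up in the original directory.
    # After chdir(), the same bare name refers to the file in workdir.
    os.chdir(workdir)
    try:
        with open("doc1.txt", "r") as doc:
            found = doc.read()
        resolved = os.path.abspath("doc1.txt")
    finally:
        os.chdir(original_cwd)  # restore before the directory is removed

print(found)
print(resolved)
```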

To pick another nit, it's misleading to say, "Python will look for...".  
This implies that Python somehow gets involved in pathname resolution, 
when it doesn't.  Python just passes paths to the operating system as 
opaque strings, and the OS does all the magic of figuring out what that 
string means.



#25056

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-07-09 07:54 +1000
Message-ID: <mailman.1930.1341784495.4697.python-list@python.org>
In reply to: #25048
On Mon, Jul 9, 2012 at 4:17 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.1922.1341767824.4697.python-list@python.org>,
>  Chris Angelico <rosuav@gmail.com> wrote:
>
>> open("doc1.txt","r")
>>
>> Python will look for a file called doc1.txt in the directory you run
>> the script from (which is often going to be the same directory as your
>> .py program).
>
> Well, to pick a nit, the file will be looked for in the current working
> directory.  This may or may not be the directory you ran your script
> from.  Your script could have executed chdir() between the time you
> started it and you tried to open the file.
>
> To pick another nit, it's misleading to say, "Python will look for...".
> This implies that Python somehow gets involved in pathname resolution,
> when it doesn't.  Python just passes paths to the operating system as
> opaque strings, and the OS does all the magic of figuring out what that
> string means.

Two perfectly accurate nitpicks. And of course, there's a million and
one other things that could happen in between, too, including
possibilities of the current directory not even existing and so on. I
merely oversimplified in the hopes of giving a one-paragraph
explanation of what it means to not put a path name in your open()
call :) It's like the difference between reminder text on a Magic: The
Gathering card and the actual entries in the Comprehensive Rules.
Perfect example is the "Madness" ability - the reminder text explains
the ability, but uses language that actually is quite incorrect. It's
a better explanation, though.

Am I overanalyzing this? Yeah, probably...

ChrisA



#25057

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-07-09 00:57 +0000
Message-ID: <4ffa2c7b$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to: #25056
On Mon, 09 Jul 2012 07:54:47 +1000, Chris Angelico wrote:

> It's like
> the difference between reminder text on a Magic: The Gathering card and
> the actual entries in the Comprehensive Rules. Perfect example is the
> "Madness" ability - the reminder text explains the ability, but uses
> language that actually is quite incorrect. It's a better explanation,
> though.

Hang on, you say that an explanation which is "quite incorrect" is 
*better* than one which is correct?

I can see why they call the card "Madness".

:-P



-- 
Steven



#25072

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-07-09 18:41 +1000
Message-ID: <mailman.1936.1341823291.4697.python-list@python.org>
In reply to: #25057
On Mon, Jul 9, 2012 at 10:57 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Mon, 09 Jul 2012 07:54:47 +1000, Chris Angelico wrote:
>
>> It's like
>> the difference between reminder text on a Magic: The Gathering card and
>> the actual entries in the Comprehensive Rules. Perfect example is the
>> "Madness" ability - the reminder text explains the ability, but uses
>> language that actually is quite incorrect. It's a better explanation,
>> though.
>
> Hang on, you say that an explanation which is "quite incorrect" is
> *better* than one which is correct?
>
> I can see why they call the card "Madness".
>
> :-P

Agreed about the ability name :) The fact is, though, that when you're
explaining something, it's often better to have a one-sentence
explanation that's not quite technically accurate than two paragraphs
that explain it in multiple steps and are opaque to anyone who doesn't
have the rules-lawyer mind. (I happen to have such a mind. It's not
always a good thing, but it makes me a better debugger.)

Does it really hurt to anthropomorphize and say that "Python looks for
modules in the directories in sys.path" instead of "Module lookup
consists of iterating over the elements in sys.path [and that's
leaving out the worst-case DFS where you explain THAT in detail],
calling combine_path [or whatever it is] with the element and the
module name, and attempting to stat/open the result"? While your
listener's getting bogged down in unnecessary detail, s/he isn't
grokking the overall purpose of what you're saying.
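
For readers who do want a peek at the lookup, a rough sketch using only the standard library: `sys.path` lists the locations consulted for top-level imports, and `importlib.util.find_spec()` reports where a module would be loaded from. The choice of `json` as the module to look up is arbitrary, and the real import machinery (finders on `sys.meta_path`, caches) is glossed over here.

```python
import importlib.util
import sys

# The directories (and zip files) consulted for top-level imports, in order.
for entry in sys.path:
    print(repr(entry))

# find_spec() reports where a module would be loaded from without importing it.
spec = importlib.util.find_spec("json")
print(spec.origin)
```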

One option is more accurate. The other is far more helpful.

ChrisA



#25079

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2012-07-09 12:24 +0000
Message-ID: <4ffacd61$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to: #25072
On Mon, 09 Jul 2012 18:41:28 +1000, Chris Angelico wrote:

> Does it really hurt to anthropomorphize 

Don't anthropomorphise computers. They don't like it when you do.


> and say that "Python looks for
> modules in the directories in sys.path" instead of "Module lookup
> consists of iterating blah blah blah yadda watermelon yadda blah".

I don't think so, I often talk about Python looking for files myself. The 
intentional stance is an incredibly powerful technique for understanding 
behaviour of all sorts of entities, sentient or not, from DNA to 
computers to corporations, and even people.

But it does depend on context. Sometimes you need more detail than just 
"Python looks". You need to know precisely *how* Python looks, and how it 
decides whether it has found or not. 

And note that I'm still using the intentional stance.



-- 
Steven



#25085

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-07-10 00:47 +1000
Message-ID: <mailman.1949.1341845269.4697.python-list@python.org>
In reply to: #25079
On Mon, Jul 9, 2012 at 10:24 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> But it does depend on context. Sometimes you need more detail than just
> "Python looks". You need to know precisely *how* Python looks, and how it
> decides whether it has found or not.

Agreed. So, looking back at the original context: A question was posed
that isn't really about Python at all, but more about file systems. I
gave a simple one-sentence answer that omitted heaps of details. It
didn't seem likely that someone confused by path names would be
changing the current directory inside the script, nor that the
distinction of who evaluates a path would be significant (how often
does _anyone_ care whether your path is parsed by Python, by the OS,
or by the underlying file system?).

ChrisA
