Discussion on some Code Issues

Started by	subhabangalore@gmail.com
First post	2012-07-04 16:21 -0700
Last post	2012-07-07 22:42 -0700
Articles	20 on this page of 27 — 8 participants

  Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-04 16:21 -0700
    Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-05 00:02 +0000
    Re: Discussion on some Code Issues Rick Johnson <rantingrickjohnson@gmail.com> - 2012-07-04 17:08 -0700
    Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-04 20:25 -0700
      Re: Discussion on some Code Issues Peter Otten <__peter__@web.de> - 2012-07-05 09:30 +0200
        Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-05 07:33 -0700
        Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-05 07:33 -0700
          Re: Discussion on some Code Issues Peter Otten <__peter__@web.de> - 2012-07-06 09:35 +0200
    Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 12:54 -0700
      Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-07 16:51 -0400
        Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 22:42 -0700
          Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-08 18:03 +1000
            Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-08 10:05 -0700
              Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 03:17 +1000
                Re: Discussion on some Code Issues Roy Smith <roy@panix.com> - 2012-07-08 14:17 -0400
                  Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 07:54 +1000
                    Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-09 00:57 +0000
                      Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-09 18:41 +1000
                        Re: Discussion on some Code Issues Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-07-09 12:24 +0000
                          Re: Discussion on some Code Issues Chris Angelico <rosuav@gmail.com> - 2012-07-10 00:47 +1000
                      Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-09 12:49 -0400
                Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-16 07:17 -0700
                Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-16 07:17 -0700
              Re: Discussion on some Code Issues MRAB <python@mrabarnett.plus.com> - 2012-07-08 19:27 +0100
            Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-08 10:05 -0700
          Re: Discussion on some Code Issues Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-07-08 15:07 -0400
        Re: Discussion on some Code Issues subhabangalore@gmail.com - 2012-07-07 22:42 -0700

#24882 — Discussion on some Code Issues

From	subhabangalore@gmail.com
Date	2012-07-04 16:21 -0700
Subject	Discussion on some Code Issues
Message-ID	<a4f0e2a9-cc3b-4081-beb9-82f229e95ba1@googlegroups.com>

Dear Group,

I am Sri Subhabrata Banerjee, writing from Gurgaon, India, to discuss some coding issues. If anyone in this learned room can shed some light, I would be grateful.

I got to code a bunch of documents which are combined together.
Like,

1) A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.
2) The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.
3) A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

The task is to separate the documents on the fly and to parse each of the documents with a definite set of rules.

Now, the way I am processing is:
I am clubbing all the documents together, as,

A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection. A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

But they are separated by a tag set, like,
A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.$
The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.$
A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

To detect the document boundaries, I am splitting them into a bag of words and using a simple for loop as,
for i in range(len(bag_words)):
    if bag_words[i]=="$":
        print (bag_words[i],i)

There is no issue. I am segmenting it nicely. I am using annotated corpus so applying parse rules.

The confusion comes next,

As per my problem statement the size of the file (of documents combined together) won’t increase on the fly. So, just to support all kinds of combinations I am appending in a list the “I” values, taking its length, and using slice. Works perfect. Question is, is there a smarter way to achieve this, and a curious question if the documents are on the fly with no preprocessed tag set like “$” how may I do it? From a bunch without EOF isn’t it a classification problem?
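
One possible reading of the approach described above, as a minimal sketch rather than the poster's actual code: collect the index of every "$" token, then slice the token list between consecutive indices. The function name split_documents and the sample tokens are hypothetical, and the sketch assumes "$" appears as a standalone token in bag_words.

```python
def split_documents(bag_words, delimiter="$"):
    """Split a flat list of tokens into one token list per document."""
    # Record the index of every delimiter token, as the loop above does.
    boundaries = [i for i, word in enumerate(bag_words) if word == delimiter]
    documents = []
    start = 0
    # Treat the end of the list as a final boundary so the last document is kept.
    for end in boundaries + [len(bag_words)]:
        documents.append(bag_words[start:end])
        start = end + 1  # skip the delimiter token itself
    return documents


# Hypothetical sample: three short "documents" separated by "$" tokens.
bag_words = "struck by lightning $ new particle $ bomb explosion".split()
docs = split_documents(bag_words)
```

Using enumerate avoids indexing with range(len(...)), and the slices work for any number of documents without knowing their count in advance.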

There is no question on parsing; it seems I am achieving it independent of the length of the document.

Can anyone in the group comment on how I am dealing with the problem, and suggest which portions should be improved and how?

Thanking You in Advance,

Best Regards,
Subhabrata Banerjee.

#24883

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-07-05 00:02 +0000
Message-ID	<4ff4d9a3$0$29864$c3e8da3$5496439d@news.astraweb.com>
In reply to	#24882

On Wed, 04 Jul 2012 16:21:46 -0700, subhabangalore wrote:

[...]
> I got to code a bunch of documents  which are combined together.
[...]
> The task is to separate the documents on the fly and to parse each of
> the documents with a definite set of rules.
> 
> Now, the way I am processing is:
> I am clubbing all the documents together, as,
[...]
> But they are separated by a tag set
[...] 
> To detect the document boundaries,

Let me see if I understand your problem.

You have a bunch of documents. You stick them all together into one 
enormous lump. And then you try to detect the boundaries between one file 
and the next within the enormous lump.

Why not just process each file separately? A simple for loop over the 
list of files, before consolidating them into one giant file, will avoid 
all the difficulty of trying to detect boundaries within files.

Instead of:

merge(output_filename, list_of_files)
for word in parse(output_filename):
    if boundary_detected: do_something()
    process(word)

Do this instead:

for filename in list_of_files:
    do_something()
    for word in parse(filename):
        process(word)
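
The per-file loop above might look like the following runnable sketch. The parse and process_files functions here are hypothetical stand-ins (a plain whitespace tokenizer and a collector), not the poster's parser, and the two temporary files exist only to make the example self-contained.

```python
import os
import tempfile


def parse(filename):
    """Yield whitespace-separated words from one file (hypothetical parser)."""
    with open(filename, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                yield word


def process_files(list_of_files):
    """Return one list of words per file; no boundary detection is needed."""
    results = []
    for filename in list_of_files:
        # Each iteration starts a new document, so per-document setup goes here.
        words = list(parse(filename))
        results.append(words)
    return results


# Usage with two throwaway files; the second one contains a literal "$".
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "struck by lightning"), ("b.txt", "costs $ 5")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

per_file = process_files(paths)
```

Because the file boundary is the document boundary, a "$" inside a file is just another word.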

> I am splitting them into a bag of
> words and using a simple for loop as, 
> for i in range(len(bag_words)):
>         if bag_words[i]=="$":
>             print (bag_words[i],i)

What happens if a file already has a $ in it?
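
To make the concern concrete, here is a small sketch under assumed inputs: if one document already contains "$", counting delimiter tokens reports one document too many. The helper names count_documents and delimiter_is_safe are hypothetical, as are the sample strings.

```python
def count_documents(bag_words, delimiter="$"):
    """Count documents, treating every delimiter token as a boundary."""
    return sum(1 for word in bag_words if word == delimiter) + 1


def delimiter_is_safe(documents, delimiter="$"):
    """Return True only if the delimiter occurs in none of the documents."""
    return all(delimiter not in doc for doc in documents)


# Two real documents; the first one happens to contain a "$".
documents = ["the fare was $ 5", "a bomb exploded"]
merged = " $ ".join(documents).split()

reported = count_documents(merged)        # counts the stray "$" as a boundary
safe = delimiter_is_safe(documents)       # check before merging
```

Running a check like delimiter_is_safe before merging would at least reveal the collision, although processing each file separately avoids the question altogether.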

> There is no issue. I am segmenting it nicely. I am using annotated
> corpus so applying parse rules.
> 
> The confusion comes next,
> 
> As per my problem statement the size of the file (of documents combined
> together) won’t increase on the fly. So, just to support all kinds of
> combinations I am appending in a list the “I” values, taking its length,
> and using slice. Works perfect.

I don't understand this. What sort of combinations do you think you need 
to support? What are "I" values, and why are they important?

-- 
Steven

From	Rick Johnson <rantingrickjohnson@gmail.com>
Date	2012-07-04 17:08 -0700
Message-ID	<59b3e7c3-83d8-43a7-9d24-1df682f0a353@j10g2000yqd.googlegroups.com>
In reply to	#24882

From	Peter Otten <__peter__@web.de>
Date	2012-07-05 09:30 +0200
Message-ID	<mailman.1814.1341473418.4697.python-list@python.org>
In reply to	#24893

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2012-07-07 16:51 -0400
Message-ID	<mailman.1901.1341694286.4697.python-list@python.org>
In reply to	#25031

From	Chris Angelico <rosuav@gmail.com>
Date	2012-07-08 18:03 +1000
Message-ID	<mailman.1910.1341734607.4697.python-list@python.org>
In reply to	#25035

From	Roy Smith <roy@panix.com>
Date	2012-07-08 14:17 -0400
Message-ID	<roy-249FE5.14174108072012@news.panix.com>
In reply to	#25047
