


Review Request of Python Code

Started by: subhabangalore@gmail.com
First post: 2016-03-08 20:18 -0800
Last post: 2016-03-10 19:56 +0000
Articles 12 — 8 participants



Contents

  Review Request of Python Code subhabangalore@gmail.com - 2016-03-08 20:18 -0800
    Re: Review Request of Python Code Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-03-09 16:10 +1100
    Re: Review Request of Python Code INADA Naoki <songofacandy@gmail.com> - 2016-03-09 16:52 +0900
    Re: Review Request of Python Code Friedrich Rentsch <anthra.norell@bluewin.ch> - 2016-03-09 10:06 +0100
    Re: Review Request of Python Code Matt Wheeler <m@funkyhat.org> - 2016-03-09 12:06 +0000
    Re: Review Request of Python Code Matt Wheeler <m@funkyhat.org> - 2016-03-09 12:33 +0000
    Re: Review Request of Python Code subhabangalore@gmail.com - 2016-03-10 10:12 -0800
      Re: Review Request of Python Code BartC <bc@freeuk.com> - 2016-03-10 18:36 +0000
      Re: Review Request of Python Code Matt Wheeler <m@funkyhat.org> - 2016-03-10 18:51 +0000
        Re: Review Request of Python Code subhabangalore@gmail.com - 2016-03-10 12:14 -0800
      RE: Review Request of Python Code Joaquin Alzola <Joaquin.Alzola@lebara.com> - 2016-03-10 19:12 +0000
    Re: Review Request of Python Code Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-10 19:56 +0000

#104381 — Review Request of Python Code

From: subhabangalore@gmail.com
Date: 2016-03-08 20:18 -0800
Subject: Review Request of Python Code
Message-ID: <f0973a0d-62ba-402b-ab23-cb68bdd15323@googlegroups.com>

Dear Group,

I am trying to write code that pulls data from a MySQL backend, annotates the words, and outputs the results as separate sentences, one per line. The code generally runs fine, but I feel the final step of producing the sentences could be better. For small data sets it is okay, but with 50,000 news articles it is dead slow. I am using Python 2.7.11 on Windows 7 with 8GB RAM.

I am trying to copy the code here, for your kind review. 

import MySQLdb
import nltk
def sql_connect_NewTest1():
    db = MySQLdb.connect(host="localhost",
                     user="*****",         
                     passwd="*****",  
                     db="abcd_efgh")
    cur = db.cursor()
    #cur.execute("SELECT * FROM newsinput limit 0,50000;") #REPORTING RUNTIME ERROR
    cur.execute("SELECT * FROM newsinput limit 0,50;")
    dict_open=open("/python27/NewTotalTag.txt","r") #OPENING THE DICTIONARY FILE 
    dict_read=dict_open.read() 
    dict_word=dict_read.split()
    a4=dict_word #Assignment for code. 
    list1=[]
    flist1=[]
    nlist=[]
    for row in cur.fetchall():
        #print row[2]
        var1=row[3]
        #print var1 #Printing lines
        #var2=len(var1) # Length of file
        var3=var1.split(".") #SPLITTING INTO LINES
        #print var3 #Printing The Lines 
        #list1.append(var1)
        var4=len(var3) #Number of all lines
        #print "No",var4
        for line in var3:
            #print line
            #flist1.append(line)
            linew=line.split()
            for word in linew:
                if word in a4:
                    windex=a4.index(word)
                    windex1=windex+1
                    word1=a4[windex1]
                    word2=word+"/"+word1
                    nlist.append(word2)
                    #print list1
                    #print nlist
                elif word not in a4:
                    word3=word+"/"+"NA"
                    nlist.append(word3)
                    #print list1
                    #print nlist
                else:
                    print "None"
            
    #print "###",flist1
    #print len(flist1)
    #db.close()
    #print nlist
    lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)] #TRYING TO SPLIT THE RESULTS AS SENTENCES 
    nlist1=lol(nlist,7)
    #print nlist1
    for i in nlist1:
        string1=" ".join(i)
        print i
        #print string1
    
   
Thanks in Advance.
        
        
    
    
    
    



#104383

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2016-03-09 16:10 +1100
Message-ID: <56dfb063$0$1508$c3e8da3$5496439d@news.astraweb.com>
In reply to: #104381

On Wednesday 09 March 2016 15:18, subhabangalore@gmail.com wrote:

> I am trying to copy the code here, for your kind review.
> 
> import MySQLdb
> import nltk
> def sql_connect_NewTest1():

This function says that it connects to the SQL database, but actually does 
much more. It does too much. Split your big function into small functions 
that do one thing each.

Your code has too many generic variable names like "var1" ("oh, this is 
variable number 1? how useful to know!") and too many commented out dead 
lines which make it hard to read. There are too many temporary variables 
that get used once, then never used again. You should give your variables 
names which explain what they are or what they are used for. You need to use 
better comments: explain *why* you do things, don't just write a comment 
that repeats what the code does:

dict_open = open(...) #OPENING THE DICTIONARY FILE 

That comment is useless. The code tells us that you are opening the 
dictionary file.

Because I don't completely understand what your code is trying to do, I 
cannot simplify the code or rewrite it very well. But I've tried. Try this, 
and see if it helps. If not, try simplifying the code some more, explain 
what it does better, and then we'll see if we can speed it up.



import MySQLdb
import nltk

def get_words(filename):
    """Return words from a dictionary file."""
    with open(filename, "r") as f:
        words = f.read().split()
    return words

def join_suffix(word, suffix):
    return word + "/" + suffix

def split_sentence(alist, size):
    """Split sentence (a list of words) into chunks of the given size."""
    return [alist[i:i+size] for i in range(0, len(alist), size)]

def process():
    db = MySQLdb.connect(host="localhost",
                     user="*****",
                     passwd="*****",
                     db="abcd_efgh")
    cur = db.cursor()
    cur.execute("SELECT * FROM newsinput limit 0,50;")
    dict_words = get_words("/python27/NewTotalTag.txt")
    words = []
    for row in cur.fetchall():
        lines = row[3].split(".")
        for line in lines:
            for word in line.split():
                if word in dict_words:
                    i = dict_words.index(word)
                    next_word =  dict_words[i + 1]
                else:
                    next_word = "NA"
                words.append(join_suffix(word, next_word))
    db.close()
    chunks = split_sentence(words, 7)
    for chunk in chunks:
        print chunk




-- 
Steve



#104393

From: INADA Naoki <songofacandy@gmail.com>
Date: 2016-03-09 16:52 +0900
Message-ID: <mailman.70.1457509983.15725.python-list@python.org>
In reply to: #104381

While MySQL doesn't have server-side cursors, MySQLdb has an SSCursor class.
https://github.com/PyMySQL/mysqlclient-python/blob/master/MySQLdb/cursors.py#L551

The default cursor fetches the whole MySQL response at once and converts it
into Python objects.
SSCursor fetches the MySQL response row by row, so it reduces Python memory
consumption (but the MySQL server can't release some resources until the
client has fetched all rows).

To use SSCursor, change

    cur = db.cursor()

to

    cur = db.cursor(MySQLdb.cursors.SSCursor)

and change

    for row in cur.fetchall():

to

    for row in cur:
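
The difference between `fetchall()` and iterating over the cursor can be sketched without a MySQL server. The example below uses the standard-library sqlite3 module purely as a stand-in (an assumption made so the sketch runs anywhere); the table layout, the sample rows, and the function name `count_words_streaming` are hypothetical. With MySQLdb the loop would look the same once the cursor is created as described above.

```python
import sqlite3


def count_words_streaming(conn):
    """Count words across all rows without building a list of every row."""
    cur = conn.cursor()
    cur.execute("SELECT id, body FROM newsinput")
    total = 0
    # Iterating over the cursor yields one row at a time,
    # instead of materialising the full result as fetchall() does.
    for row in cur:
        total += len(row[1].split())
    return total


# Hypothetical in-memory table standing in for the real newsinput table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE newsinput (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO newsinput (body) VALUES (?)",
    [("one two three.",), ("four five.",)],
)
streamed_total = count_words_streaming(conn)
conn.close()
print(streamed_total)
```

Whether this helps in practice depends on where the time is actually going; it mainly reduces peak memory on the client side.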

-- 
INADA Naoki  <songofacandy@gmail.com>



#104397

From: Friedrich Rentsch <anthra.norell@bluewin.ch>
Date: 2016-03-09 10:06 +0100
Message-ID: <mailman.74.1457514455.15725.python-list@python.org>
In reply to: #104381

On 03/09/2016 05:18 AM, subhabangalore@gmail.com wrote:
> [snip]

I have a modular processing framework in its final stages of completion 
whose purpose is to save (a lot of) time coding the kind of problem you 
describe. I intend to upload the system and am currently interested in 
real-world cases for the manual. I tried coding your problem, thinking 
it would take no more than a minute. It wasn't that easy, because you 
don't say what input you have, nor what you expect your program to do. 
Inferring the missing info from your code takes more time than I can 
spare. So, if you would give a few lines of your input and explain your 
purpose, I'd be happy to help.

Frederic



#104403

From: Matt Wheeler <m@funkyhat.org>
Date: 2016-03-09 12:06 +0000
Message-ID: <mailman.75.1457525219.15725.python-list@python.org>
In reply to: #104381

I'm just going to focus on a couple of lines as others are already
looking at the whole thing:

On 9 March 2016 at 04:18,  <subhabangalore@gmail.com> wrote:
> [snip].........
>                 if word in a4:
>                     [stuff]
>                 elif word not in a4:
>                     [other stuff]
>                 else:
>                     print "None"

This is bad for a couple of reasons.

Most obviously, your `else: print "None"` case can never be reached.
word not in a4 is the inverse of word in a4.
That also means for the `not` case the entire a4 list is scanned
*twice*, and the second time is completely pointless.

This can be simplified to
                 if word in a4:
                     [stuff]
                 else:
                     [other stuff]

But we can still do better. A list is a poor choice for this kind of
lookup, as Python has no way to find elements other than by checking
them one after another. (given (one of the) name(s) you've given it
sounds a bit like "dictionary" I assume it contains rather a lot of
items)

If you change one other line:

> dict_word=dict_read.split()
> a4=dict_word #Assignment for code.

a4=set(dict_word)
#(this could of course be shortened further but I'll leave that to you/others)

You'll probably see a massive speedup in your code, possibly even
dwarfing the speedup you see from more sensible database access like
INADA Naoki suggested (though you should definitely still do that
too!), especially if your word list is very large.

This is because the set type uses a hashmap internally, making lookups
for matches extremely fast, compared to scanning through the list.
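
A rough way to see the difference described above is to time membership tests against a list and a set holding the same words. This is only a sketch: the word list is synthetic, the names `as_list`, `as_set`, `probes`, and `scan` are made up for illustration, and the actual timings will vary by machine.

```python
import timeit

# Synthetic stand-in for a large word list.
words = ["w%d" % i for i in range(20000)]
as_list = list(words)
as_set = set(words)

# One word near the end of the list and one that is absent:
# both force the list to be scanned almost completely.
probes = ["w19999", "missing"]


def scan(container):
    """Return the membership result for each probe word."""
    return [p in container for p in probes]


list_seconds = timeit.timeit(lambda: scan(as_list), number=100)
set_seconds = timeit.timeit(lambda: scan(as_set), number=100)
print("list: %.4fs  set: %.4fs" % (list_seconds, set_seconds))
```

Both containers give the same answers; only the cost of finding them differs.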


-- 
Matt Wheeler
http://funkyh.at



#104405

From: Matt Wheeler <m@funkyhat.org>
Date: 2016-03-09 12:33 +0000
Message-ID: <mailman.76.1457526806.15725.python-list@python.org>
In reply to: #104381

On 9 March 2016 at 12:06, Matt Wheeler <m@funkyhat.org> wrote:
> But we can still do better. A list is a poor choice for this kind of
> lookup, as Python has no way to find elements other than by checking
> them one after another. (given (one of the) name(s) you've given it
> sounds a bit like "dictionary" I assume it contains rather a lot of
> items)

Sorry, I've just read your original code properly and see that you're
looking up the next item in the list, this means a set is not
suitable, as it doesn't preserve order (however, your original code is
open to an IndexError if the last element in your list is matched).

If you could provide a sample of the NewTotalTag.txt file data that
would be helpful, but working with the information I've got we can
still get a comparable speedup, by constructing a dict upfront mapping
each word to the next one[1]:

dict_word=dict_read.split()
dict_word.append('N/A')
# Assuming that 'N/A' is a reasonable output if the last word in your
# list is matched.
# This works around the IndexError your current code is exposed to.
# The slice ([:-1]) means we don't try to add the last item to the new a4 dict.
a4={}
for index,word in enumerate(dict_word[:-1]):
    a4[word] = dict_word[index+1]

This creates a dict where each key maps to the corresponding next
word, which you can use later in your lookup instead of fetching by
index. i.e. a4[word] instead of a4[windex+1].
This means you're saving yet *another* scan through of the entire list
(`a4.index(word)` has to scan yet again) for the positive matches.


[1] though I suspect if we get to see a sample of your data file there
may be a better way
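
The same word-to-next-word mapping can be sketched with zip() instead of index arithmetic. The function name `build_next_word_map` and the sample tokens are illustrative and not from the thread. One deliberate choice here: setdefault() keeps the first occurrence of a repeated word, which matches what list.index() returned in the original code, whereas a plain assignment in a loop would keep the last one.

```python
def build_next_word_map(tokens, sentinel="N/A"):
    """Map each token to the token that follows it in the list."""
    # Pad with a sentinel so the final token also has a "next" value.
    padded = list(tokens) + [sentinel]
    mapping = {}
    for current, following in zip(padded, padded[1:]):
        # Keep the first occurrence, like list.index() would find.
        mapping.setdefault(current, following)
    return mapping


tokens = "w1 tag1 w2 tag2".split()
next_word = build_next_word_map(tokens)

# A dict lookup replaces both the "in" scan and the index() scan.
print(next_word.get("w1", "NA"))
print(next_word.get("unknown", "NA"))
```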

-- 
Matt Wheeler
http://funkyh.at



#104536

From: subhabangalore@gmail.com
Date: 2016-03-10 10:12 -0800
Message-ID: <af65a7a6-3179-4bca-9022-ae0d2ec61a11@googlegroups.com>
In reply to: #104381

Dear Group,

Thank you all for your time and for all the suggestions.

Thank you, Steve, for writing out the whole code. It works
fine, but speed is still an issue. We need to speed it up.

Inada, I tried changing to
cur = db.cursor(MySQLdb.cursors.SSCursor), but my System Admin
said that may not be an issue.

Friedrich, my problem is this: I have a big repository of .txt
files stored in MySQL at the backend. I have another list of words
with their possible tags. The tags are not conventional
Part-of-Speech (PoS) tags; they are defined by others.
The code is expected to read each file, line by line.
For each word in a line it scans the list for the appropriate
tag; if one is found it assigns that tag, otherwise it assigns NA.
The assignment should be in the format word/tag, so that
a string of n words looks like

w1/tag w2/tag w3/tag w4/tag ....wn/tag

where tag may be a tag from the list or NA, as the case may be.

This format is used because the files are expected to be tagged
in Brown Corpus format. There is a Python library named NLTK.
If I want to save my data for use with its models, I need to
follow some specifications. I want to use the Tagged Corpus format.

The tagged data in this format should come out as one
tagged sentence per line, or as a lattice.

They expect the data to be saved in .pos format, but I am not
doing that in this code at present; I may do that later.
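
The output format described above (word/tag pairs, one tagged sentence per line) could be sketched roughly as follows. The function names `tag_sentence` and `tag_text`, the sample text, and the sample tag dict are all hypothetical. Sentences are split on "." as in the original code, which is a crude assumption; this sketch does not use NLTK.

```python
def tag_sentence(sentence, tags, default="NA"):
    """Return 'word/tag' pairs for one sentence, joined by spaces."""
    return " ".join(
        "%s/%s" % (word, tags.get(word, default))
        for word in sentence.split()
    )


def tag_text(text, tags):
    """Return one tagged line per non-empty sentence in text."""
    lines = []
    for sentence in text.split("."):
        tagged = tag_sentence(sentence, tags)
        if tagged:  # skip the empty piece after the final full stop
            lines.append(tagged)
    return lines


sample_tags = {"w1": "tag1", "w2": "tag2", "w3": "tag3"}
tagged_lines = tag_text("w1 w2. w3 x.", sample_tags)
for line in tagged_lines:
    print(line)
```

Splitting on sentence boundaries keeps each output line aligned with a sentence, instead of cutting the word list into fixed chunks of seven.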

Please let me know if I need to give any more information.

Matt, thank you for if...else suggestion, the data of NewTotalTag.txt
is like a simple list of words with unconventional tags, like,

w1 tag1
w2 tag2
w3 tag3
...
...
w3  tag3

like that. 

Regards,
Subhabrata  

  



#104539

From: BartC <bc@freeuk.com>
Date: 2016-03-10 18:36 +0000
Message-ID: <nbsela$6d9$1@dont-email.me>
In reply to: #104536

On 10/03/2016 18:12, subhabangalore@gmail.com wrote:
> On Wednesday, March 9, 2016 at 9:49:17 AM UTC+5:30, subhaba...@gmail.com wrote:

> Thank you Steve for writing the whole code. It is working full
> and fine. But speed is still an issue. We need to speed up.

Which bit is too slow? (Perhaps the print statements in your original 
code will give a clue.)

How many rows, lines and words are we talking about (ie. how many inner 
loops)? How big is the text file? Is the outer function called once and 
that shows the problem, or many times?

It might be that the task is big enough that it actually takes that long.
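
One simple way to answer the "which bit is too slow?" question is to accumulate wall-clock time per stage. The sketch below is only an illustration: the helper `timed`, the stage labels, and the tiny sample text are invented here, and the stages would be replaced by the real database fetch, dictionary load, and tagging loop.

```python
from timeit import default_timer


def timed(label, func, timings, *args):
    """Call func(*args), adding the elapsed time to timings[label]."""
    start = default_timer()
    result = func(*args)
    timings[label] = timings.get(label, 0.0) + (default_timer() - start)
    return result


timings = {}
text = "w1 w2. w3 w4."

# Each stage is wrapped separately so its cost shows up on its own line.
sentences = timed("split", text.split, timings, ".")
tokens = timed("tag", lambda parts: [s.split() for s in parts], timings, sentences)

for label in sorted(timings):
    print("%-6s %.6fs" % (label, timings[label]))
```

For a finer breakdown, the standard-library cProfile module reports time per function without any manual wrapping.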

-- 
Bartc



#104541

From: Matt Wheeler <m@funkyhat.org>
Date: 2016-03-10 18:51 +0000
Message-ID: <mailman.145.1457635934.15725.python-list@python.org>
In reply to: #104536

On 10 March 2016 at 18:12,  <subhabangalore@gmail.com> wrote:
> Matt, thank you for if...else suggestion, the data of NewTotalTag.txt
> is like a simple list of words with unconventional tags, like,
>
> w1 tag1
> w2 tag2
> w3 tag3
> ...
> ...
> w3  tag3
>
> like that.

I suspected so. The way your code currently works, if your input text
contains one of the tags, e.g. 'tag1' you'll get an entry in your
output something like 'tag1/w2'. I assume you don't want that :).

This is because you're using a single list to include all of the tags.
Try something along the lines of:

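dict_word={} # maps each word to its tag
for line in dict_read.splitlines():
    if not line.strip():
        continue # skip blank lines
    # split() with no argument tolerates runs of spaces, e.g. 'w3  tag3'
    word, tag = line.split()
    dict_word[word] = tag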

Notice I'm using splitlines() instead of split() to do the initial
chopping up of your input. split() will split on any whitespace by
default. splitlines should be self-explanatory.

I would split this and the file-open out into a separate function at
this point. Large blobs of sequential code are not particularly easy
on the eyes or the brain -- choose a sensible name, like
load_dictionary. Perhaps something you could call like:

dict_word = load_dictionary("NewTotalTag.txt")


You also aren't closing the file that you open at any point -- once
you've loaded the data from it there's no need to keep the file opened
(look up context managers).
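
A minimal sketch of such a load_dictionary function, using a context manager so the file is closed as soon as it has been read. It assumes the file holds one "word tag" pair per line, as described in the thread; skipping lines that do not have exactly two fields is an added assumption, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile


def load_dictionary(filename):
    """Read 'word tag' lines from filename into a {word: tag} dict."""
    tags = {}
    # The with-block closes the file even if parsing raises an error.
    with open(filename, "r") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # assumption: ignore blank or malformed lines
            word, tag = parts
            tags[word] = tag
    return tags


# Write a small sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("w1 tag1\nw2 tag2\n\nw3  tag3\n")
    demo_path = tmp.name
try:
    demo_tags = load_dictionary(demo_path)
finally:
    os.remove(demo_path)

print(demo_tags.get("w1", "NA"))
print(demo_tags.get("unknown", "NA"))
```

With the tags in a dict, each lookup is a single `demo_tags.get(word, "NA")` call and tags can no longer be mistaken for words.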

-- 
Matt Wheeler
http://funkyh.at



#104550

From: subhabangalore@gmail.com
Date: 2016-03-10 12:14 -0800
Message-ID: <bcd40141-b482-4fb9-8346-f8a12f881924@googlegroups.com>
In reply to: #104541

On Friday, March 11, 2016 at 12:22:31 AM UTC+5:30, Matt Wheeler wrote:
> [snip]

Dear Matt,

I want the output in the format w1/tag1 ... You can find my detailed problem
statement in my reply to someone else's query. If you like, I can write it out again for you.

Regards,
Subhabrata



#104544

From: Joaquin Alzola <Joaquin.Alzola@lebara.com>
Date: 2016-03-10 19:12 +0000
Message-ID: <mailman.148.1457638025.15725.python-list@python.org>
In reply to: #104536

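`limit 0,50` is not a decimal number: in MySQL it is the `LIMIT offset, row_count` form, so it returns the first 50 rows.
It works as written, but `LIMIT 50` is the plainer way to say the same thing.

Then clean up your code a bit and remove the commented-out lines.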




#104548

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2016-03-10 19:56 +0000
Message-ID: <mailman.152.1457639827.15725.python-list@python.org>
In reply to: #104381

On 09/03/2016 04:18, subhabangalore@gmail.com wrote:
> Dear Group,
>
> I am trying to write a code for pulling data from MySQL at the backend and annotating words and trying to put the results as separated sentences with each line. The code is generally running fine but I am feeling it may be better in the end of giving out sentences, and for small data sets it is okay but with 50,000 news articles it is performing dead slow. I am using Python2.7.11 on Windows 7 with 8GB RAM.
>
> I am trying to copy the code here, for your kind review.
>
>      cur = db.cursor()
>      dict_open=open("/python27/NewTotalTag.txt","r") #OPENING THE DICTIONARY FILE

As you've had and acknowledged some sound answers, I'll simply point out 
that many people find the first line above, with just that little bit of 
whitespace, far easier to read than the second.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


