Groups > comp.lang.python > #104381 > unrolled thread
| Started by | subhabangalore@gmail.com |
|---|---|
| First post | 2016-03-08 20:18 -0800 |
| Last post | 2016-03-10 19:56 +0000 |
| Articles | 12 — 8 participants |
Review Request of Python Code subhabangalore@gmail.com - 2016-03-08 20:18 -0800
Re: Review Request of Python Code Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-03-09 16:10 +1100
Re: Review Request of Python Code INADA Naoki <songofacandy@gmail.com> - 2016-03-09 16:52 +0900
Re: Review Request of Python Code Friedrich Rentsch <anthra.norell@bluewin.ch> - 2016-03-09 10:06 +0100
Re: Review Request of Python Code Matt Wheeler <m@funkyhat.org> - 2016-03-09 12:06 +0000
Re: Review Request of Python Code Matt Wheeler <m@funkyhat.org> - 2016-03-09 12:33 +0000
Re: Review Request of Python Code subhabangalore@gmail.com - 2016-03-10 10:12 -0800
Re: Review Request of Python Code BartC <bc@freeuk.com> - 2016-03-10 18:36 +0000
Re: Review Request of Python Code Matt Wheeler <m@funkyhat.org> - 2016-03-10 18:51 +0000
Re: Review Request of Python Code subhabangalore@gmail.com - 2016-03-10 12:14 -0800
RE: Review Request of Python Code Joaquin Alzola <Joaquin.Alzola@lebara.com> - 2016-03-10 19:12 +0000
Re: Review Request of Python Code Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-10 19:56 +0000
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2016-03-08 20:18 -0800 |
| Subject | Review Request of Python Code |
| Message-ID | <f0973a0d-62ba-402b-ab23-cb68bdd15323@googlegroups.com> |
Dear Group,
I am trying to write code that pulls data from a MySQL backend, annotates the words, and outputs the results as separate sentences, one per line. The code generally runs fine, but I feel the final step of outputting sentences could be better. For small data sets it is okay, but with 50,000 news articles it is dead slow. I am using Python 2.7.11 on Windows 7 with 8GB RAM.
I am copying the code here for your kind review.
import MySQLdb
import nltk
def sql_connect_NewTest1():
    db = MySQLdb.connect(host="localhost",
                         user="*****",
                         passwd="*****",
                         db="abcd_efgh")
    cur = db.cursor()
    #cur.execute("SELECT * FROM newsinput limit 0,50000;") #REPORTING RUNTIME ERROR
    cur.execute("SELECT * FROM newsinput limit 0,50;")
    dict_open=open("/python27/NewTotalTag.txt","r") #OPENING THE DICTIONARY FILE
    dict_read=dict_open.read()
    dict_word=dict_read.split()
    a4=dict_word #Assignment for code.
    list1=[]
    flist1=[]
    nlist=[]
    for row in cur.fetchall():
        #print row[2]
        var1=row[3]
        #print var1 #Printing lines
        #var2=len(var1) # Length of file
        var3=var1.split(".") #SPLITTING INTO LINES
        #print var3 #Printing The Lines
        #list1.append(var1)
        var4=len(var3) #Number of all lines
        #print "No",var4
        for line in var3:
            #print line
            #flist1.append(line)
            linew=line.split()
            for word in linew:
                if word in a4:
                    windex=a4.index(word)
                    windex1=windex+1
                    word1=a4[windex1]
                    word2=word+"/"+word1
                    nlist.append(word2)
                    #print list1
                    #print nlist
                elif word not in a4:
                    word3=word+"/"+"NA"
                    nlist.append(word3)
                    #print list1
                    #print nlist
                else:
                    print "None"
    #print "###",flist1
    #print len(flist1)
    #db.close()
    #print nlist
    lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)] #TRYING TO SPLIT THE RESULTS AS SENTENCES
    nlist1=lol(nlist,7)
    #print nlist1
    for i in nlist1:
        string1=" ".join(i)
        print i
        #print string1
Thanks in Advance.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2016-03-09 16:10 +1100 |
| Message-ID | <56dfb063$0$1508$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #104381 |
On Wednesday 09 March 2016 15:18, subhabangalore@gmail.com wrote:
> I am trying to copy the code here, for your kind review.
>
> import MySQLdb
> import nltk
> def sql_connect_NewTest1():
This function says that it connects to the SQL database, but actually does
much more. It does too much. Split your big function into small functions
that do one thing each.
Your code has too many generic variable names like "var1" ("oh, this is
variable number 1? how useful to know!") and too many commented out dead
lines which make it hard to read. There are too many temporary variables
that get used once, then never used again. You should give your variables
names which explain what they are or what they are used for. You need to use
better comments: explain *why* you do things, don't just write a comment
that repeats what the code does:
dict_open = open(...) #OPENING THE DICTIONARY FILE
That comment is useless. The code tells us that you are opening the
dictionary file.
Because I don't completely understand what your code is trying to do, I
cannot simplify the code or rewrite it very well. But I've tried. Try this,
and see if it helps. If not, try simplifying the code some more, explain
what it does better, and then we'll see if we can speed it up.
import MySQLdb
import nltk

def get_words(filename):
    """Return words from a dictionary file."""
    with open(filename, "r") as f:
        words = f.read().split()
    return words

def join_suffix(word, suffix):
    return word + "/" + suffix

def split_sentence(alist, size):
    """Split sentence (a list of words) into chunks of the given size."""
    return [alist[i:i+size] for i in range(0, len(alist), size)]

def process():
    db = MySQLdb.connect(host="localhost",
                         user="*****",
                         passwd="*****",
                         db="abcd_efgh")
    cur = db.cursor()
    cur.execute("SELECT * FROM newsinput limit 0,50;")
    dict_words = get_words("/python27/NewTotalTag.txt")
    words = []
    for row in cur.fetchall():
        lines = row[3].split(".")
        for line in lines:
            for word in line.split():
                if word in dict_words:
                    i = dict_words.index(word)
                    next_word = dict_words[i + 1]
                else:
                    next_word = "NA"
                words.append(join_suffix(word, next_word))
    db.close()
    chunks = split_sentence(words, 7)
    for chunk in chunks:
        print chunk
--
Steve
| From | INADA Naoki <songofacandy@gmail.com> |
|---|---|
| Date | 2016-03-09 16:52 +0900 |
| Message-ID | <mailman.70.1457509983.15725.python-list@python.org> |
| In reply to | #104381 |
While MySQL doesn't have server-side cursors, MySQLdb has an SSCursor class.
https://github.com/PyMySQL/mysqlclient-python/blob/master/MySQLdb/cursors.py#L551
The default cursor fetches the whole MySQL response at once and converts it
into Python objects.
SSCursor fetches the MySQL response row by row, so it reduces Python memory
consumption (but the MySQL server can't release some resources until the
client has fetched all rows).
To use SSCursor, change
    cur = db.cursor()
to
    cur = db.cursor(MySQLdb.cursors.SSCursor)
and change
    for row in cur.fetchall():
to
    for row in cur:
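The difference between the two loop shapes can be sketched without a MySQL server. The example below is only an illustration: it uses the standard-library `sqlite3` module as a stand-in for MySQLdb, and the table contents and the `count_words` helper are made up. With MySQLdb, the same loop would be applied to a cursor created with `MySQLdb.cursors.SSCursor`.

```python
import sqlite3

def count_words(cursor):
    """Consume rows one at a time instead of materialising them with fetchall()."""
    total = 0
    for row in cursor:          # rows are pulled from the cursor one by one
        total += len(row[0].split())
    return total

# Stand-in database: an in-memory SQLite table with hypothetical data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE newsinput (body TEXT)")
db.executemany("INSERT INTO newsinput VALUES (?)",
               [("one two three.",), ("four five.",)])

cur = db.cursor()
cur.execute("SELECT body FROM newsinput")
word_count = count_words(cur)
db.close()
print(word_count)
```

Because each row is processed and then discarded, memory use stays roughly constant regardless of how many rows the query returns.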
--
INADA Naoki <songofacandy@gmail.com>
| From | Friedrich Rentsch <anthra.norell@bluewin.ch> |
|---|---|
| Date | 2016-03-09 10:06 +0100 |
| Message-ID | <mailman.74.1457514455.15725.python-list@python.org> |
| In reply to | #104381 |
I have a modular processing framework in its final stages of completion
whose purpose is to save (a lot of) time coding the kind of problem you
describe. I intend to upload the system and am currently interested in
real-world cases for the manual. I tried coding your problem, thinking
it would take no more than a minute. It wasn't that easy, because you don't
say what input you have, nor what you expect your program to do.
Inferring the missing info from your code takes more time than I can
spare. So, if you would give a few lines of your input and explain your
purpose, I'd be happy to help.
Frederic
| From | Matt Wheeler <m@funkyhat.org> |
|---|---|
| Date | 2016-03-09 12:06 +0000 |
| Message-ID | <mailman.75.1457525219.15725.python-list@python.org> |
| In reply to | #104381 |
I'm just going to focus on a couple of lines as others are already
looking at the whole thing:
On 9 March 2016 at 04:18, <subhabangalore@gmail.com> wrote:
> [snip].........
>     if word in a4:
>         [stuff]
>     elif word not in a4:
>         [other stuff]
>     else:
>         print "None"
This is bad for a couple of reasons.
Most obviously, your `else: print "None"` case can never be reached.
word not in a4 is the inverse of word in a4.
That also means for the `not` case the entire a4 list is scanned
*twice*, and the second time is completely pointless.
This can be simplified to
    if word in a4:
        [stuff]
    else:
        [other stuff]
But we can still do better. A list is a poor choice for this kind of
lookup, as Python has no way to find elements other than by checking
them one after another. (given (one of the) name(s) you've given it
sounds a bit like "dictionary" I assume it contains rather a lot of
items)
If you change one other line:
> dict_word=dict_read.split()
> a4=dict_word #Assignment for code.
a4=set(dict_word)
#(this could of course be shortened further but I'll leave that to you/others)
You'll probably see a massive speedup in your code, possibly even
dwarfing the speedup you see from more sensible database access like
INADA Naoki suggested (though you should definitely still do that
too!), especially if your word list is very large.
This is because the set type uses a hashmap internally, making lookups
for matches extremely fast, compared to scanning through the list.
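A rough way to see the difference is to time the same membership tests against a list and a set. This is only a sketch with made-up data (`words`, `probes`); absolute timings vary by machine, so none are shown here.

```python
import timeit

# Hypothetical word list: 20,000 distinct strings.
words = ["w%d" % i for i in range(20000)]
word_set = set(words)

# Probes that are absent force the list to be scanned to the end.
probes = ["missing%d" % i for i in range(50)]

def count_hits(container):
    """Count how many probes are found in the container."""
    return sum(1 for p in probes if p in container)

list_time = timeit.timeit(lambda: count_hits(words), number=5)
set_time = timeit.timeit(lambda: count_hits(word_set), number=5)

print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

Both containers give the same answers; only the cost of each `in` test differs.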
--
Matt Wheeler
http://funkyh.at
| From | Matt Wheeler <m@funkyhat.org> |
|---|---|
| Date | 2016-03-09 12:33 +0000 |
| Message-ID | <mailman.76.1457526806.15725.python-list@python.org> |
| In reply to | #104381 |
On 9 March 2016 at 12:06, Matt Wheeler <m@funkyhat.org> wrote:
> But we can still do better. A list is a poor choice for this kind of
> lookup, as Python has no way to find elements other than by checking
> them one after another. (given (one of the) name(s) you've given it
> sounds a bit like "dictionary" I assume it contains rather a lot of
> items)
Sorry, I've just read your original code properly and see that you're
looking up the next item in the list, this means a set is not
suitable, as it doesn't preserve order (however, your original code is
open to an IndexError if the last element in your list is matched).
If you could provide a sample of the NewTotalTag.txt file data that
would be helpful, but working with the information I've got we can
still get a comparable speedup, by constructing a dict upfront mapping
each word to the next one[1]:
dict_word=dict_read.split()
dict_word.append('N/A')
# Assuming that 'N/A' is a reasonable output if the last word in your
# list is matched.
# This works around the IndexError your current code is exposed to.
# The slice ([:-1]) means we don't try to add the last item to the new a4 dict.
a4={}
for index,word in enumerate(dict_word[:-1]):
    a4[word] = dict_word[index+1]
This creates a dict where each key maps to the corresponding next
word, which you can use later in your lookup instead of fetching by
index. i.e. a4[word] instead of a4[windex+1].
This means you're saving yet *another* scan through of the entire list
(`a4.index(word)` has to scan yet again) for the positive matches.
[1] though I suspect if we get to see a sample of your data file there
may be a better way
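The lookup step could then look roughly like this. It is a sketch on a tiny made-up word list; `build_next_word_map` and `annotate` are hypothetical names, and `dict.get` supplies the "NA" fallback for words that are not in the mapping.

```python
def build_next_word_map(dict_word):
    """Map each entry to the entry that follows it; the last one maps to 'N/A'."""
    padded = dict_word + ["N/A"]
    mapping = {}
    for index, word in enumerate(dict_word):
        # setdefault keeps the first occurrence, which is what list.index() finds
        mapping.setdefault(word, padded[index + 1])
    return mapping

def annotate(words, mapping):
    """Return word/next pairs, using 'NA' for words missing from the mapping."""
    return [word + "/" + mapping.get(word, "NA") for word in words]

a4 = build_next_word_map(["cat", "NOUN", "runs", "VERB"])
result = annotate(["cat", "runs", "fast"], a4)
print(result)
```

Using `setdefault` rather than plain assignment is a deliberate choice here: if a word appears more than once in the file, the first occurrence wins, which matches what the original `a4.index(word)` lookup would have returned.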
--
Matt Wheeler
http://funkyh.at
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2016-03-10 10:12 -0800 |
| Message-ID | <af65a7a6-3179-4bca-9022-ae0d2ec61a11@googlegroups.com> |
| In reply to | #104381 |
Dear Group,
Thank you all, for your kind time and all suggestions in helping me.
Thank you Steve for writing the whole code. It is working
fine, but speed is still an issue. We need to speed it up.
Inada, I tried changing to
cur = db.cursor(MySQLdb.cursors.SSCursor) but my System Admin
said that may not be an issue.
Friedrich, my problem is that I have a big repository of .txt
files stored in MySQL at the backend. I have another list of words with
their possible tags. The tags are not conventional Parts of Speech (PoS)
tags; they are defined by others.
The code is expected to read each file, line by line.
On reading each line it will scan the list for the appropriate
tag; if one is found it will assign it, otherwise it will assign NA.
The assignment should be in the format word/tag, so that
a string of n words should look like
w1/tag w2/tag w3/tag w4/tag ....wn/tag,
where tag may be a tag from the list or NA, as the case may be.
This format is used because the files are expected to be tagged
in Brown Corpus format. There is a Python library named NLTK.
If I want to save my data for use with their models, I need to follow
some specifications. I want to use it in Tagged Corpus format.
The tagged data coming out in this format should have one
tagged sentence on each new line, or a lattice.
They expect the data to be saved in .pos format, but presently
I am not doing that in this code; I may do that later.
Please let me know if I need to give any more information.
Matt, thank you for the if...else suggestion. The data of NewTotalTag.txt
is a simple list of words with unconventional tags, like,
w1 tag1
w2 tag2
w3 tag3
...
...
w3 tag3
like that.
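Put together, the behaviour described above might be sketched as follows. This is an illustration under assumptions, not the code from the thread: the tag table is given inline as a dict, sentences are split on "." as in the original code, and `tag_sentence` and `tag_text` are hypothetical helper names.

```python
def tag_sentence(sentence, tags):
    """Return 'w1/tag w2/tag ...' for one sentence, using 'NA' for unknown words."""
    return " ".join(word + "/" + tags.get(word, "NA")
                    for word in sentence.split())

def tag_text(text, tags):
    """Split text on '.' and return one tagged line per non-empty sentence."""
    return [tag_sentence(s, tags) for s in text.split(".") if s.strip()]

# Hypothetical tag table in the shape of the word/tag list described above.
tags = {"w1": "tag1", "w2": "tag2", "w3": "tag3"}
lines = tag_text("w1 w2 x. w3 w1.", tags)
for line in lines:
    print(line)
```

Tagging sentence by sentence yields one tagged sentence per line directly, instead of cutting the flat word list into fixed chunks of 7 afterwards.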
Regards,
Subhabrata
| From | BartC <bc@freeuk.com> |
|---|---|
| Date | 2016-03-10 18:36 +0000 |
| Message-ID | <nbsela$6d9$1@dont-email.me> |
| In reply to | #104536 |
On 10/03/2016 18:12, subhabangalore@gmail.com wrote:
> On Wednesday, March 9, 2016 at 9:49:17 AM UTC+5:30, subhaba...@gmail.com wrote:
> Thank you Steve for writing the whole code. It is working full
> and fine. But speed is still an issue. We need to speed up.
Which bit is too slow? (Perhaps the print statements in your original code
will give a clue.)
How many rows, lines and words are we talking about (ie. how many inner
loops)? How big is the text file? Is the outer function called once and that
shows the problem, or many times?
It might be that the task is big enough that it actually takes that long.
--
Bartc
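One way to find out which part is slow is to time each stage separately. The sketch below uses made-up stand-in stages (`load_rows`, `annotate_rows`) rather than the real database and dictionary code, and `timed` is a hypothetical helper; `timeit.default_timer` is a standard-library clock.

```python
from timeit import default_timer

def timed(label, func, *args):
    """Run func(*args), print how long it took, and return its result."""
    start = default_timer()
    result = func(*args)
    elapsed = default_timer() - start
    print("%s: %.4f s" % (label, elapsed))
    return result

# Hypothetical stand-ins for the real stages of the program.
def load_rows():
    return ["some words here. more words."] * 1000

def annotate_rows(rows):
    return [w + "/NA" for row in rows for w in row.split()]

rows = timed("fetch rows", load_rows)
annotated = timed("annotate", annotate_rows, rows)
```

Wrapping the database fetch, the dictionary load and the annotation loop in this way would show which of them dominates the run time before anything is optimised.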
| From | Matt Wheeler <m@funkyhat.org> |
|---|---|
| Date | 2016-03-10 18:51 +0000 |
| Message-ID | <mailman.145.1457635934.15725.python-list@python.org> |
| In reply to | #104536 |
On 10 March 2016 at 18:12, <subhabangalore@gmail.com> wrote:
> Matt, thank you for if...else suggestion, the data of NewTotalTag.txt
> is like a simple list of words with unconventional tags, like,
>
> w1 tag1
> w2 tag2
> w3 tag3
> ...
> ...
> w3 tag3
>
> like that.
I suspected so. The way your code currently works, if your input text
contains one of the tags, e.g. 'tag1' you'll get an entry in your
output something like 'tag1/w2'. I assume you don't want that :).
This is because you're using a single list to include all of the tags.
Try something along the lines of:
dict_word={} #empty dictionary
for line in dict_read.splitlines():
    word, tag = line.split(' ')
    dict_word[word] = tag
Notice I'm using splitlines() instead of split() to do the initial
chopping up of your input. split() will split on any whitespace by
default. splitlines should be self-explanatory.
I would split this and the file-open out into a separate function at
this point. Large blobs of sequential code are not particularly easy
on the eyes or the brain -- choose a sensible name, like
load_dictionary. Perhaps something you could call like:
dict_word = load_dictionary("NewTotalTag.txt")
You also aren't closing the file that you open at any point -- once
you've loaded the data from it there's no need to keep the file opened
(look up context managers).
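Such a function might look like the sketch below. It assumes one whitespace-separated `word tag` pair per line, as described in the thread; skipping blank or malformed lines is an added assumption, and the temporary file only stands in for NewTotalTag.txt.

```python
import os
import tempfile

def load_dictionary(path):
    """Read 'word tag' lines into a dict; the with block closes the file."""
    dict_word = {}
    with open(path, "r") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:      # ignore blank or malformed lines
                word, tag = parts
                dict_word[word] = tag
    return dict_word

# Demonstration with a temporary file holding hypothetical contents.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("w1 tag1\nw2 tag2\n\nw3 tag3\n")
dict_word = load_dictionary(tmp.name)
os.remove(tmp.name)
print(dict_word.get("w2", "NA"))
```

The `with` statement is the context manager mentioned above: the file is closed as soon as the block ends, even if an exception is raised while reading.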
--
Matt Wheeler
http://funkyh.at
| From | subhabangalore@gmail.com |
|---|---|
| Date | 2016-03-10 12:14 -0800 |
| Message-ID | <bcd40141-b482-4fb9-8346-f8a12f881924@googlegroups.com> |
| In reply to | #104541 |
Dear Matt,
I want the output in the format w1/tag1... You may find my detailed problem statement in
my reply to someone else's query. If you like, I can write it out again for you.
Regards,
Subhabrata
| From | Joaquin Alzola <Joaquin.Alzola@lebara.com> |
|---|---|
| Date | 2016-03-10 19:12 +0000 |
| Message-ID | <mailman.148.1457638025.15725.python-list@python.org> |
| In reply to | #104536 |
SQL doesn't allow decimal numbers for LIMIT.
Use decimal numbers it still work but is the proper way.
Then clean up your code a bit and remove the commented-out lines (#).
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2016-03-10 19:56 +0000 |
| Message-ID | <mailman.152.1457639827.15725.python-list@python.org> |
| In reply to | #104381 |
On 09/03/2016 04:18, subhabangalore@gmail.com wrote:
> Dear Group,
>
> I am trying to write a code for pulling data from MySQL at the backend and annotating words and trying to put the results as separated sentences with each line. The code is generally running fine but I am feeling it may be better in the end of giving out sentences, and for small data sets it is okay but with 50,000 news articles it is performing dead slow. I am using Python2.7.11 on Windows 7 with 8GB RAM.
>
> I am trying to copy the code here, for your kind review.
>
> cur = db.cursor()
> dict_open=open("/python27/NewTotalTag.txt","r") #OPENING THE DICTIONARY FILE
As you've had and acknowledged some sound answers, I'll simply point out
that many people find the first line above, with just that little bit of
whitespace, far easier to read than the second.
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence