From: Joaquin Alzola
Newsgroups: comp.lang.python
Subject: RE: Review Request of Python Code
Date: Thu, 10 Mar 2016 19:12:37 +0000

SQL doesn't allow decimal numbers for LIMIT.
Use whole numbers instead: that may not fix the problem by itself, but it is the proper way.

Then clean up your code a bit and remove the commented-out lines (#).

-----Original Message-----
From: Python-list [mailto:python-list-bounces+joaquin.alzola=lebara.com@python.org] On Behalf Of subhabangalore@gmail.com
Sent: 10 March 2016 18:12
To: python-list@python.org
Subject: Re: Review Request of Python Code

On Wednesday, March 9, 2016 at 9:49:17 AM UTC+5:30, subhaba...@gmail.com wrote:
> Dear Group,
>
> I am trying to write a code for pulling data from MySQL at the backend and annotating words and trying to put the results as separated sentences with each line. The code is generally running fine but I am feeling it may be better in the end of giving out sentences, and for small data sets it is okay but with 50,000 news articles it is performing dead slow. I am using Python2.7.11 on Windows 7 with 8GB RAM.
>
> I am trying to copy the code here, for your kind review.
>
> import MySQLdb
> import nltk
> def sql_connect_NewTest1():
>     db = MySQLdb.connect(host="localhost",
>                          user="*****",
>                          passwd="*****",
>                          db="abcd_efgh")
>     cur = db.cursor()
>     #cur.execute("SELECT * FROM newsinput limit 0,50000;") #REPORTING RUNTIME ERROR
>     cur.execute("SELECT * FROM newsinput limit 0,50;")
>     dict_open=open("/python27/NewTotalTag.txt","r") #OPENING THE DICTIONARY FILE
>     dict_read=dict_open.read()
>     dict_word=dict_read.split()
>     a4=dict_word #Assignment for code.
>     list1=[]
>     flist1=[]
>     nlist=[]
>     for row in cur.fetchall():
>         #print row[2]
>         var1=row[3]
>         #print var1 #Printing lines
>         #var2=len(var1) # Length of file
>         var3=var1.split(".") #SPLITTING INTO LINES
>         #print var3 #Printing The Lines
>         #list1.append(var1)
>         var4=len(var3) #Number of all lines
>         #print "No",var4
>         for line in var3:
>             #print line
>             #flist1.append(line)
>             linew=line.split()
>             for word in linew:
>                 if word in a4:
>                     windex=a4.index(word)
>                     windex1=windex+1
>                     word1=a4[windex1]
>                     word2=word+"/"+word1
>                     nlist.append(word2)
>                     #print list1
>                     #print nlist
>                 elif word not in a4:
>                     word3=word+"/"+"NA"
>                     nlist.append(word3)
>                     #print list1
>                     #print nlist
>                 else:
>                     print "None"
>
>     #print "###",flist1
>     #print len(flist1)
>     #db.close()
>     #print nlist
>     lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)] #TRYING TO SPLIT THE RESULTS AS SENTENCES
>     nlist1=lol(nlist,7)
>     #print nlist1
>     for i in nlist1:
>         string1=" ".join(i)
>         print i
>         #print string1
>
>
> Thanks in Advance.

****************************************************************************

Dear Group,

Thank you all, for your kind time and all suggestions in helping me.

Thank you Steve for writing the whole code. It is working full and fine. But speed is still an issue. We need to speed up.

Inada I tried to change to
cur = db.cursor(MySQLdb.cursors.SSCursor)
but my System Admin said that may not be an issue.

Freidrich, my problem is I have a big text repository of .txt files in MySQL in the backend. I have another list of words with their possible tags. The tags are not conventional Parts of Speech(PoS) tags, and bit defined by others.
The code is expected to read each file and its each line.
On reading each line it will scan the list for appropriate tag, if it is found it would assign, else would assign NA.
The assignment should be in the format of /tag, so that if there is a string of n words, it should look like,
w1/tag w2/tag w3/tag w4/tag ....wn/tag,

where tag may be tag in the list or NA as per the situation.

This format is taken because the files are expected to be tagged in Brown Corpus format. There is a Python Library named NLTK.
If I want to save my data for use with their models, I need some specifications. I want to use it as Tagged Corpus format.

Now the tagged data coming out in this format, should be one tagged sentences in each new line or a lattice.

They expect the data to be saved in .pos format but presently I am not doing in this code, I may do that later.

Please let me know if I need to give any more information.

Matt, thank you for if...else suggestion, the data of NewTotalTag.txt is like a simple list of words with unconventional tags, like,

w1 tag1 w2 tag2 w3 tag3 ... ... w3 tag3

like that.

Regards,
Subhabrata
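
The tagging step described in the message (look each word up in the word/tag list, emit word/tag, otherwise word/NA, one tagged sentence per line) might be sketched as below. This is only a sketch under assumptions, not the code from the thread: it assumes NewTotalTag.txt is a whitespace-separated stream of alternating words and tags, as the message describes; the names build_tag_dict and tag_text and the toy data are hypothetical; and it is written for Python 3, while the quoted code targets Python 2.7. It swaps the list scans (`word in a4`, `a4.index(word)`), whose cost grows with the size of the word list for every word tagged, for a dictionary lookup. One behavioural difference to be aware of: the list version also matches a word that happens to be a tag, while the dictionary only has words as keys.

```python
def build_tag_dict(tokens):
    """Map each word to its tag, given alternating tokens [w1, tag1, w2, tag2, ...]."""
    tag_dict = {}
    # Pair even-indexed tokens (words) with odd-indexed tokens (tags).
    for word, tag in zip(tokens[0::2], tokens[1::2]):
        # Keep the first tag seen for a word, mirroring what list.index() would find.
        tag_dict.setdefault(word, tag)
    return tag_dict


def tag_text(text, tag_dict):
    """Return one 'w1/tag w2/tag ...' string per '.'-separated sentence."""
    tagged_lines = []
    for sentence in text.split("."):
        words = sentence.split()
        if not words:
            continue  # skip empty pieces, e.g. after a trailing "."
        # Unknown words get the placeholder tag "NA".
        tagged = [word + "/" + tag_dict.get(word, "NA") for word in words]
        tagged_lines.append(" ".join(tagged))
    return tagged_lines


# Toy data standing in for NewTotalTag.txt and one database row (invented for illustration).
tokens = "Delhi LOC rain WEATHER".split()
tag_dict = build_tag_dict(tokens)
for line in tag_text("Delhi saw heavy rain. Roads were closed.", tag_dict):
    print(line)
```

Unlike the quoted code, which cuts the flat result into chunks of seven tokens, this sketch keeps the sentence boundaries found by splitting on ".", so each output line is one tagged sentence. In the real program the tokens would come from reading and splitting the dictionary file once, before looping over the database rows.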