Groups > comp.lang.python > #53086 > unrolled thread

Improving the web page download code.

Started by	mukesh tiwari <mukeshtiwari.iiitm@gmail.com>
First post	2013-08-27 12:41 -0700
Last post	2013-08-28 08:58 +0000
Articles	7 — 3 participants


  Improving the web page download code. mukesh tiwari <mukeshtiwari.iiitm@gmail.com> - 2013-08-27 12:41 -0700
    Re: Improving the web page download code. MRAB <python@mrabarnett.plus.com> - 2013-08-27 21:19 +0100
      Re: Improving the web page download code. mukesh tiwari <mukeshtiwari.iiitm@gmail.com> - 2013-08-27 13:53 -0700
        Re: Improving the web page download code. MRAB <python@mrabarnett.plus.com> - 2013-08-27 23:33 +0100
          Re: Improving the web page download code. mukesh tiwari <mukeshtiwari.iiitm@gmail.com> - 2013-08-27 23:23 -0700
            Re: Improving the web page download code. MRAB <python@mrabarnett.plus.com> - 2013-08-28 16:12 +0100
    Re: Improving the web page download code. Alister <alister.ware@ntlworld.com> - 2013-08-28 08:58 +0000

#53086 — Improving the web page download code.

From	mukesh tiwari <mukeshtiwari.iiitm@gmail.com>
Date	2013-08-27 12:41 -0700
Subject	Improving the web page download code.
Message-ID	<ff1a229a-affa-4d6f-aeab-55762c48a160@googlegroups.com>

Hello All,
I am doing web programming in Python for the first time, so I am looking for suggestions. I wrote this code to download the titles of web pages using as few resources (server time, data downloaded) as possible while still being fast. Initially I used BeautifulSoup for parsing, but the person who is going to use this code asked me to use regular expressions instead (the reason given was that BeautifulSoup is not fast enough?). Also, initially I was downloading the whole page, but I finally restricted the read to the first 30000 characters, which is enough to get the title of almost all pages. Right now I can see only two shortcomings of this code. First, when I kill the program with SIGINT (Ctrl-C), it dies instantly; I could modify the code to process all the elements in the queue before exiting. Second, there is one I/O call per iteration in the download URL function (maybe I could use an async I/O call, but I am not sure). I don't have much web programming experience, so I am looking for suggestions to make the code more robust. top-1m.csv is the file downloaded from Alexa[1]. Suggestions for writing more idiomatic Python code are also welcome.

-Mukesh Tiwari

[1]http://www.alexa.com/topsites. 


import urllib2, os, socket, Queue, thread, signal, sys, re


class Downloader():

	def __init__( self ):
		self.q = Queue.Queue( 200 )
		self.count = 0 
	


	def downloadurl( self ) :
		#open a file in append mode and write the result ( Improvement think of writing in chunks ) 
		with open('titleoutput.dat', 'a+' ) as file :	
			while True :
				try :
					url = self.q.get( )
					data = urllib2.urlopen ( url , data = None , timeout = 10 ).read( 30000 )
					regex = re.compile('<title.*>(.*?)</title>' , re.IGNORECASE)
					#Read data line by line and as soon you find the title go out of loop. 
					#title = None
					#for r in data:
					#	if not r :
					#		raise StopIteration
					#	else: 
					#		title = regex.search( r )
					#		if title is not None: break

					title = regex.search( data )
					result =  ', '.join ( [ url , title.group(1) ] )
					#data.close()
					file.write(''.join( [ result , '\n' ] ) )
				except urllib2.HTTPError as e:
					print ''.join ( [ url, '  ', str ( e ) ] )
				except urllib2.URLError as e:
					print ''.join ( [ url, '  ', str ( e ) ] )
				except Exception as e :
					print ''.join ( [ url, '  ', str( e )  ] )
			#With block python calls file.close() automatically.		
				

	def createurl ( self ) :

		#check if file exist. If not then create one with default value of 0 bytes read.
		if os.path.exists('bytesread.dat'):
			#use a with block so the file is closed after reading the saved count
			with open ( 'bytesread.dat','r') as f :
				self.count = int ( f.readline() )
		else:
			with open('bytesread.dat','w') as f :
				f.write('0\n')

		#Reading data in chunks is fast but we can miss some sites due to reading the data in chunks( It's worth missing because reading is very fast)
		with open('top-1m.csv', 'r') as file:
			prefix = ''
			file.seek(  self.count * 1024 )
			#you will land into the middle of bytes so discard upto newline
			if ( self.count ): file.readline()	
			for lines in iter ( lambda : file.read( 1024 ) , ''):
				l = lines.split('\n')
				n = len ( l )
				l[0] = ''.join( [ prefix , l[0] ] )
				for i in xrange ( n - 1 ) : self.q.put ( ''.join ( [ 'http://www.', l[i].split(',')[1] ] ) )
				prefix = l[n-1]
				self.count += 1

			
	#do graceful exit from here.
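	def handleexception ( self , signum , frame) :
		#self.count is the number of 1024 byte chunks read so far, not a byte count.
		with open('bytesread.dat', 'w') as file:
			print ''.join ( [ 'Number of 1024-byte chunks read ( probably unfinished ) ' , str ( self.count ) ] )
			file.write ( ''.join ( [ str ( self.count ) , '\n' ] ) )
		#The with block has already closed the file at this point.
		sys.exit(0)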

if __name__== '__main__':
	u = Downloader()
	signal.signal( signal.SIGINT , u.handleexception)
	thread.start_new_thread ( u.createurl , () )
	for i in xrange ( 5 ) :
		thread.start_new_thread ( u.downloadurl , () )
	while True : pass


#53089

From	MRAB <python@mrabarnett.plus.com>
Date	2013-08-27 21:19 +0100
Message-ID	<mailman.281.1377634802.19984.python-list@python.org>
In reply to	#53086

On 27/08/2013 20:41, mukesh tiwari wrote:
[snip]
> if __name__== '__main__':
> 	u = Downloader()
> 	signal.signal( signal.SIGINT , u.handleexception)
> 	thread.start_new_thread ( u.createurl , () )
> 	for i in xrange ( 5 ) :
> 		thread.start_new_thread ( u.downloadurl , () )
> 	while True : pass
> 			
>
My preferred method when working with background threads is to put a 
sentinel such as None at the end and then when a worker gets an item 
from the queue and sees that it's the sentinel, it puts it back in the 
queue for the other workers to see, and then returns (terminates). The 
main thread can then call each worker thread's .join method to wait for 
it to finish. You currently have the main thread running in a 'busy 
loop', consuming processing time doing nothing!
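
An illustrative sketch of the sentinel approach described above. It is written for Python 3 (so `queue` rather than the Python 2 `Queue` module used in the posted code), and the names `producer`, `worker` and `run`, as well as the upper-casing step that stands in for the download, are hypothetical and not taken from the original program:

```python
import queue
import threading

SENTINEL = None  # assumption: None never occurs as a real work item


def producer(q, items):
    """Put every item on the queue, then one sentinel at the end."""
    for item in items:
        q.put(item)
    q.put(SENTINEL)


def worker(q, results):
    """Process items until the sentinel shows up, then pass it on and exit."""
    while True:
        item = q.get()
        if item is SENTINEL:
            q.put(SENTINEL)  # put it back so the other workers see it too
            return
        results.append(item.upper())  # stand-in for the real download step


def run(items, n_workers=5):
    q = queue.Queue(200)
    results = []  # list.append is atomic in CPython
    threads = [threading.Thread(target=producer, args=(q, items))]
    threads += [threading.Thread(target=worker, args=(q, results))
                for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks without spinning, unlike `while True: pass`
    return results
```

Because the sentinel is the last thing the producer enqueues, putting it back never blocks on the bounded queue: the worker has just removed it, so there is always room for it.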


#53097

From	mukesh tiwari <mukeshtiwari.iiitm@gmail.com>
Date	2013-08-27 13:53 -0700
Message-ID	<3fff4758-65af-47ae-ab8f-d591679809b7@googlegroups.com>
In reply to	#53089

On Wednesday, 28 August 2013 01:49:59 UTC+5:30, MRAB  wrote:
> On 27/08/2013 20:41, mukesh tiwari wrote:
[snip]
> My preferred method when working with background threads is to put a
> sentinel such as None at the end and then when a worker gets an item
> from the queue and sees that it's the sentinel, it puts it back in the
> queue for the other workers to see, and then returns (terminates). The
> main thread can then call each worker thread's .join method to wait for
> it to finish. You currently have the main thread running in a 'busy
> loop', consuming processing time doing nothing!

Hi MRAB,
Thank you for the reply. I wrote this while loop only because there is no thread.join in the thread[1] library, but I get your point: I am simply running a while loop that does nothing. So if I can somehow block the main thread without too much computation, that will be great.

-Mukesh Tiwari

[1] http://docs.python.org/2/library/thread.html#module-thread


#53100

From	MRAB <python@mrabarnett.plus.com>
Date	2013-08-27 23:33 +0100
Message-ID	<mailman.286.1377642783.19984.python-list@python.org>
In reply to	#53097

On 27/08/2013 21:53, mukesh tiwari wrote:
> On Wednesday, 28 August 2013 01:49:59 UTC+5:30, MRAB  wrote:
>> On 27/08/2013 20:41, mukesh tiwari wrote:
>>
[snip]
 >> > if __name__== '__main__':
 >> > 	u = Downloader()
 >> > 	signal.signal( signal.SIGINT , u.handleexception)
 >> > 	thread.start_new_thread ( u.createurl , () )
 >> > 	for i in xrange ( 5 ) :
 >> > 		thread.start_new_thread ( u.downloadurl , () )
 >> > 	while True : pass
 >> > 			
 >> >
 >> My preferred method when working with background threads is to put a
 >> sentinel such as None at the end and then when a worker gets an item
 >> from the queue and sees that it's the sentinel, it puts it back in
 >> the queue for the other workers to see, and then returns
 >> (terminates). The main thread can then call each worker thread's
 >> .join method to wait for it to finish. You currently have the main
 >> thread running in a 'busy loop', consuming processing time doing
 >> nothing!
 >
 > Hi MRAB,
 > Thank you for the reply. I wrote this while loop only because of
 > there is no thread.join in thread[1] library but I got your point. I
 > am simply running a while loop for doing nothing. So if somehow I can
 > block the main without too much computation then it will great.
 >
Why don't you use the 'threading' module instead?


creator = threading.Thread(target=u.createurl)

workers = []
for i in xrange(5):
	workers.append(threading.Thread(target=u.downloadurl))

creator.start()

for w in workers:
	w.start()

creator.join()

for w in workers:
	w.join()


#53111

From	mukesh tiwari <mukeshtiwari.iiitm@gmail.com>
Date	2013-08-27 23:23 -0700
Message-ID	<b0c108d9-e75a-44d8-85bd-eed4d0adcc76@googlegroups.com>
In reply to	#53100

On Wednesday, 28 August 2013 04:03:15 UTC+5:30, MRAB  wrote:
> On 27/08/2013 21:53, mukesh tiwari wrote:
[snip]
> Why don't you use the 'threading' module instead?
>
> creator = threading.Thread(target=u.createurl)
>
> workers = []
> for i in xrange(5):
> 	workers.append(threading.Thread(target=u.downloadurl))
>
> creator.start()
>
> for w in workers:
> 	w.start()
>
> creator.join()
>
> for w in workers:
> 	w.join()

Hi MRAB,
Initially I blocked the main thread using raw_input('') and it was working fine.

u = Downloader()
signal.signal( signal.SIGINT , u.handleexception)
thread.start_new_thread ( u.createurl , () )
for i in xrange ( 5 ) :
            thread.start_new_thread ( u.downloadurl , () )
#This is for blocking main
raw_input('')
When I pressed Ctrl-C it responded fine, but now, after switching to the threading module, I am not able to kill my program using SIGINT (Ctrl-C). Any idea how to deliver SIGINT to the threads?

Here is the changed code; I still have to catch SIGINT.
        u = Downloader()
        signal.signal( signal.SIGINT , u.handleexception)
        urlcreator = threading.Thread ( target = u.createurl )
        
        workers = []
        for i in xrange ( 5 ):
                workers.append ( threading.Thread( target = u.downloadurl ) )

        urlcreator.start()
        for w in workers:
                w.start()
        
        urlcreator.join()
        for w in workers:
                w.join()

-Mukesh Tiwari


#53163

From	MRAB <python@mrabarnett.plus.com>
Date	2013-08-28 16:12 +0100
Message-ID	<mailman.312.1377702753.19984.python-list@python.org>
In reply to	#53111

On 28/08/2013 07:23, mukesh tiwari wrote:
[snip]
> Initially I blocked the main using raw_input('') and it was working fine.
>
> u = Downloader()
> signal.signal( signal.SIGINT , u.handleexception)
> thread.start_new_thread ( u.createurl , () )
> for i in xrange ( 5 ) :
>              thread.start_new_thread ( u.downloadurl , () )
> #This is for blocking main
> raw_input('')
> When I pressed  ctrl-c then it's responding fine but now after switching to threading module, I am not able to kill my program using SIGINT ( ctrl-c ). Any idea how to signal SIGINT to threads ?
>
Try making them daemon threads. A daemon thread is one that will be 
killed when the main thread terminates.

> Now the changed code and I have to catch the SIGINT.
>          u = Downloader()
>          signal.signal( signal.SIGINT , u.handleexception)
>          urlcreator = threading.Thread ( target = u.createurl )
>
>          workers = []
>          for i in xrange ( 5 ):
>                  workers.append ( threading.Thread( target = u.downloadurl ) )
>
            urlcreator.daemon = True
>          urlcreator.start()

>          for w in workers:
                    w.daemon = True
>                  w.start()
>
>          urlcreator.join()
>          for w in workers:
>                  w.join()
>
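
A hedged sketch of how the daemon-thread suggestion might be combined with a main thread that stays responsive to Ctrl-C. It assumes Python 3; a plain join() with no timeout may keep the main thread from reacting to Ctrl-C on some Python versions (Python 2 in particular), so the sketch joins with a timeout in a loop. The `stop` event, the `wait_for` helper and the `steps` parameter are hypothetical and exist only to make the example self-contained:

```python
import threading
import time


def worker(stop, steps, finished):
    """Do `steps` units of work, checking the stop flag between units."""
    for _ in range(steps):
        if stop.is_set():
            return  # asked to shut down early
        time.sleep(0.001)  # stand-in for one download
    finished.append(threading.current_thread().name)


def wait_for(threads, poll=0.1):
    """Join with a timeout so the main thread keeps waking up."""
    while any(t.is_alive() for t in threads):
        for t in threads:
            t.join(timeout=poll)


def run(n_workers=5, steps=3):
    stop = threading.Event()
    finished = []
    workers = [
        threading.Thread(target=worker, args=(stop, steps, finished),
                         daemon=True)
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    try:
        wait_for(workers)
    except KeyboardInterrupt:
        stop.set()         # ask the workers to stop after the current unit
        wait_for(workers)  # let them finish the unit in progress
    return len(finished)
```

Marking the workers as daemon threads means they do not keep the process alive if the main thread exits anyway; the event gives them a chance to stop cleanly first.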


#53118

From	Alister <alister.ware@ntlworld.com>
Date	2013-08-28 08:58 +0000
Message-ID	<NYiTt.13568$rT5.8150@fx01.am4>
In reply to	#53086

On Tue, 27 Aug 2013 12:41:10 -0700, mukesh tiwari wrote:

> Hello All,
> I am doing web stuff first time in python so I am looking for
> suggestions. I wrote this code to download the title of webpages using
> as much less resource ( server time, data download)  as possible and
> should be fast enough. Initially I used BeautifulSoup for parsing but
> the person who is going to use this code asked me not to use this and
> use regular expressions ( The reason was BeautifulSoup is not fast
> enough ? ).

By the time you have written enough RE to reliably parse HTML (I am not
sure that that is even strictly possible) you will have re-invented
BeautifulSoup, badly. Unless you are looking for a very explicit section
of data in the page, this is not a good idea.
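
To make the trade-off concrete, here is a small sketch for the one "very explicit section" in question, the page title. It assumes Python 3 and uses only the standard library; the function and class names are hypothetical. The regex variant tightens the pattern from the original post (`[^>]*` instead of a greedy `.*`, `re.DOTALL` so that a title spanning several lines still matches, and a check for a missing title instead of calling `.group(1)` on `None`), while the second variant lets `html.parser` do the tokenizing:

```python
import re
from html.parser import HTMLParser

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_by_regex(html):
    """Return the title text, or None if no <title> element is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.title = "".join(self.parts).strip()


def title_by_parser(html):
    """Return the title text, or None if no complete <title> was seen."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text, e.g. from a truncated page
    return parser.title
```

Both functions return `None` when the title is missing, which can happen when only the first part of a page is read.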

