| From | Piet van Oostrum <piet@vanoostrum.org> |
|---|---|
| Newsgroups | comp.lang.python |
| Subject | Re: Python script help |
| Date | 2013-08-23 22:37 -0400 |
| Message-ID | <m2ioyv3jc5.fsf@cochabamba.vanoostrum.org> (permalink) |
| References | <4566d0e7-2576-4d09-83f5-fca3b370710a@googlegroups.com> <e47a83a9-14cc-4596-b17c-d38c5f300151@googlegroups.com> |
cool1574@gmail.com writes:
> Here are some scripts, how do I put them together to create the script
> I want? (to search a online document and download all the links in it)
> p.s: can I set a destination folder for the downloads?
You can use os.chdir to go to the desired folder.
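As a rough illustration of the os.chdir suggestion, the sketch below creates the destination folder if it is missing and then switches into it. The helper name enter_download_dir is hypothetical, and the code assumes Python 3.2 or later (for exist_ok); it is not part of the original script.

```python
import os


def enter_download_dir(path):
    """Create `path` if needed, make it the working directory,
    and return the absolute path of the new working directory."""
    # exist_ok=True avoids an error when the folder is already there.
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    return os.getcwd()
```

After a call such as enter_download_dir('OUT'), files opened with a relative name end up in that folder.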
>
> urllib.urlopen("http://....")
>
> possible_urls = re.findall(r'\S+:\S+', text)
>
> import urllib2
> response = urllib2.urlopen('http://www.example.com/')
> html = response.read()
If you insist on not using wget, here is a simple script with
BeautifulSoup (v4):
########################################################################
from bs4 import BeautifulSoup
from urllib2 import urlopen
from urlparse import urljoin
import os
import re

os.chdir('OUT')

def generate_filename(url):
    url = re.sub('^[a-zA-Z0-9+.-]+:/*', '', url)
    return url.replace('/', '_')

URL = "http://www.example.com/"
soup = BeautifulSoup(urlopen(URL).read())
links = soup.select('a[href]')

for link in links:
    url = urljoin(URL, link['href'])
    print url
    html = urlopen(url).read()
    fn = generate_filename(url)
    with open(fn, 'wb') as outfile:
        outfile.write(html)
########################################################################
You should add a more intelligent filename generator, filter out mailto:
URLs (and possibly other schemes), and add exception handling for HTTP errors.
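
One possible shape for those three improvements is sketched below. It is only an illustration under stated assumptions: it uses the Python 3 standard library names (urllib.request, urllib.parse, urllib.error) rather than the Python 2 modules in the script above, and the helper names is_downloadable and fetch, the ALLOWED_SCHEMES set, and the 30-second timeout are all hypothetical choices.

```python
import re
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit
from urllib.request import urlopen

# Assumption: only plain web links are worth downloading.
ALLOWED_SCHEMES = {"http", "https"}


def is_downloadable(url):
    """Return True for http/https URLs; mailto:, javascript: etc. are rejected."""
    return urlsplit(url).scheme.lower() in ALLOWED_SCHEMES


def generate_filename(url):
    """Build a flat file name from host, path and query string."""
    parts = urlsplit(url)
    name = parts.netloc + parts.path
    if parts.query:
        name += "_" + parts.query
    # Collapse every run of characters outside a conservative set into "_".
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return name or "index"


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=30) as response:
            return response.read()
    except (HTTPError, URLError) as exc:
        # Only HTTP and URL errors are handled here; other failures propagate.
        print("skipping %s: %s" % (url, exc))
        return None
```

In the loop above, one would skip any url for which is_downloadable(url) is false and write the file only when fetch(url) returns something other than None. Distinct URLs can still map to the same file name, so this does not fully solve the naming problem.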
--
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]