Date: Fri, 22 Feb 2013 12:05:30 -0500
From: Dave Angel
To: python-list@python.org
Subject: Re: Urllib's urlopen and urlretrieve
Newsgroups: comp.lang.python

On 02/22/2013 12:09 AM, qoresucks@gmail.com wrote:
> Initially I was just trying the html, but later when I attempted more complicated sites that weren't my own I noticed that large bulks of the site were lost in the process. The urllib code essentially looks like what I was trying but it didn't work as I had expected.
>
> To be more specific, after I got it working for my own little page, I attempted to take it further and get all the lessons from Learn Python The Hard Way. When I tried the same method on the first intro page to see if I was even getting it right, the html code was all there but upon opening it I noticed the format was all wrong, colors were off for the background, images, etc... were all missing.

So how are you opening this html? In a text editor that somehow added colors? Or were you opening it in a browser?

In order for a browser to render a non-trivial page, it may need lots of files other than the html. Colors for example can be specified inline, in the header, or in an external css file.
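
To see which other files a given page depends on, one option is to scan the downloaded html for tags that point at external files. What follows is only a minimal sketch, assuming Python 3 and the standard html.parser module; the names ResourceCollector, list_resources and SAMPLE_HTML are made up for illustration, and only link, img and script tags are checked (a real page can pull in files in other ways, such as url(...) inside css).

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Hypothetical helper: collect URLs of external files a page refers to."""

    # tag name -> attribute that holds the URL of the external file
    URL_ATTRS = {"link": "href", "img": "src", "script": "src"}

    def __init__(self):
        super().__init__()
        self.resources = []  # list of (tag, url) pairs in document order

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append((tag, value))


# Made-up page used only for this illustration.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="css/site.css">
<script src="http://example.com/js/app.js"></script>
</head><body style="background: #fff">
<img src="images/logo.png">
</body></html>
"""


def list_resources(html_text):
    """Return the (tag, url) pairs found in html_text."""
    collector = ResourceCollector()
    collector.feed(html_text)
    return collector.resources


if __name__ == "__main__":
    for tag, url in list_resources(SAMPLE_HTML):
        print(tag, url)
```

Each entry in the result is a file the browser will ask for in addition to the html itself; the inline style on the body tag needs no extra file, which is why it does not show up.
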
If the page was designed to use the external css, and it's missing or not in the right location, then the browser is going to get the colors wrong.

Further, if the location (url) is relative, then you can create a similar directory structure, and the browser will find it. But if it's absolute, then the browser is going to try to go out to the web to fetch it. If it succeeds, then it's masking the fact that you haven't downloaded the "whole web site."

The same is true for other external refs. It may be impossible to host it elsewhere if there are any absolute urls.

-- 
DaveA
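
The relative versus absolute distinction above can be checked mechanically. This is a rough sketch, assuming Python 3's urllib.parse; the function name classify and the example.com / example.org urls are invented for illustration. A reference is treated as absolute when it carries its own scheme or host, and urljoin shows where the reference actually points when the page is read from its original address.

```python
from urllib.parse import urljoin, urlparse


def classify(page_url, ref):
    """Hypothetical helper: return (kind, resolved_url) for one reference.

    kind is "absolute" if the reference names its own scheme or host,
    otherwise "relative" (resolved against the page it appeared in).
    """
    parts = urlparse(ref)
    kind = "absolute" if (parts.scheme or parts.netloc) else "relative"
    return kind, urljoin(page_url, ref)


if __name__ == "__main__":
    # Made-up page address and references, for illustration only.
    page = "http://example.com/book/intro.html"
    refs = [
        "css/site.css",
        "../img/a.png",
        "http://other.example.org/s.css",
        "//cdn.example.com/x.css",
    ]
    for ref in refs:
        kind, resolved = classify(page, ref)
        print(kind, ref, "->", resolved)
```

The references reported as relative are the ones a local copy can satisfy, provided each file is saved under the same relative path next to the saved html (for instance with urllib.request.urlretrieve in Python 3, after creating the directories). The ones reported as absolute will still be fetched from the web by the browser unless the html is edited.
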