Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!feeder.news-service.com!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Fri, 09 Sep 2011 12:09:58 +1000
From: Simon Cropper <simoncropper@fossworkflowguides.com>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.21) Gecko/20110831 Lightning/1.0b2 Thunderbird/3.1.13
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: Create an index from a webpage [RANT, DNFTT]
References: <mailman.874.1315484806.27778.python-list@python.org>	<1537032.qVoOGUtdWV@PointedEars.de>	<4e68db21$0$30002$c3e8da3$5496439d@news.astraweb.com>	<mailman.886.1315525252.27778.python-list@python.org> <op.v1imgsvpa8ncjz@gnudebst>
In-Reply-To: <op.v1imgsvpa8ncjz@gnudebst>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.891.1315534207.27778.python-list@python.org>
Lines: 56
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: x330-a1.tempe.blueboxinc.net comp.lang.python:12990

On 09/09/11 10:32, Rhodri James wrote:
> On Fri, 09 Sep 2011 00:40:42 +0100, Simon Cropper
>
> Ahem. You should expect a certain amount of ribbing after admitting that
> your Google-fu is weak. So is mine, but hey.

I did not admit anything. I consider my ability to find things quite good, 
actually. Others assumed that my "Google-fu is weak".

>
>> 4. If someone is willing to help me, rather than lecture me (or poke
>> me to see if they get a response), I would appreciate it.
>
> The Google Python Sitemap Generator
> (http://www.smart-it-consulting.com/article.htm?node=166&page=128,
> fourth offering when you google "map a website with Python") looks like
> a promising start. At least it produces something in XML -- filtering
> that and turning it into HTML should be fairly straightforward.
>

I saw this in my original search. My conclusions were:

1. The last update was in 2005. That is 6 years ago. In that time we 
have had numerous upgrades to HTML, Logs, etc.
2. The script expects to run on the webserver. I don't have the ability 
to run python on my webserver.
3. There are also a number of dead links and redirects to Google 
Webmaster Central / Tools, which then request that you submit a sitemap 
(as I alluded to, we get into a confusing, circular cross-referencing 
situation).
4. The ultimate product - if you can get the package to work - would be 
an XML file you would need to massage to extract what you needed.

To me this seems like overkill.

I assume you could import the parent HTML file, scrape all the links on 
the same domain, dump these to a hierarchical list and represent this in 
HTML using BeautifulSoup or something similar. Certainly doable, but 
considering the sheer commonality of this task I don't understand why a 
simple script does not already exist - hence my original request for 
assistance.
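
As a rough sketch of the approach described above, something along these 
lines might do. It is only an illustration under a few assumptions: it 
uses the standard library's html.parser rather than BeautifulSoup, it 
takes the page's HTML as a string instead of fetching it, and the names 
LinkCollector, same_domain_links, build_tree and render are made up for 
this example.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_html, base_url):
    """Return sorted absolute URLs that live on the same host as base_url."""
    parser = LinkCollector()
    parser.feed(page_html)
    domain = urlparse(base_url).netloc
    links = set()
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc == domain:
            links.add(absolute)
    return sorted(links)


def build_tree(links):
    """Turn URL paths into nested dicts, one level per path segment."""
    tree = {}
    for link in links:
        node = tree
        for part in urlparse(link).path.strip("/").split("/"):
            if part:
                node = node.setdefault(part, {})
    return tree


def render(tree, prefix):
    """Render the nested dicts as nested <ul> lists of links."""
    if not tree:
        return ""
    items = []
    for name in sorted(tree):
        # Directory-level URLs are guesses; they may not exist on the site.
        url = prefix + "/" + name
        items.append('<li><a href="%s">%s</a>%s</li>'
                     % (escape(url), escape(name), render(tree[name], url)))
    return "<ul>" + "".join(items) + "</ul>"
```

To use it you would read the parent page yourself (for instance with 
urllib.request.urlopen), pass the decoded text and the page's URL to 
same_domain_links, and feed the result through build_tree and render. 
This only indexes links found on one page; following those links to 
crawl the whole site would need a loop and a set of visited URLs on top.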

It would appear from the feedback so far that this 'forum' is not the most 
appropriate place to ask this question. Consequently, I will take your 
advice and keep looking... and if I don't find something within a 
reasonable time frame, I will just write something myself.

-- 
Cheers Simon

    Simon Cropper - Open Content Creator / Website Administrator

    Free and Open Source Software Workflow Guides
    ------------------------------------------------------------
    Introduction               http://www.fossworkflowguides.com
    GIS Packages               http://gis.fossworkflowguides.com
    bash / Python        http://scripting.fossworkflowguides.com