
Re: Create an index from a webpage [RANT, DNFTT]

Date 2011-09-09 12:09 +1000
From Simon Cropper <simoncropper@fossworkflowguides.com>
Subject Re: Create an index from a webpage [RANT, DNFTT]
References <mailman.874.1315484806.27778.python-list@python.org> <1537032.qVoOGUtdWV@PointedEars.de> <4e68db21$0$30002$c3e8da3$5496439d@news.astraweb.com> <mailman.886.1315525252.27778.python-list@python.org> <op.v1imgsvpa8ncjz@gnudebst>
Newsgroups comp.lang.python
Message-ID <mailman.891.1315534207.27778.python-list@python.org> (permalink)



On 09/09/11 10:32, Rhodri James wrote:
> On Fri, 09 Sep 2011 00:40:42 +0100, Simon Cropper
>
> Ahem. You should expect a certain amount of ribbing after admitting that
> your Google-fu is weak. So is mine, but hey.

I did not admit anything. I consider my ability to find things quite
good, actually. Others assumed that my "Google-fu is weak".

>
>> 4. If someone is willing to help me, rather than lecture me (or poke
>> me to see if they get a response), I would appreciate it.
>
> The Google Python Sitemap Generator
> (http://www.smart-it-consulting.com/article.htm?node=166&page=128,
> fourth offering when you google "map a website with Python") looks like
> a promising start. At least it produces something in XML -- filtering
> that and turning it into HTML should be fairly straightforward.
>

I saw this in my original search. My conclusions were:

1. The last update was in 2005. That is 6 years ago. In that time we 
have had numerous upgrades to HTML, Logs, etc.
2. The script expects to run on the webserver. I don't have the ability 
to run python on my webserver.
3. There are also a number of dead links and redirects to Google
Webmaster Central / Tools, which then request that you submit a sitemap
(as I alluded to, we get into a circular, confusing cross-referencing
situation).
4. The ultimate product - if you can get the package to work - would be
an XML file you would need to massage to extract what you needed.

To me this seems like overkill.

I assume you could import the parent HTML file, scrape all the links on
the same domain, dump these to a hierarchical list and represent this in
HTML using BeautifulSoup or something similar. Certainly doable, but
considering the sheer commonality of this task I don't understand why a
simple script does not already exist - hence my original request for
assistance.
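
A minimal sketch of the approach described above, offered as an illustration rather than a finished tool. It uses the standard library's html.parser in place of BeautifulSoup, and it produces a flat list rather than a hierarchical one. The names LinkCollector, same_domain_links and render_index, the sample markup and the example.com URLs are all assumptions made for the example.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect [absolute URL, link text] pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # completed links, in document order
        self._current = None   # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._current = [urljoin(self.base_url, href), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def same_domain_links(html_text, base_url):
    """Return sorted, de-duplicated (url, text) pairs on base_url's host."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    host = urlparse(base_url).netloc
    seen = {}
    for url, text in collector.links:
        if urlparse(url).netloc == host and url not in seen:
            seen[url] = text.strip() or url
    return sorted(seen.items())


def render_index(links):
    """Render the links as a flat HTML unordered list."""
    items = [
        '  <li><a href="{}">{}</a></li>'.format(escape(url, quote=True), escape(text))
        for url, text in links
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Hypothetical page content, used here instead of fetching a real site.
sample = """
<a href="/guides/gis.html">GIS guide</a>
<a href="about.html">About</a>
<a href="http://elsewhere.example.org/">External</a>
<a href="/guides/gis.html">GIS guide again</a>
"""
index = same_domain_links(sample, "http://www.example.com/index.html")
print(render_index(index))
```

To run this against a live page, the sample string could be replaced with text fetched by urllib.request.urlopen. Grouping the URLs by path segment would be one way to turn the flat list into a hierarchical one.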

It would appear from the feedback so far that this 'forum' is not the
most appropriate place to ask this question. Consequently, I will take
your advice and keep looking... and if I don't find something within a
reasonable time frame, just write something myself.

-- 
Cheers Simon

    Simon Cropper - Open Content Creator / Website Administrator

    Free and Open Source Software Workflow Guides
    ------------------------------------------------------------
    Introduction               http://www.fossworkflowguides.com
    GIS Packages               http://gis.fossworkflowguides.com
    bash / Python        http://scripting.fossworkflowguides.com



