
Create an index from a webpage

Started by	Simon Cropper <simoncropper@fossworkflowguides.com>
First post	2011-09-08 22:26 +1000
Last post	2011-09-09 13:46 +1000
Articles	13 — 6 participants


  Create an index from a webpage Simon Cropper <simoncropper@fossworkflowguides.com> - 2011-09-08 22:26 +1000
    Re: Create an index from a webpage Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-09-08 14:38 +0200
      Re: Create an index from a webpage Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-09-09 01:11 +1000
        Re: Create an index from a webpage [RANT, DNFTT] Simon Cropper <simoncropper@fossworkflowguides.com> - 2011-09-09 09:40 +1000
          Re: Create an index from a webpage [RANT, DNFTT] "Rhodri James" <rhodri@wildebst.demon.co.uk> - 2011-09-09 01:32 +0100
            Re: Create an index from a webpage [RANT, DNFTT] Simon Cropper <simoncropper@fossworkflowguides.com> - 2011-09-09 12:09 +1000
              Re: Create an index from a webpage [RANT, DNFTT] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-09-09 12:16 +1000
              Re: Create an index from a webpage [RANT, DNFTT] Duncan Booth <duncan.booth@invalid.invalid> - 2011-09-09 10:29 +0000
          Re: Create an index from a webpage [RANT, DNFTT] Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-09-09 12:14 +1000
            Re: Create an index from a webpage [RANT, DNFTT] Simon Cropper <simoncropper@fossworkflowguides.com> - 2011-09-09 12:43 +1000
            Re: Create an index from a webpage [RANT, DNFTT] Chris Angelico <rosuav@gmail.com> - 2011-09-09 12:59 +1000
            Re: Create an index from a webpage [RANT, DNFTT] Simon Cropper <simoncropper@fossworkflowguides.com> - 2011-09-09 13:20 +1000
            Re: Create an index from a webpage [RANT, DNFTT] Chris Angelico <rosuav@gmail.com> - 2011-09-09 13:46 +1000

#12958 — Create an index from a webpage

From	Simon Cropper <simoncropper@fossworkflowguides.com>
Date	2011-09-08 22:26 +1000
Subject	Create an index from a webpage
Message-ID	<mailman.874.1315484806.27778.python-list@python.org>

Hi,

I am getting dizzy on google.

I am after a way of pointing a python routine to my website and have it 
create a tree, represented as a hierarchical HTML list in a webpage, of 
all the pages in that website (recursive list of internal links to HTML 
documents; ignore images, etc.).

It is essentially a contents page or sitemap for the site.

Interestingly, despite trying quite a few keyword combinations, I was 
unable to find such a script.

Anyone have any ideas?

-- 
Cheers Simon

    Simon Cropper - Open Content Creator / Website Administrator

    Free and Open Source Software Workflow Guides
    ------------------------------------------------------------
    Introduction               http://www.fossworkflowguides.com
    GIS Packages               http://gis.fossworkflowguides.com
    bash / Python        http://scripting.fossworkflowguides.com


#12959

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2011-09-08 14:38 +0200
Message-ID	<1537032.qVoOGUtdWV@PointedEars.de>
In reply to	#12958

Simon Cropper wrote:

> I am after a way of pointing a python routine to my website and have it
> create a tree, represented as a hierarchical HTML list in a webpage, of
> all the pages in that website (recursive list of internal links to HTML
> documents; ignore images, etc.).
> 
> It is essentially a contents page or sitemap for the site.

<http://lmgtfy.com/?q=python+sitemap>

If all else fails, use markup parsers like 

- <http://www.crummy.com/software/BeautifulSoup/>
- <http://lxml.de/>

and write it yourself.  It is not hard to do.

-- 
PointedEars

Bitte keine Kopien per E-Mail. / Please do not Cc: me.


#12967

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-09-09 01:11 +1000
Message-ID	<4e68db21$0$30002$c3e8da3$5496439d@news.astraweb.com>
In reply to	#12959

Thomas 'PointedEars' Lahn wrote:

> <http://lmgtfy.com/?q=python+sitemap>

[climbs up on the soapbox and begins rant]

Please don't use lmgtfy. The joke, such as it is, stopped being funny about
three years ago. It's just annoying, and besides, it doesn't even work
without Javascript. Kids today have no respect, get off my lawn, grump
grump grump...

It's no harder to put the search terms into a google URL, which still gets
the point across without being a dick about it:

www.google.com/search?q=python+sitemap

[ends rant, climbs back down off soapbox]

Or better still, use a search engine that doesn't track and bubble your
searches:

https://duckduckgo.com/html/?q=python+sitemap

You can even LMDDGTFY if you insist.

http://lmddgtfy.com/


Completely-undermining-my-own-message-ly y'rs, 


-- 
Steven


#12981 — Re: Create an index from a webpage [RANT, DNFTT]

From	Simon Cropper <simoncropper@fossworkflowguides.com>
Date	2011-09-09 09:40 +1000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<mailman.886.1315525252.27778.python-list@python.org>
In reply to	#12967

On 09/09/11 01:11, Steven D'Aprano wrote:
> [SNIP]
> It's no harder to put the search terms into a google URL, which still gets
> the point across without being a dick about it:
> [SNIP]

[RANT]

OK I was not going to say anything but...

1. Being told to google-it when I explicitly stated in my initial post 
that I had been doing this and had not been able to find anything is 
just plain rude. It is unconstructive and irritating.

2. I presume that python-list is a mail list for python users - 
beginners, intermediate and advanced. If it is not then tell me and I 
will go somewhere else.

3. Some searches, particularly for common terms, throw millions of hits. 
'Python' returns 147,000,000 results on google, 'Sitemap' returns 
1,410,000,000 results. Even 'Python AND Sitemap' still returns 5,020 
results. Working through these links takes you round and round with no 
clear solutions. Asking for help on the primary python mail list -- 
after conducting a preliminary investigation for tools, libraries, code 
snippets -- seemed legitimate.

4. AND YES, I could write a program but why recreate code when there is 
a strong likelihood that code already exists? One of the advantages of 
python is that a lot of code is redistributed under licences that 
promote reuse. So why reinvent the wheel when there is a library full of 
code? Sometimes you just need help finding the door.

5. If someone is willing to help me, rather than lecture me (or poke me 
to see if they get a response), I would appreciate it.

[END RANT]

For people who are willing to help, my original request was...

I am after a way of pointing a python routine to my website and have it
create a tree, represented as a hierarchical HTML list in a webpage, of
all the pages in that website (recursive list of internal links to HTML
documents; ignore images, etc.).

In subsequent notes to Thomas 'PointedEars'...

I pointed to an example of the desired output here 
http://lxml.de/sitemap.html

-- 
Cheers Simon



#12984 — Re: Create an index from a webpage [RANT, DNFTT]

From	"Rhodri James" <rhodri@wildebst.demon.co.uk>
Date	2011-09-09 01:32 +0100
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<op.v1imgsvpa8ncjz@gnudebst>
In reply to	#12981

On Fri, 09 Sep 2011 00:40:42 +0100, Simon Cropper  
<simoncropper@fossworkflowguides.com> wrote:

> On 09/09/11 01:11, Steven D'Aprano wrote:
>> [SNIP]
>> It's no harder to put the search terms into a google URL, which still  
>> gets
>> the point across without being a dick about it:
>  > [SNIP]
>
> [RANT]
>
> OK I was not going to say anything but...

Ahem.  You should expect a certain amount of ribbing after admitting that  
your Google-fu is weak.  So is mine, but hey.

> 5. If someone is willing to help me, rather than lecture me (or poke me  
> to see if they get a response), I would appreciate it.

The Google Python Sitemap Generator  
(http://www.smart-it-consulting.com/article.htm?node=166&page=128, fourth  
offering when you google "map a website with Python") looks like a  
promising start.  At least it produces something in XML -- filtering that  
and turning it into HTML should be fairly straightforward.
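
The "filter the XML and turn it into HTML" step can be sketched with the standard library alone. This is a minimal illustration, not the generator's own code: it assumes the sitemap lists each page in a `<loc>` element, and the function name `sitemap_xml_to_html` and the `SAMPLE` document are made up for the example.

```python
import xml.etree.ElementTree as ET
from html import escape


def sitemap_xml_to_html(xml_text):
    """Return an HTML <ul> linking to every <loc> URL in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for elem in root.iter():
        # Compare local names only, so any sitemap namespace version matches.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and elem.text:
            urls.append(elem.text.strip())
    items = "\n".join(
        '  <li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in urls
    )
    return "<ul>\n" + items + "\n</ul>"


# Hypothetical sample input, in the sitemaps.org 0.9 format.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/guides/intro.html</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(sitemap_xml_to_html(SAMPLE))
```

The result is a flat list; grouping the URLs by path to get a hierarchy would be a further step.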

-- 
Rhodri James *-* Wildebeest Herder to the Masses


#12990 — Re: Create an index from a webpage [RANT, DNFTT]

From	Simon Cropper <simoncropper@fossworkflowguides.com>
Date	2011-09-09 12:09 +1000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<mailman.891.1315534207.27778.python-list@python.org>
In reply to	#12984

On 09/09/11 10:32, Rhodri James wrote:
> On Fri, 09 Sep 2011 00:40:42 +0100, Simon Cropper
>
> Ahem. You should expect a certain amount of ribbing after admitting that
> your Google-fu is weak. So is mine, but hey.

I did not admit anything. I consider my ability to find things quite good 
actually. Others assumed that my "Google-fu is weak".

>
>> 5. If someone is willing to help me, rather than lecture me (or poke
>> me to see if they get a response), I would appreciate it.
>
> The Google Python Sitemap Generator
> (http://www.smart-it-consulting.com/article.htm?node=166&page=128,
> fourth offering when you google "map a website with Python") looks like
> a promising start. At least it produces something in XML -- filtering
> that and turning it into HTML should be fairly straightforward.
>

I saw this in my original search. My conclusions were:

1. The last update was in 2005. That is 6 years ago. In that time we 
have had numerous upgrades to HTML, Logs, etc.
2. The script expects to run on the webserver. I don't have the ability 
to run python on my webserver.
3. There are also a number of dead-links and redirects to Google 
Webmaster Central / Tools, which then request you submit a sitemap (as I 
alluded, we get into a circular, confusing cross-referencing situation).
4. The ultimate product - if you can get the package to work - would be 
an XML file you would need to massage to extract what you needed.

To me this seems like overkill.

I assume you could import the parent html file, scrape all the links on 
the same domain, dump these to a hierarchical list and represent this in 
HTML using BeautifulSoup or something similar. Certainly doable but 
considering the sheer commonality of this task I don't understand why a 
simple script does not already exist - hence my original request for 
assistance.
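
A rough sketch of that approach, using only the standard library rather than BeautifulSoup. Everything here is an assumption made for illustration: the names `LinkCollector`, `fetch_page`, `crawl` and `render`, the rule that a page is listed under the page where it was first found, and the test that a link "looks like HTML" if its path ends in `.html`, `.htm` or `/`.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_page(url):
    """Download a page as text (assumes UTF-8; no error handling)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=200):
    """Return {page: [internal pages first discovered on it]}."""
    host = urlsplit(start_url).netloc
    tree, queue, seen = {}, [start_url], {start_url}
    while queue and len(tree) < max_pages:
        url = queue.pop(0)
        parser = LinkCollector()
        parser.feed(fetch(url))
        children = []
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(url, href)).url
            parts = urlsplit(link)
            if parts.netloc != host:
                continue  # external link
            path = parts.path.lower()
            if path and not path.endswith((".html", ".htm", "/")):
                continue  # image, archive, etc.
            if link not in seen:
                seen.add(link)
                queue.append(link)
                children.append(link)
        tree[url] = children
    return tree


def render(tree, url, depth=0):
    """Render a page and its discovered children as nested list items."""
    pad = "  " * depth
    link = '<a href="{0}">{0}</a>'.format(escape(url, quote=True))
    children = tree.get(url, [])
    if not children:
        return pad + "<li>" + link + "</li>"
    inner = "\n".join(render(tree, child, depth + 2) for child in children)
    return "\n".join(
        [pad + "<li>" + link, pad + "  <ul>", inner, pad + "  </ul>", pad + "</li>"]
    )
```

Something like `print("<ul>\n" + render(crawl(start), start) + "\n</ul>")` would then emit the list. The `fetch` parameter exists so the crawl can be exercised against canned pages without touching the network.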

It would appear from the feedback so far this 'forum' is not the most 
appropriate to ask this question. Consequently, I will take your advice 
and keep looking... and if I don't find something within a reasonable 
time frame, just write something myself.

-- 
Cheers Simon



#12992 — Re: Create an index from a webpage [RANT, DNFTT]

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-09-09 12:16 +1000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<4e6976ec$0$29987$c3e8da3$5496439d@news.astraweb.com>
In reply to	#12990

Simon Cropper wrote:

> Certainly doable but
> considering the sheer commonality of this task I don't understand why a
> simple script does not already exist

Perhaps it isn't as common or as simple as you believe.



-- 
Steven


#13016 — Re: Create an index from a webpage [RANT, DNFTT]

From	Duncan Booth <duncan.booth@invalid.invalid>
Date	2011-09-09 10:29 +0000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<Xns9F5B73D529CD0duncanbooth@127.0.0.1>
In reply to	#12990

Simon Cropper <simoncropper@fossworkflowguides.com> wrote:

> Certainly doable but 
> considering the sheer commonality of this task I don't understand why a 
> simple script does not already exist - hence my original request for 
> assistance.

I think you may have underestimated the complexity of the task in general.

To do it for a remote website you need to specify what you consider to be a 
unique page. Here are some questions:

Is case significant for URLs (technically it always is, but IIS sites tend 
to ignore it and to contain links with random permutations of case)?

Are there any query parameters that make two pages distinct? Or any 
parameters that you should ignore? Is the order of parameters significant? 
I recently came across a site that not only had multiple links to identical 
pages with the query parameters in different order but also used a non-
standard % to separate parameters instead of &: it's not so easy getting 
crawlers to handle that mess.
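
One way to make those decisions explicit is a canonicalisation function that every URL passes through before it is compared with the pages already seen. The sketch below is only an illustration: `canonical_url`, `ignore_params` and `fold_path_case` are hypothetical names, and it assumes the standard `&` separator, so it would not untangle the `%`-separated site mentioned above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, ignore_params=(), fold_path_case=False):
    """Return a comparison key for url under the chosen uniqueness rules."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if fold_path_case:
        # Only appropriate for servers that ignore case (e.g. many IIS sites).
        path = path.lower()
    # Sort parameters so that ?a=1&b=2 and ?b=2&a=1 compare equal,
    # and drop the ones declared irrelevant (session ids and the like).
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignore_params
    )
    # Host names are case-insensitive; the fragment never reaches the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )
```

Two links are then treated as the same page exactly when their canonical forms are equal, which keeps the policy in one place instead of scattered through the crawler.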

Even after ignoring query parameters, is there a finite number of pages on 
the site?
For example, Apache has a spelling correction module that can effectively 
allow any number of spurious subfolders: I've seen a site where 
"/folder1/index.html" had a link to "folder2/index.html" and 
"/folder2/index.html" linked to "folder1/index.html". Apache helpfully 
accepted /folder2/folder1/ as equivalent to /folder1/ and therefore by 
extension also accepted /folder2/folder1/folder2/folder1/...
Zope is also good at creating infinite folder structures.
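
A crude guard against that kind of runaway structure is to refuse URLs whose paths are suspiciously deep or repeat a segment. This is a heuristic sketch only; `looks_like_path_loop` and its thresholds are invented for the example, and a legitimate site that reuses a folder name at two levels would be wrongly rejected.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_path_loop(url, max_depth=8, max_repeats=1):
    """Guess whether url belongs to an endlessly nesting folder structure."""
    segments = [seg for seg in urlsplit(url).path.split("/") if seg]
    if len(segments) > max_depth:
        return True
    # /folder2/folder1/folder2/folder1/ repeats both folder names.
    counts = Counter(segments)
    return any(count > max_repeats for count in counts.values())
```

A crawler would call this before queueing a link and skip anything that returns `True`.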

If you want to spider a remote site then there are plenty of off the shelf 
spidering packages, e.g. httrack. They have a lot of configuration options 
to try to handle the above gotchas.

Your case is probably a lot simpler, but that's just a few reasons why it 
isn't actually a trivial task. Building a list by scanning a bunch of 
folders with html files is comparatively easy which is why that is almost 
always the preferred solution if possible.
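
For comparison, the folder-scanning version fits in a few lines. This is a sketch under the assumption that the site is a local directory tree of `.html`/`.htm` files and that the folder layout is the hierarchy wanted; `directory_index` is a made-up name.

```python
import os
from html import escape


def directory_index(root, rel=""):
    """Return nested <ul> markup for the HTML files under root, or ""."""
    here = os.path.join(root, rel)
    items = []
    for name in sorted(os.listdir(here)):
        rel_path = os.path.join(rel, name) if rel else name
        if os.path.isdir(os.path.join(here, name)):
            sub = directory_index(root, rel_path)
            if sub:  # skip folders that contain no HTML files
                items.append("<li>{}\n{}\n</li>".format(escape(name), sub))
        elif name.lower().endswith((".html", ".htm")):
            # Links are relative to root and always use forward slashes.
            href = rel_path.replace(os.sep, "/")
            items.append(
                '<li><a href="{}">{}</a></li>'.format(
                    escape(href, quote=True), escape(name)
                )
            )
    if not items:
        return ""
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

None of the uniqueness questions above arise here, because the file system already defines what counts as a distinct page.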

-- 
Duncan Booth http://kupuguy.blogspot.com


#12991 — Re: Create an index from a webpage [RANT, DNFTT]

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-09-09 12:14 +1000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<4e69769f$0$29987$c3e8da3$5496439d@news.astraweb.com>
In reply to	#12981

Simon Cropper wrote:

> 1. Being told to google-it when I explicitly stated in my initial post
> that I had been doing this and had not been able to find anything is
> just plain rude. It is unconstructive and irritating.

Why, so you did. Even though I wasn't the one who told you to google it, I'll
apologise too because I was thinking the same thing. Sorry about that.

> 3. Some searches, particularly for common terms throw millions of hits.
> 'Python' returns 147,000,000 results on google, 'Sitemap' returns
> 1,410,000,000 results. Even 'Python AND Sitemap' still returns 5,020
> results. 

How about "python generate a site map"? The very first link on DuckDuckGo is
this:

http://www.conversationmarketing.com/2010/08/python-sitemap-crawler-1.htm

Despite the domain, there is actual Python code on the page. Unfortunately
it looks like crappy code with broken formatting and a mix of <\br> tags,
but it's a start.

Searching for "site map" on PyPI returns a page full of hits:

http://pypi.python.org/pypi?%3Aaction=search&term=site+map&submit=search

Most of them seem to rely on a framework like Django etc, but you might find
something useful.

> 4. AND YES, I could write a program but why recreate code when there is
> a strong likelihood that code already exists.

"Strong" likelihood? Given how hard it is to find an appropriate sitemap
generator written in Python, I'd say the likelihood that one that meets
your needs and is publicly available under an appropriate licence exists
is vanishingly small.

If you do decide to write your own, please consider uploading it to PyPI
under a FOSS licence.

-- 
Steven


#12993 — Re: Create an index from a webpage [RANT, DNFTT]

From	Simon Cropper <simoncropper@fossworkflowguides.com>
Date	2011-09-09 12:43 +1000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<mailman.892.1315536245.27778.python-list@python.org>
In reply to	#12991

On 09/09/11 12:14, Steven D'Aprano wrote:
> If you do decide to write your own, please consider uploading it to PyPI
> under a FOSS licence.

At present I am definitely getting the impression that my assumption 
that something like this 'must be out there', is wrong.

I am following people's links and suggestions (as well as my own; I have 
spent 1-2 hours looking) but have not found anything that is able to be 
used with only minor adjustments.

I have found an XML-Sitemaps Generator at http://www.xml-sitemaps.com;
this page allows you to create the XML files that can be uploaded to 
google. But as stated, I don't actually want what people now call 
'sitemaps'; I want an automatically updated 'index / contents page' for my 
website. For example, if I add a tutorial or update any of my links I 
want the 'global contents page' to be updated when the python script is run.

I am now considering how I might address this requirement. If I create a 
python script I will post it on PyPI. As with all my work it will be 
released under the GPLv3 licence.

Thanks for your help.

-- 
Cheers Simon



#12995 — Re: Create an index from a webpage [RANT, DNFTT]

From	Chris Angelico <rosuav@gmail.com>
Date	2011-09-09 12:59 +1000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<mailman.893.1315537151.27778.python-list@python.org>
In reply to	#12991

On Fri, Sep 9, 2011 at 12:43 PM, Simon Cropper
<simoncropper@fossworkflowguides.com> wrote:
> At present I am definitely getting the impression that my assumption that
> something like this 'must be out there', is wrong.
>
> I have found an XML-Sitemaps Generator at http://www.xml-sitemaps.com;
> this page allows you to create the XML files that can be uploaded to google.
> But as stated, I don't actually want what people now call 'sitemaps'; I want an
> automatically updated 'index / contents page' for my website. For example, if
> I add a tutorial or update any of my links I want the 'global contents page'
> to be updated when the python script is run.

What you're looking at may be closer to autogenerated documentation
than to a classic site map. There are a variety of tools that generate
HTML pages on the basis of *certain information found in* all the
files in a directory (as opposed to the entire content of those
files). What you're trying to do may be sufficiently specific that it
doesn't already exist, but it might be worth having a quick look at
autodoc/doxygen - at least for some ideas.
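
As an illustration of indexing on "certain information found in" each file, the sketch below pulls only the `<title>` out of a page so that a contents list can show titles instead of file names. It uses the standard library's `html.parser`; `TitleParser`, `page_title` and the `"(untitled)"` fallback are names chosen for the example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Accumulate the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html_text, default="(untitled)"):
    """Return the page's title with whitespace collapsed, or default."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()  # flush any text the parser is still buffering
    title = " ".join(parser.title.split())
    return title or default
```

The same pattern extends to other per-page details, such as a meta description or the first heading.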

Chris Angelico


#12998 — Re: Create an index from a webpage [RANT, DNFTT]

From	Simon Cropper <simoncropper@fossworkflowguides.com>
Date	2011-09-09 13:20 +1000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<mailman.895.1315538412.27778.python-list@python.org>
In reply to	#12991

On 09/09/11 12:59, Chris Angelico wrote:
> On Fri, Sep 9, 2011 at 12:43 PM, Simon Cropper
> <simoncropper@fossworkflowguides.com>  wrote:
>> At present I am definitely getting the impression that my assumption that
>> something like this 'must be out there', is wrong.
>>
>> I have found an XML-Sitemaps Generator at http://www.xml-sitemaps.com;
>> this page allows you to create the XML files that can be uploaded to google.
>> But as stated, I don't actually want what people now call 'sitemaps'; I want an
>> automatically updated 'index / contents page' for my website. For example, if
>> I add a tutorial or update any of my links I want the 'global contents page'
>> to be updated when the python script is run.
>
> What you're looking at may be closer to autogenerated documentation
> than to a classic site map. There are a variety of tools that generate
> HTML pages on the basis of *certain information found in* all the
> files in a directory (as opposed to the entire content of those
> files). What you're trying to do may be sufficiently specific that it
> doesn't already exist, but it might be worth having a quick look at
> autodoc/doxygen - at least for some ideas.
>
> Chris Angelico

Chris,

Your assessment is correct. Working through PyPI I am having better 
luck with using different terms than the old-term 'sitemap'.

I have found a link to funnelweb which uses the transmogrify library 
(yeah, as if I would have typed this term into google!) that is 
described as "Crawl and parse static sites and import to Plone".

http://pypi.python.org/pypi/funnelweb/1.0

As funnelweb is modular, using a variety of the transmogrify tools, 
maybe I could modify this to create a 'non-plone' version.

-- 
Cheers Simon



#12999 — Re: Create an index from a webpage [RANT, DNFTT]

From	Chris Angelico <rosuav@gmail.com>
Date	2011-09-09 13:46 +1000
Subject	Re: Create an index from a webpage [RANT, DNFTT]
Message-ID	<mailman.896.1315539988.27778.python-list@python.org>
In reply to	#12991

On Fri, Sep 9, 2011 at 1:20 PM, Simon Cropper
<simoncropper@fossworkflowguides.com> wrote:
> Chris,
>
> Your assessment is correct. Working through PyPI I am having better luck
> with using different terms than the old-term 'sitemap'.
>
> I have found a link to funnelweb which uses the transmogrify library (yeah,
> as if I would have typed this term into google!) that is described as "Crawl
> and parse static sites and import to Plone".
>

And once again, python-list has turned a rant into a useful,
informative, and productive thread :)

ChrisA

