Find relative url in mixed text/html

From	Rob Hills <rhills@medimorphosis.com.au>
Newsgroups	comp.lang.python
Subject	Find relative url in mixed text/html
Date	2015-11-28 10:35 +0800
Message-ID	<mailman.182.1448678122.20593.python-list@python.org>

Hi,

For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
through the process.

Posts to our original forum comprise a soup of plain text, HTML and
BBCodes.  A post *may* include links done as either standard HTML
links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or
sometimes just text: ( http://blah.blah.com.au or even just
www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.

My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links (
starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in.  However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
proceed with that one.  Googling so far has not been helpful, but that
might be me using the wrong search terms. 

Some examples of what I am talking about are:

    Post fragment containing an "Absolute" cross-link:

    <br />ive made a new thread:
    <br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
    <br />

    converts to:

    <br />
    <br />ive made a new thread:
    <br />/viewtopic.php?t=316&p=1958#1958

    Post fragment containing a "Relative" cross-link:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />

    Needs converting to:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
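
The conversion in the two examples above could be sketched roughly as
follows.  This is only an illustration: the function name convert_link
is made up, it assumes the query keys are always upper-case TID and PID
as in the examples, and it assumes any &amp; entities have already been
unescaped.  It emits t before p, which is equivalent to the p-then-t
order shown above.

```python
from urllib.parse import urlsplit, parse_qs, urlencode


def convert_link(url):
    """Map a WebWiz forum_posts.asp link to a phpBB viewtopic.php link.

    Hypothetical helper: any URL that does not point at forum_posts.asp
    is returned unchanged.
    """
    parts = urlsplit(url)
    # Look only at the last path segment so that absolute, root-relative
    # and bare relative forms are all recognised.
    if parts.path.lower().rsplit("/", 1)[-1] != "forum_posts.asp":
        return url
    query = parse_qs(parts.query)
    new_query = {}
    if "TID" in query:          # topic id -> t
        new_query["t"] = query["TID"][0]
    if "PID" in query:          # post id -> p
        new_query["p"] = query["PID"][0]
    new_url = "/viewtopic.php?" + urlencode(new_query)
    if parts.fragment:          # keep the #anchor as-is
        new_url += "#" + parts.fragment
    return new_url
```

Because the host and leading path are simply dropped, the same function
handles both the absolute and the relative form.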

So, my question is:  What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?
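
One possible direction, sketched here under assumptions rather than
offered as a tested solution: instead of anchoring the regex on
http://, anchor it on the page name forum_posts.asp and let whatever
precedes it (scheme, host, path, or nothing) be optional.  The names
CROSS_LINK_RE and find_cross_links are invented for this sketch, and it
assumes links are delimited by whitespace, quotes, angle brackets or
BBCode square brackets.

```python
import re

# Characters treated as link delimiters in this sketch: whitespace,
# quotes, angle brackets and BBCode square brackets.
_STOP = r"""\s"'<>\[\]"""

# Optional prefix (scheme/host/path, but no '=' so that href= and
# [url=...] are not swallowed), then forum_posts.asp, then the query
# string and any #fragment.
CROSS_LINK_RE = re.compile(
    rf"[^{_STOP}=]*forum_posts\.asp\?[^{_STOP}]*",
    re.IGNORECASE,
)


def find_cross_links(post):
    """Return every forum_posts.asp link in post, absolute or relative."""
    return CROSS_LINK_RE.findall(post)
```

A link that ends a sentence would pick up the trailing full stop, so
the results would still need checking before conversion.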

Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing led me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have.  If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...
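
For comparison, here is a minimal sketch using the standard library's
html.parser module, which Beautiful Soup can also use as its parser via
BeautifulSoup(markup, "html.parser").  It does not require well-formed
input: plain text and BBCode are passed through as character data.  The
class and function names are invented for this sketch, and it only sees
links inside <a href=...>, not BBCode or bare-text links.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag; all other content is ignored."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased from HTMLParser
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def relative_hrefs(post):
    """Return the hrefs in post that have neither a scheme nor a host."""
    collector = HrefCollector()
    collector.feed(post)
    collector.close()
    result = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        if not parts.scheme and not parts.netloc:
            result.append(href)
    return result
```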

TIA,

-- 
Rob Hills
Waikiki, Western Australia

Thread

Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-28 10:35 +0800
  Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-27 21:11 -0800
    Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 00:25 +0800
    Re: Find relative url in mixed text/html Laura Creighton <lac@openend.se> - 2015-11-28 18:04 +0100
    Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:40 +0800
      Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-28 10:10 -0800
  Re: Find relative url in mixed text/html Grobu <snailcoder@retrosite.invalid> - 2015-11-28 08:07 +0100
    Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:44 +0800
