Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Rob Hills <rhills@medimorphosis.com.au>
Newsgroups: comp.lang.python
Subject: Find relative url in mixed text/html
Date: Sat, 28 Nov 2015 10:35:16 +0800
Lines: 67
Message-ID: <mailman.182.1448678122.20593.python-list@python.org>
Reply-To: rhills@medimorphosis.com.au

Hi,

For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
through the process.

Posts to our original forum comprise a soup of plain text, HTML and
BBCodes.  A post *may* include links done as either standard HTML
links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or
sometimes just text: ( http://blah.blah.com.au or even just
www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.

My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links (
starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in.  However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
proceed with that one.  Googling so far has not been helpful, but that
might be me using the wrong search terms.
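
For reference, the "absolute link" step described above could look roughly like the sketch below. The pattern and the function name find_absolute_links are illustrative assumptions, not the code actually used in the migration; the pattern simply stops at whitespace, quotes, angle brackets and square brackets so that it copes with HTML attributes, BBCode tags and bare text alike.

```python
import re

# Assumed pattern: an http:// or https:// URL running up to the first
# whitespace, quote, angle bracket or square bracket.
ABSOLUTE_URL = re.compile(r"https?://[^\s\"'<>\[\]]+")


def find_absolute_links(post_text):
    """Return every absolute URL found in a post, in order of appearance."""
    return ABSOLUTE_URL.findall(post_text)
```

Links written as plain www.blah.blah.com.au (no scheme) are not matched by this pattern and would need a separate rule.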

Some examples of what I am talking about are:

    Post fragment containing an "Absolute" cross-link:

    <br />ive made a new thread:
    <br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
    <br />

    converts to:

    <br />
    <br />ive made a new thread:
    <br />/viewtopic.php?t=316&p=1958#1958

    Post fragment containing a "Relative" cross-link:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />

    Needs converting to:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
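
The URL mapping in the examples above (TID becomes t, PID becomes p, the #fragment is kept) can be sketched with the standard library's urllib.parse. The function name convert_cross_link is hypothetical, and the sketch assumes the query keys are always upper-case TID and PID, that the order of t and p in the new URL does not matter, and that any link ending in /forum_posts.asp belongs to this forum.

```python
from urllib.parse import urlsplit, parse_qs


def convert_cross_link(url):
    """Map a WebWiz forum_posts.asp link to a phpBB viewtopic.php link.

    Works for absolute and relative links. Anything that is not a
    forum_posts.asp link with a TID is returned unchanged.
    """
    parts = urlsplit(url)
    # Assumption: no host check is done; only the path is inspected.
    if not parts.path.lower().endswith("/forum_posts.asp"):
        return url
    query = parse_qs(parts.query)
    topic = query.get("TID")
    if not topic:
        return url
    new_url = "/viewtopic.php?t=" + topic[0]
    post = query.get("PID")
    if post:
        new_url += "&p=" + post[0]
    if parts.fragment:
        new_url += "#" + parts.fragment
    return new_url
```

Applied to the relative example, this yields /viewtopic.php?t=980&p=15479#15479, which differs from the hand-converted version only in parameter order.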

So, my question is:  What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?

Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing led me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have.  If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...
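
One possible approach to the question above, offered only as a sketch: it uses the standard library's html.parser rather than Beautiful Soup, on the assumption that relative cross-links only ever occur inside <a href="..."> attributes. HTMLParser does not require well-formed input, so plain text and unclosed tags around the links are passed over. The names HrefCollector and find_relative_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_relative_links(post_text):
    """Return hrefs that have neither a scheme nor a host, i.e. relative links."""
    collector = HrefCollector()
    collector.feed(post_text)
    collector.close()
    relative = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        if not parts.scheme and not parts.netloc:
            relative.append(href)
    return relative
```

The resulting list can then be walked to pick out the forum_posts.asp links that need converting. BBCode [url] tags and bare-text links are not seen by this parser and would still need the regex pass.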

TIA,

-- 
Rob Hills
Waikiki, Western Australia