Find relative url in mixed text/html

From Rob Hills <rhills@medimorphosis.com.au>
Newsgroups comp.lang.python
Subject Find relative url in mixed text/html
Date Sat, 28 Nov 2015 10:35:16 +0800
Lines 67
Message-ID <mailman.182.1448678122.20593.python-list@python.org> (permalink)
Reply-To rhills@medimorphosis.com.au



Hi,

For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
through the process.

Posts to our original forum comprise a soup of plain text, HTML and
BBCodes.  A post *may* include links done as either standard HTML
links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or
sometimes just text: ( http://blah.blah.com.au or even just
www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.

My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links (
starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in.  However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
handle those.  Googling so far has not been helpful, but that
might be me using the wrong search terms.

Some examples of what I am talking about are:

    Post fragment containing an "Absolute" cross-link:

    <br />ive made a new thread:
    <br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
    <br />

    converts to:

    <br />
    <br />ive made a new thread:
    <br />/viewtopic.php?t=316&p=1958#1958
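
The absolute case above can be sketched roughly as follows. This is only an illustration, not the code actually used in the migration: the pattern and the function name `convert_absolute` are hypothetical, and the host and parameter names are assumed from the single example shown.

```python
import re

# Hypothetical pattern for absolute links to the old forum's post page.
# Host, path and parameter names are assumed from the example fragment;
# "&amp;" is also accepted in case the stored posts are HTML-escaped.
OLD_ABSOLUTE = re.compile(
    r"http://www\.aeva\.asn\.au/forums/forum_posts\.asp"
    r"\?TID=(?P<tid>\d+)&(?:amp;)?PID=(?P<pid>\d+)#(?P<anchor>\d+)"
)


def convert_absolute(text):
    """Rewrite old absolute cross-links as viewtopic.php links."""
    return OLD_ABSOLUTE.sub(
        r"/viewtopic.php?t=\g<tid>&p=\g<pid>#\g<anchor>", text
    )


fragment = (
    "<br />http://www.aeva.asn.au/forums/"
    "forum_posts.asp?TID=316&PID=1958#1958"
)
print(convert_absolute(fragment))
# prints: <br />/viewtopic.php?t=316&p=1958#1958
```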

    Post fragment containing a "Relative" cross-link:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />

    Needs converting to:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
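
Once a relative href has been isolated, the mapping itself might look something like the sketch below. The function name `convert_relative` is hypothetical, and the rule that every cross-link carries both a TID and a PID parameter is an assumption based on the one example above.

```python
import html
from urllib.parse import parse_qs, urlsplit


def convert_relative(href):
    """Map an old relative post URL to the new form, else return it as-is."""
    # Undo "&amp;" escaping in case the stored href is HTML-escaped.
    parts = urlsplit(html.unescape(href))
    # A host part means the link is absolute, so it is not handled here.
    if parts.netloc or not parts.path.endswith("forum_posts.asp"):
        return href
    query = parse_qs(parts.query)
    # Assumption: a cross-link always has both TID and PID.
    if "TID" not in query or "PID" not in query:
        return href
    new = "/viewtopic.php?p={}&t={}".format(query["PID"][0], query["TID"][0])
    if parts.fragment:
        new += "#" + parts.fragment
    return new


print(convert_relative("/forum_posts.asp?TID=980&PID=15479#15479"))
# prints: /viewtopic.php?p=15479&t=980#15479
```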

So, my question is:  What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?
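
To make the question concrete, here is one possible starting point, offered as a sketch rather than a recommended answer. It assumes that relative links only ever appear as quoted href values (bare text such as www.blah.blah.com.au is not covered), and the names `HREF`, `SCHEME` and `relative_links` are hypothetical.

```python
import re

# Assumption: relative links appear only as quoted href attribute values.
HREF = re.compile(r"""href\s*=\s*(["'])(?P<url>.*?)\1""", re.IGNORECASE)
# A leading "scheme:" (http:, https:, mailto:, ...) marks a non-relative URL.
SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def relative_links(text):
    """Return href values that have no scheme and no leading '//'."""
    urls = [m.group("url") for m in HREF.finditer(text)]
    return [
        u for u in urls
        if not SCHEME.match(u) and not u.startswith("//")
    ]


post = (
    'see <a href="/forum_posts.asp?TID=980&PID=15479#15479" '
    'target="_blank">this</a> and <a href="http://example.com/">that</a> '
    "or just http://blah.blah.com.au"
)
print(relative_links(post))
# prints: ['/forum_posts.asp?TID=980&PID=15479#15479']
```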

Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing led me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have.  If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...
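
As a point of comparison, and not a statement about Beautiful Soup itself, the standard library's `html.parser` can be fed text that is not a well-formed document. The sketch below swaps that module in for Beautiful Soup. The class name `HrefCollector` and the sample string are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags; all other input is ignored."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Hypothetical sample: plain text, a BBCode link, HTML and an unclosed tag.
sample = (
    "plain text [url]http://x.example/[/url] <br />"
    '<a href="/forum_posts.asp?TID=980&PID=15479#15479">'
    "Veroboard prototype</a><br /> <b>unclosed"
)
collector = HrefCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
# prints: ['/forum_posts.asp?TID=980&PID=15479#15479']
```

BBCode links and bare text URLs are not tags, so a parser like this would not see them. They would still need separate handling.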

TIA,

-- 
Rob Hills
Waikiki, Western Australia



Thread

Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-28 10:35 +0800
  Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-27 21:11 -0800
    Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 00:25 +0800
    Re: Find relative url in mixed text/html Laura Creighton <lac@openend.se> - 2015-11-28 18:04 +0100
    Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:40 +0800
      Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-28 10:10 -0800
  Re: Find relative url in mixed text/html Grobu <snailcoder@retrosite.invalid> - 2015-11-28 08:07 +0100
    Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:44 +0800
