Find relative url in mixed text/html

Started by	Rob Hills <rhills@medimorphosis.com.au>
First post	2015-11-28 10:35 +0800
Last post	2015-11-29 01:44 +0800
Articles	8 — 4 participants

  Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-28 10:35 +0800
    Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-27 21:11 -0800
      Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 00:25 +0800
      Re: Find relative url in mixed text/html Laura Creighton <lac@openend.se> - 2015-11-28 18:04 +0100
      Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:40 +0800
        Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-28 10:10 -0800
    Re: Find relative url in mixed text/html Grobu <snailcoder@retrosite.invalid> - 2015-11-28 08:07 +0100
      Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:44 +0800

#99651 — Find relative url in mixed text/html

From	Rob Hills <rhills@medimorphosis.com.au>
Date	2015-11-28 10:35 +0800
Subject	Find relative url in mixed text/html
Message-ID	<mailman.182.1448678122.20593.python-list@python.org>

Hi,

For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
through the process.

Posts to our original forum comprise a soup of plain text, HTML and
BBCodes.  A post */may/* include links done as either standard HTML
links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or
sometimes just text: ( http://blah.blah.com.au or even just
www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.

My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links (
starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in.  However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
proceed with that one.  Googling so far has not been helpful, but that
might be me using the wrong search terms. 

Some examples of what I am talking about are:

    Post fragment containing an "Absolute" cross-link:

    <br />ive made a new thread:
    <br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
    <br />

    converts to:

    <br />
    <br />ive made a new thread:
    <br />/viewtopic.php?t=316&p=1958#1958

    Post fragment containing a "Relative" cross-link:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />

    Needs converting to:

    <font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />

So, my question is:  What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?

Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing led me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have.  If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...

TIA,

-- 
Rob Hills
Waikiki, Western Australia
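
A minimal sketch of the conversion shown in the two examples above, using
only the standard library's urllib.parse.  The function name
convert_cross_link is made up for illustration, and the mapping (TID
becomes t, PID becomes p, fragment kept) is inferred from the examples,
not from any phpBB documentation.  It assumes upper-case TID/PID and raw
"&" separators as in the fragments quoted, and it emits t before p as in
the first example.

```python
from urllib.parse import urlsplit, parse_qs


def convert_cross_link(url):
    """Return a phpBB-style path for an old forum_posts.asp link, else None.

    Hypothetical helper: works for absolute and relative forms alike,
    because only the path, query and fragment are inspected.
    """
    parts = urlsplit(url)
    if not parts.path.lower().endswith("forum_posts.asp"):
        return None  # not one of the old cross-links
    query = parse_qs(parts.query)
    topic = query.get("TID", [None])[0]
    post = query.get("PID", [None])[0]
    if topic is None:
        return None  # no topic id, nothing sensible to convert to
    new = "/viewtopic.php?t=" + topic
    if post is not None:
        new += "&p=" + post
    if parts.fragment:
        new += "#" + parts.fragment
    return new


print(convert_cross_link(
    "http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958"))
# /viewtopic.php?t=316&p=1958#1958
print(convert_cross_link("/forum_posts.asp?TID=980&PID=15479#15479"))
# /viewtopic.php?t=980&p=15479#15479
```

This only covers the rewrite of a link once it has been found; locating
the links in the mixed text/html is a separate step.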

#99652

From	Paul Rubin <no.email@nospam.invalid>
Date	2015-11-27 21:11 -0800
Message-ID	<8737vqyag1.fsf@jester.gateway.pace.com>
In reply to	#99651

Rob Hills <rhills@medimorphosis.com.au> writes:
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing led me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have.  If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...

Beautiful Soup can deal with badly formed HTML pretty well, or at least
it could in earlier versions.  It gives you several different parsing
options to choose from now.  I think the default is lxml which is fast
but maybe more strict.  Check what the others are and see if a loose
slow one is still there.  It really is pretty slow so plan on a big
computation task if you're converting a large forum.
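
As a rough illustration of the "loose" end of that range: Beautiful Soup 4
accepts a parser name as its second argument ("html.parser", "lxml" or
"html5lib"), and "html.parser" is the one built on Python's standard
library.  The sketch below uses that stdlib parser directly, without
Beautiful Soup, to pull href values out of a fragment like the ones in
the original post.  The names HrefCollector, extract_hrefs and
relative_hrefs are invented for this example, and "relative" is assumed
to mean "starts with a single slash".

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(fragment):
    """Return all <a href=...> values found in a text/html fragment."""
    collector = HrefCollector()
    collector.feed(fragment)
    collector.close()
    return collector.hrefs


def relative_hrefs(fragment):
    """Keep only site-relative links (assumption: a single leading slash)."""
    return [h for h in extract_hrefs(fragment)
            if h.startswith("/") and not h.startswith("//")]
```

Note that a parser-based approach like this only sees links inside tags.
URLs that appear as bare text or inside [url] BBCodes are just character
data to the parser and would still need separate handling.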

phpBB gets a bad rap that's maybe well-deserved but I don't know what to
suggest instead.

#99667

From	Rob Hills <rhills@medimorphosis.com.au>
Date	2015-11-29 00:25 +0800
Message-ID	<mailman.188.1448727914.20593.python-list@python.org>
In reply to	#99652

Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhills@medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing led me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have.  If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions.  It gives you several different parsing
> options to choose from now.  I think the default is lxml which is fast
> but maybe more strict.  Check what the others are and see if a loose
> slow one is still there.  It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help
me much with urls (relative or absolute) embedded within text, it seems
to do a good job of separating out links from the rest, so that could be
useful in itself.

WRT time, I'm converting about 65MB of data which currently takes 14
seconds (on a 3yo laptop with a SSD running Ubuntu), which I reckon is
pretty amazing performance for Python3, especially given my relatively
crude coding skills.  It'll be interesting to see if using Beautiful
Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard
much good said about php, but I probably move in the wrong circles. 
However, our hosting service doesn't support Python so I stopped
hunting.  Plus there is a significant group of forum members who hold
very strong opinions about the functionality they want and it took a lot
of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully
unbiased) info about phpBB's failings...

Cheers,

-- 
Rob Hills
Waikiki, Western Australia

#99670

From	Laura Creighton <lac@openend.se>
Date	2015-11-28 18:04 +0100
Message-ID	<mailman.189.1448730268.20593.python-list@python.org>
In reply to	#99652

In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>All that said, I'd be interested to see specific (and hopefully
>unbiased) info about phpBB's failings...

People I know of who run different bb software say that the spammers
really prefer phpBB.  So keeping it spam-free is about 4 times as much
work as it is for, say, IPB.

Hackers seem to like it too -- possibly due to this:
http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/

make sure you aren't vulnerable.

#99671

From	Rob Hills <rhills@medimorphosis.com.au>
Date	2015-11-29 01:40 +0800
Message-ID	<mailman.190.1448732454.20593.python-list@python.org>
In reply to	#99652

Hi Laura,

On 29/11/15 01:04, Laura Creighton wrote:
> In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>> All that said, I'd be interested to see specific (and hopefully
>> unbiased) info about phpBB's failings...
> People I know of who run different bb software say that the spammers
> really prefer phpBB.  So keeping it spam-free is about 4 times as much
> work as it is for, say, IPB.
>
> Hackers seem to like it too -- possibly due to this:
> http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/
>
> make sure you aren't vulnerable.

Thanks for the link and the advice.

Personally, I'd rather go with something based on a language I am
reasonably familiar with (eg Python or Java) however it seems the vast
bulk of Forum software is based on PHP :-(

Cheers,

-- 
Rob Hills
Waikiki, Western Australia

#99673

From	Paul Rubin <no.email@nospam.invalid>
Date	2015-11-28 10:10 -0800
Message-ID	<87a8pyc7vv.fsf@nightsong.com>
In reply to	#99671

Rob Hills <rhills@medimorphosis.com.au> writes:
> Personally, I'd rather go with something based on a language I am
> reasonably familiar with (eg Python or Java) however it seems the vast
> bulk of Forum software is based on PHP :-(

It's certainly possible to write good software in PHP, so it's mostly
a matter of the design and implementation quality.

I was on a big phpBB forum years ago and it got very slow as the
database got large, and there were multiple incidents of database
corruption.  The board eventually switched to VBB which was a lot
better.  VBB is the best one I know of but it's not FOSS.

I'm on another one right now which uses IPB (also not FOSS) and don't
like it much (too clever for its own good).

Another one is FluxBB which is nice and lightweight and FOSS, but it's a
small forum and the software might not be up to handling a bigger one.

Some people like Discourse.  I don't like it much myself, but that's
just me.

There's certainly plenty of cheap hosting available these days (or raw
VPS) that let you run Python or whatever else you want.  But it seems to
me that forum software is something of a ghetto.  I do think there is
some written in Python but I don't remember any specifics.

#99653

From	Grobu <snailcoder@retrosite.invalid>
Date	2015-11-28 08:07 +0100
Message-ID	<n3bjmq$pdi$1@dont-email.me>
In reply to	#99651

On 28/11/15 03:35, Rob Hills wrote:
> Hi,
>
> For my sins I am migrating a volunteer association forum from one
> platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
> through the process.
>
> Posts to our original forum comprise a soup of plain text, HTML and
> BBCodes.  A post */may/* include links done as either standard HTML
> links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or
> sometimes just text: ( http://blah.blah.com.au or even just
> www.blah.blah.com.au ).
>
> In my conversion process, I am trying to identify cross-links (links
> from one post on the forum to another) so I can convert them to links
> that will work in the new forum.
>
> My current code uses a Regular Expression (yes, I read the recent posts
> on this forum about regex and HTML!) to pull out "absolute" links (
> starting with http:// ) and then I use Python to identify and convert
> the specific links I am interested in.  However, the forum also contains
> "cross-links" done using relative links and I'm unsure how best to
> proceed with that one.  Googling so far has not been helpful, but that
> might be me using the wrong search terms.
>
> Some examples of what I am talking about are:
>
>      Post fragment containing an "Absolute" cross-link:
>
>      <br />ive made a new thread:
>      <br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
>      <br />
>
>      converts to:
>
>      <br />
>      <br />ive made a new thread:
>      <br />/viewtopic.php?t=316&p=1958#1958
>
>      Post fragment containing a "Relative" cross-link:
>
>      <font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
>
>      Needs converting to:
>
>      <font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
>
> So, my question is:  What is the best way to extract a list of "relative
> links" from mixed text/html that I can then walk through to identify the
> specific ones I want to convert?
>
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing led me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have.  If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...
>
> TIA,
>

Hi Rob

Is it safe to assume that all the relative (cross) links take one of the 
following forms? :

	http://www.aeva.asn.au/forums/forum_posts.asp
	www.aeva.asn.au/forums/forum_posts.asp
	/forums/forum_posts.asp
	/forum_posts.asp (are you really sure about this one?)

If so, and if your goal boils down to converting all instances of old 
style URLs to new style ones regardless of the context where they 
appear, why would a regex fail to meet your needs?
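
For illustration only, here is one way such a regex could look, written
with re.VERBOSE so each piece carries its own comment.  It assumes the
four URL forms listed above, the host www.aeva.asn.au, a TID parameter
followed by an optional PID, and an optional numeric #anchor.  OLD_LINK
and convert_old_links are hypothetical names, and the pattern has only
been checked against the two fragments from the original post.

```python
import re

OLD_LINK = re.compile(
    r"""
    (?:https?://)?                      # optional scheme
    (?:www\.aeva\.asn\.au)?             # optional host
    (?:/forums)?                        # optional /forums prefix
    /forum_posts\.asp                   # the old script name
    \?TID=(?P<topic>\d+)                # topic id
    (?:(?P<sep>&(?:amp;)?)PID=(?P<post>\d+))?   # optional post id
    (?P<anchor>\#\d+)?                  # optional fragment, e.g. #1958
    """,
    re.VERBOSE | re.IGNORECASE,
)


def _new_style(match):
    """Build the phpBB-style replacement for one matched old link."""
    new = "/viewtopic.php?t=" + match.group("topic")
    if match.group("post"):
        # reuse the original separator so "&amp;" stays "&amp;"
        new += match.group("sep") + "p=" + match.group("post")
    return new + (match.group("anchor") or "")


def convert_old_links(text):
    """Rewrite every old-style cross-link in text, whatever its context."""
    return OLD_LINK.sub(_new_style, text)
```

Because the scheme and host are optional, a forum_posts.asp link on some
other host would have only its path portion rewritten, so this relies on
such links not occurring in the data.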

#99672

From	Rob Hills <rhills@medimorphosis.com.au>
Date	2015-11-29 01:44 +0800
Message-ID	<mailman.191.1448732703.20593.python-list@python.org>
In reply to	#99653

Hi Grobu,

On 28/11/15 15:07, Grobu wrote:
> Is it safe to assume that all the relative (cross) links take one of
> the following forms? :
>
>     http://www.aeva.asn.au/forums/forum_posts.asp
>     www.aeva.asn.au/forums/forum_posts.asp
>     /forums/forum_posts.asp
>     /forum_posts.asp (are you really sure about this one?)
>
> If so, and if your goal boils down to converting all instances of old
> style URLs to new style ones regardless of the context where they
> appear, why would a regex fail to meet your needs?

I'm actually not discounting anything and as I mentioned, I've already
used some regex to extract the properly-formed URLs (those starting with
http://).  I was fortunately able to find some example regex that I
could figure out enough to tweak for my purpose.  Unfortunately, my
small brain hurts whenever I try and understand what a piece of regex is
doing and I don't like having bits in my code that hurt my brain. 

BTW, that's not meant to be an invitation to someone to produce some
regex for me.  If I can't find any other way of doing it, I'll try and
create my own regex and come back here if I can't get that working.

Cheers,

-- 
Rob Hills
Waikiki, Western Australia
