| Started by | Rob Hills <rhills@medimorphosis.com.au> |
|---|---|
| First post | 2015-11-28 10:35 +0800 |
| Last post | 2015-11-29 01:44 +0800 |
| Articles | 8 — 4 participants |
Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-28 10:35 +0800
Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-27 21:11 -0800
Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 00:25 +0800
Re: Find relative url in mixed text/html Laura Creighton <lac@openend.se> - 2015-11-28 18:04 +0100
Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:40 +0800
Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-28 10:10 -0800
Re: Find relative url in mixed text/html Grobu <snailcoder@retrosite.invalid> - 2015-11-28 08:07 +0100
Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:44 +0800
| From | Rob Hills <rhills@medimorphosis.com.au> |
|---|---|
| Date | 2015-11-28 10:35 +0800 |
| Subject | Find relative url in mixed text/html |
| Message-ID | <mailman.182.1448678122.20593.python-list@python.org> |
Hi,
For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB). I am (I hope) 95% of the way
through the process.
Posts to our original forum comprise a soup of plain text, HTML and
BBCodes. A post */may/* include links done as either standard HTML
links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or
sometimes just text: ( http://blah.blah.com.au or even just
www.blah.blah.com.au ).
In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.
My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links (
starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in. However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
proceed with that one. Googling so far has not been helpful, but that
might be me using the wrong search terms.
Some examples of what I am talking about are:
Post fragment containing an "Absolute" cross-link:
<br />ive made a new thread:
<br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
<br />
converts to:
<br />
<br />ive made a new thread:
<br />/viewtopic.php?t=316&p=1958#1958
Post fragment containing a "Relative" cross-link:
<font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
Needs converting to:
<font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
So, my question is: What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?
Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing led me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have. If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...
TIA,
--
Rob Hills
Waikiki, Western Australia
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2015-11-27 21:11 -0800 |
| Message-ID | <8737vqyag1.fsf@jester.gateway.pace.com> |
| In reply to | #99651 |
Rob Hills <rhills@medimorphosis.com.au> writes:
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing lead me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have. If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...

Beautiful Soup can deal with badly formed HTML pretty well, or at least
it could in earlier versions. It gives you several different parsing
options to choose from now. I think the default is lxml which is fast
but maybe more strict. Check what the others are and see if a loose
slow one is still there. It really is pretty slow so plan on a big
computation task if you're converting a large forum.

phpBB gets a bad rap that's maybe well-deserved but I don't know what to
suggest instead.
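
A minimal sketch of lenient link extraction, using only the standard library rather than Beautiful Soup itself. It relies on `html.parser.HTMLParser`, which does not require well-formed input (Beautiful Soup can use the same parser as a backend via `BeautifulSoup(markup, "html.parser")`). The names `HrefCollector` and `relative_hrefs` are invented for this illustration, and "relative" is assumed to mean an href starting with a single `/`.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag; unbalanced markup is tolerated."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def relative_hrefs(fragment):
    """Return hrefs that start with a single '/' (site-relative links)."""
    parser = HrefCollector()
    parser.feed(fragment)
    parser.close()
    return [h for h in parser.hrefs
            if h.startswith("/") and not h.startswith("//")]


# Fragment taken from the original post.
fragment = (
    '<font size="3"><u>Battery Management System</u></font><br />'
    '<a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" '
    'rel="nofollow">Veroboard prototype</a><br />'
)
print(relative_hrefs(fragment))
```

This only sees links inside `<a href=...>`; BBCode links and bare URLs in the text would still need separate handling.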
| From | Rob Hills <rhills@medimorphosis.com.au> |
|---|---|
| Date | 2015-11-29 00:25 +0800 |
| Message-ID | <mailman.188.1448727914.20593.python-list@python.org> |
| In reply to | #99652 |
Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhills@medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing lead me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have. If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions. It gives you several different parsing
> options to choose from now. I think the default is lxml which is fast
> but maybe more strict. Check what the others are and see if a loose
> slow one is still there. It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help
me much with urls (relative or absolute) embedded within text, it seems
to do a good job of separating out links from the rest, so that could be
useful in itself.

WRT time, I'm converting about 65MB of data which currently takes 14
seconds (on a 3yo laptop with a SSD running Ubuntu), which I reckon is
pretty amazing performance for Python3, especially given my relatively
crude coding skills. It'll be interesting to see if using Beautiful
Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard
much good said about php, but I probably move in the wrong circles.
However, our hosting service doesn't support Python so I stopped
hunting. Plus there is a significant group of forum members who hold
very strong opinions about the functionality they want and it took a lot
of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully
unbiased) info about phpBB's failings...

Cheers,
--
Rob Hills
Waikiki, Western Australia
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-11-28 18:04 +0100 |
| Message-ID | <mailman.189.1448730268.20593.python-list@python.org> |
| In reply to | #99652 |
In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>All that said, I'd be interested to see specific (and hopefully
>unbiased) info about phpBB's failings...

People I know of who run different bb software say that the spammers
really prefer phpBB. So keeping it spam free is about 4 times as much
work as for, say, IPB.

Hackers seem to like it too, possibly due to this:
http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/

Make sure you aren't vulnerable.
| From | Rob Hills <rhills@medimorphosis.com.au> |
|---|---|
| Date | 2015-11-29 01:40 +0800 |
| Message-ID | <mailman.190.1448732454.20593.python-list@python.org> |
| In reply to | #99652 |
Hi Laura,

On 29/11/15 01:04, Laura Creighton wrote:
> In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>> All that said, I'd be interested to see specific (and hopefully
>> unbiased) info about phpBB's failings...
> People I know of who run different bb software say that the spammers
> really prefer phpBB. So keeping it spam free is about 4 times the
> work as for, for instance, IPB.
>
> Hackers seem to like it too -- possibly due to this:
> http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/
>
> make sure you aren't vulnerable.

Thanks for the link and the advice. Personally, I'd rather go with
something based on a language I am reasonably familiar with (eg Python
or Java) however it seems the vast bulk of Forum software is based on
PHP :-(

Cheers,
--
Rob Hills
Waikiki, Western Australia
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2015-11-28 10:10 -0800 |
| Message-ID | <87a8pyc7vv.fsf@nightsong.com> |
| In reply to | #99671 |
Rob Hills <rhills@medimorphosis.com.au> writes:
> Personally, I'd rather go with something based on a language I am
> reasonably familiar with (eg Python or Java) however it seems the vast
> bulk of Forum software is based on PHP :-(

It's certainly possible to write good software in PHP, so it's mostly a
matter of the design and implementation quality. I was on a big phpBB
forum years ago and it got very slow as the database got large, and
there were multiple incidents of database corruption. The board
eventually switched to VBB which was a lot better. VBB is the best one
I know of but it's not FOSS. I'm on another one right now which uses
IPB (also not FOSS) and don't like it much (too clever for its own
good).

Another one is FluxBB which is nice and lightweight and FOSS, but the
forum I know it from is small and the software might not be up to
handling a bigger one.

Some people like Discourse. I don't like it much myself, but that's
just me.

There's certainly plenty of cheap hosting available these days (or raw
VPS) that lets you run Python or whatever else you want. But it seems
to me that forum software is something of a ghetto. I do think there is
some written in Python but I don't remember any specifics.
| From | Grobu <snailcoder@retrosite.invalid> |
|---|---|
| Date | 2015-11-28 08:07 +0100 |
| Message-ID | <n3bjmq$pdi$1@dont-email.me> |
| In reply to | #99651 |
On 28/11/15 03:35, Rob Hills wrote:
> Hi,
>
> For my sins I am migrating a volunteer association forum from one
> platform (WebWiz) to another (phpBB). I am (I hope) 95% of the way
> through the process.
>
> Posts to our original forum comprise a soup of plain text, HTML and
> BBCodes. A post */may/* include links done as either standard HTML
> links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or
> sometimes just text: ( http://blah.blah.com.au or even just
> www.blah.blah.com.au ).
>
> In my conversion process, I am trying to identify cross-links (links
> from one post on the forum to another) so I can convert them to links
> that will work in the new forum.
>
> My current code uses a Regular Expression (yes, I read the recent posts
> on this forum about regex and HTML!) to pull out "absolute" links (
> starting with http:// ) and then I use Python to identify and convert
> the specific links I am interested in. However, the forum also contains
> "cross-links" done using relative links and I'm unsure how best to
> proceed with that one. Googling so far has not been helpful, but that
> might be me using the wrong search terms.
>
> Some examples of what I am talking about are:
>
> Post fragment containing an "Absolute" cross-link:
>
> <br />ive made a new thread:
> <br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
> <br />
>
> converts to:
>
> <br />
> <br />ive made a new thread:
> <br />/viewtopic.php?t=316&p=1958#1958
>
> Post fragment containing a "Relative" cross-link:
>
> <font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
>
> Needs converting to:
>
> <font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
>
> So, my question is: What is the best way to extract a list of "relative
> links" from mixed text/html that I can then walk through to identify the
> specific ones I want to convert?
>
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing lead me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have. If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...
>
> TIA,
>

Hi Rob

Is it safe to assume that all the relative (cross) links take one of
the following forms?

http://www.aeva.asn.au/forums/forum_posts.asp
www.aeva.asn.au/forums/forum_posts.asp
/forums/forum_posts.asp
/forum_posts.asp (are you really sure about this one?)

If so, and if your goal boils down to converting all instances of old
style URLs to new style ones regardless of the context where they
appear, why would a regex fail to meet your needs?
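
A minimal sketch of the regex route suggested above, covering the four URL forms listed. It assumes every cross-link carries `TID` followed by `PID`, as in the two examples from the original post; links with other parameter layouts would be left untouched. The names `OLD_LINK` and `convert_links` are invented for illustration, and emitting `t` before `p` is an arbitrary choice, since the two target examples in the thread use different orders.

```python
import re

OLD_LINK = re.compile(
    r"""
    (?<![\w./:-])                           # not in the middle of another URL
    (?:(?:https?://)?www\.aeva\.asn\.au)?   # optional scheme and host
    (?:/forums)?                            # optional /forums prefix
    /forum_posts\.asp
    \?TID=(?P<tid>\d+)                      # topic id
    &(?:amp;)?PID=(?P<pid>\d+)              # post id ('&' may be HTML-escaped)
    (?P<anchor>\#\d+)?                      # optional #fragment
    """,
    re.VERBOSE | re.IGNORECASE,
)


def _new_link(match):
    """Build the phpBB-style link from one matched WebWiz link."""
    return "/viewtopic.php?t={}&p={}{}".format(
        match.group("tid"), match.group("pid"), match.group("anchor") or ""
    )


def convert_links(text):
    """Rewrite every old-style post link in text, wherever it appears."""
    return OLD_LINK.sub(_new_link, text)


print(convert_links(
    "<br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958"
))
print(convert_links('<a href="/forum_posts.asp?TID=980&PID=15479#15479">'))
```

Because the substitution ignores context, the same pattern handles HTML hrefs, BBCode `[url]` bodies and bare text links. The lookbehind is there so that a `forum_posts.asp` link on some other host is not rewritten by accident.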
| From | Rob Hills <rhills@medimorphosis.com.au> |
|---|---|
| Date | 2015-11-29 01:44 +0800 |
| Message-ID | <mailman.191.1448732703.20593.python-list@python.org> |
| In reply to | #99653 |
Hi Grobu,

On 28/11/15 15:07, Grobu wrote:
> Is it safe to assume that all the relative (cross) links take one of
> the following forms? :
>
> http://www.aeva.asn.au/forums/forum_posts.asp
> www.aeva.asn.au/forums/forum_posts.asp
> /forums/forum_posts.asp
> /forum_posts.asp (are you really sure about this one?)
>
> If so, and if your goal boils down to converting all instances of old
> style URLs to new style ones regardless of the context where they
> appear, why would a regex fail to meet your needs?

I'm actually not discounting anything and as I mentioned, I've already
used some regex to extract the properly-formed URLs (those starting with
http://). I was fortunately able to find some example regex that I
could figure out enough to tweak for my purpose. Unfortunately, my
small brain hurts whenever I try to understand what a piece of regex is
doing and I don't like having bits in my code that hurt my brain.

BTW, that's not meant to be an invitation to someone to produce some
regex for me; if I can't find any other way of doing it, I'll try to
create my own regex and come back here if I can't get that working.

Cheers,
--
Rob Hills
Waikiki, Western Australia
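
For anyone who shares the dislike of regex, one possible alternative is to let `urllib.parse` take each candidate link apart once it has been extracted. This is only a sketch under assumptions: the helper name `convert_link` and the `FORUM_HOSTS` set are made up here, and it is assumed that any link on the old host (or with no host at all) whose path ends in `/forum_posts.asp` is a cross-link.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# "" stands for a relative link with no host part (assumed to be the forum).
FORUM_HOSTS = {"", "www.aeva.asn.au"}


def convert_link(link):
    """Return the phpBB form of a WebWiz post link, or None if it is not one."""
    if link.lower().startswith("www."):
        # Without a scheme, urlsplit would treat the host as part of the path.
        link = "http://" + link
    parts = urlsplit(link)
    if parts.netloc.lower() not in FORUM_HOSTS:
        return None
    if not parts.path.lower().endswith("/forum_posts.asp"):
        return None
    # Normalise parameter names so tid/TID and pid/PID are treated alike.
    query = {key.upper(): values for key, values in parse_qs(parts.query).items()}
    if "TID" not in query:
        return None
    new_query = {"t": query["TID"][0]}
    if "PID" in query:
        new_query["p"] = query["PID"][0]
    new_link = "/viewtopic.php?" + urlencode(new_query)
    if parts.fragment:
        new_link += "#" + parts.fragment
    return new_link


print(convert_link("http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958"))
print(convert_link("/forum_posts.asp?TID=980&PID=15479#15479"))
print(convert_link("http://example.com/some/page"))
```

A function like this can be applied to each link after extraction, whether the links came from an HTML parser or from the existing `http://` regex, and anything that returns `None` is simply left alone.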