


Re: Find relative url in mixed text/html

From: Rob Hills <rhills@medimorphosis.com.au>
Newsgroups: comp.lang.python
Subject: Re: Find relative url in mixed text/html
Date: 2015-11-29 00:25 +0800
Message-ID: <mailman.188.1448727914.20593.python-list@python.org>
References: <mailman.182.1448678122.20593.python-list@python.org> <8737vqyag1.fsf@jester.gateway.pace.com>



Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhills@medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing led me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have.  If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions.  It gives you several different parsing
> options to choose from now.  I think the default is lxml which is fast
> but maybe more strict.  Check what the others are and see if a loose
> slow one is still there.  It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and, while it doesn't really help
me much with URLs (relative or absolute) embedded within text, it seems
to do a good job of separating out links from the rest, so that could be
useful in itself.
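
A minimal sketch of that separation, for illustration only. It uses the
standard library's html.parser rather than Beautiful Soup so that it runs
without third-party packages. The base URL, the sample fragment, the
regular expression and the names LinkSplitter and split_links are all
assumptions made for this example, not part of the actual conversion
script.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical base URL of the old forum (an assumption for this sketch).
BASE_URL = "http://forum.example.org/"

# Rough pattern for bare URLs in plain text: absolute http(s) URLs, or
# root-relative paths such as /faq.php. This is an assumption and will
# also swallow trailing punctuation such as a full stop.
BARE_URL = re.compile(r"(?:https?://|(?<![\w/.:])/)[^\s<>\"']+")


class LinkSplitter(HTMLParser):
    """Collect href values from <a> tags separately from the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def split_links(fragment):
    """Return (hrefs found in <a> tags, bare URLs found in the text)."""
    parser = LinkSplitter()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.text_parts)
    return parser.hrefs, BARE_URL.findall(text)


# Deliberately sloppy input: the <b> tag is never closed.
sample = ('See <a href="viewtopic.php?t=5">this thread</a> '
          'or /faq.php for details <b>unclosed')
hrefs, bare = split_links(sample)

# Resolve everything against the (hypothetical) base URL.
absolute = [urljoin(BASE_URL, url) for url in hrefs + bare]
print(absolute)
```

The links inside tags come out cleanly from the parser; the URLs sitting
in plain text still need a pattern match of some kind, which is where the
guesswork remains.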

WRT time, I'm converting about 65MB of data, which currently takes 14
seconds (on a 3-year-old laptop with an SSD running Ubuntu).  I reckon
that is pretty amazing performance for Python 3, especially given my
relatively crude coding skills.  It'll be interesting to see if using
Beautiful Soup adds significantly to that.
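
One way to check the difference would be to time the conversion step
before and after the change. A rough sketch follows, assuming the
conversion can be called as a single function; timed, convert and the
sample posts are hypothetical stand-ins, not the real script.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def convert(posts):
    # Stand-in for the real conversion step (hypothetical).
    return [post.upper() for post in posts]


posts = ["first post", "second post"]
converted, seconds = timed(convert, posts)
print(f"converted {len(converted)} posts in {seconds:.6f} s")
```

Running the same measurement with and without the extra parsing step
would show how much it adds to the 14 seconds.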

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard
much good said about PHP, but I probably move in the wrong circles.
However, our hosting service doesn't support Python, so I stopped
hunting.  Plus, there is a significant group of forum members who hold
very strong opinions about the functionality they want, and it took a lot
of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully
unbiased) info about phpBB's failings...

Cheers,

-- 
Rob Hills
Waikiki, Western Australia


