| From | Rob Hills <rhills@medimorphosis.com.au> |
|---|---|
| Newsgroups | comp.lang.python |
| Subject | Re: Find relative url in mixed text/html |
| Date | 2015-11-29 00:25 +0800 |
| Message-ID | <mailman.188.1448727914.20593.python-list@python.org> |
| References | <mailman.182.1448678122.20593.python-list@python.org> <8737vqyag1.fsf@jester.gateway.pace.com> |
Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhills@medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing led me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have. If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions. It gives you several different parsing
> options to choose from now. I think the default is lxml which is fast
> but maybe more strict. Check what the others are and see if a loose
> slow one is still there. It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help me much with URLs (relative or absolute) embedded within text, it seems to do a good job of separating out links from the rest, so that could be useful in itself.

WRT time, I'm converting about 65MB of data, which currently takes 14 seconds (on a 3yo laptop with an SSD running Ubuntu). I reckon that is pretty amazing performance for Python 3, especially given my relatively crude coding skills. It'll be interesting to see if using Beautiful Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard much good said about PHP, but I probably move in the wrong circles. However, our hosting service doesn't support Python so I stopped hunting. Plus there is a significant group of forum members who hold very strong opinions about the functionality they want and it took a lot of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully unbiased) info about phpBB's failings...
Cheers,

--
Rob Hills
Waikiki, Western Australia
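
For readers following the thread, here is a minimal sketch of the "separating out links from the rest" step mentioned above. It is not code from the post. To stay dependency-free it uses the standard library's `html.parser` rather than Beautiful Soup itself (Beautiful Soup can use the same module as its `"html.parser"` backend), and it resolves each `href` with `urllib.parse.urljoin`. The base URL, the sample text, and the names `LinkCollector` and `find_links` are assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical base URL of the old forum; not taken from the original post.
BASE_URL = "http://forum.example.com/board/"


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, ignoring the surrounding plain text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_links(text, base_url=BASE_URL):
    """Return (href, is_relative, absolute_url) for each <a href> in text."""
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    results = []
    for href in collector.hrefs:
        parsed = urlparse(href)
        # Treat a link as relative when it has neither a scheme nor a host.
        is_relative = not (parsed.scheme or parsed.netloc)
        results.append((href, is_relative, urljoin(base_url, href)))
    return results


# Made-up sample of mixed text/html, including an unclosed tag.
sample = (
    "See my earlier post <a href='viewtopic.php?t=42'>here</a> and the "
    "docs at <a href=\"http://example.org/docs\">example.org</a>.\n"
    "An unclosed <b>bold tag does not stop the parser."
)

for href, is_relative, absolute in find_links(sample):
    print(href, is_relative, absolute)
```

Like Beautiful Soup in the description above, this only sees URLs that appear in `<a href=...>` attributes; bare URLs embedded in plain text would still need a separate pass, for example with a regular expression.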
Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-28 10:35 +0800
Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-27 21:11 -0800
Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 00:25 +0800
Re: Find relative url in mixed text/html Laura Creighton <lac@openend.se> - 2015-11-28 18:04 +0100
Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:40 +0800
Re: Find relative url in mixed text/html Paul Rubin <no.email@nospam.invalid> - 2015-11-28 10:10 -0800
Re: Find relative url in mixed text/html Grobu <snailcoder@retrosite.invalid> - 2015-11-28 08:07 +0100
Re: Find relative url in mixed text/html Rob Hills <rhills@medimorphosis.com.au> - 2015-11-29 01:44 +0800