Groups > comp.lang.python > #21503 > unrolled thread
| Started by | John Nagle <nagle@animats.com> |
|---|---|
| First post | 2012-03-11 13:30 -0700 |
| Last post | 2012-03-12 09:07 -0700 |
| Articles | 6 — 4 participants |
html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-11 13:30 -0700
Re: html5lib not thread safe. Is the Python SAX library thread-safe? Cameron Simpson <cs@zip.com.au> - 2012-03-12 08:45 +1100
Re: html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-11 21:48 -0700
Re: html5lib not thread safe. Is the Python SAX library thread-safe? Paul Rubin <no.email@nospam.invalid> - 2012-03-12 02:39 -0700
Re: html5lib not thread safe. Is the Python SAX library thread-safe? Stefan Behnel <stefan_ml@behnel.de> - 2012-03-12 11:05 +0100
Re: html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-12 09:07 -0700
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2012-03-11 13:30 -0700 |
| Subject | html5lib not thread safe. Is the Python SAX library thread-safe? |
| Message-ID | <4f5d0b82$0$11967$742ec2ed@news.sonic.net> |
"html5lib" is apparently not thread safe.
(see "http://code.google.com/p/html5lib/issues/detail?id=189")
Looking at the code, I've only found about three problems.
They're all the usual "cached in a global without locking" bug.
A few locks would fix that.
But html5lib calls the XML SAX parser. Is that thread-safe?
Or is there more trouble down at the bottom?
(I run a multi-threaded web crawler, and currently use BeautifulSoup,
which is thread safe, although dated. I'm looking at converting to
html5lib.)
John Nagle
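For readers unfamiliar with the "cached in a global without locking" pattern mentioned above, here is a minimal sketch of the bug class and the lock-based fix. It is not html5lib code: `_cache`, `_cache_lock`, `build_table` and `get_table` are hypothetical names invented for illustration.

```python
import threading

# Hypothetical module-level cache, standing in for the kind of global
# the post describes; none of these names come from html5lib.
_cache = {}
_cache_lock = threading.Lock()


def build_table(key):
    """Stand-in for an expensive, deterministic computation."""
    return [key.upper()] * 3


def get_table(key):
    """Return the cached table for key, building it at most once."""
    with _cache_lock:
        # The check and the fill happen under one lock, so two threads
        # can never both see "missing" and both build an entry.
        if key not in _cache:
            _cache[key] = build_table(key)
        return _cache[key]
```

Without the lock, two threads could both find the key missing and both build and store a value; taking the lock around the whole check-and-fill closes that window at the cost of serializing cache lookups.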
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2012-03-12 08:45 +1100 |
| Message-ID | <mailman.574.1331502568.3037.python-list@python.org> |
| In reply to | #21503 |
On 11Mar2012 13:30, John Nagle <nagle@animats.com> wrote:
| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)
IIRC, BeautifulSoup4 may do that for you:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
"Beautiful Soup 4 uses html.parser by default, but you can plug in
lxml or html5lib and use that instead."
Just for interest, re locking, I wrote a little decorator the other day,
thus:
    @locked_property
    def foo(self):
        compute foo here ...
        return foo value
and am rolling its use out amongst my classes. Code:
    def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
        ''' A property whose access is controlled by a lock if unset.
        '''
        if prop_name is None:
            prop_name = '_' + func.func_name
        def getprop(self):
            ''' Attempt lockless fetch of property first.
                Use lock if property is unset.
            '''
            p = getattr(self, prop_name)
            if p is unset_object:
                with getattr(self, lock_name):
                    p = getattr(self, prop_name)
                    if p is unset_object:
                        p = func(self)
                        setattr(self, prop_name, p)
            return p
        return property(getprop)
It tries to be lockless in the common case. I suspect it is only safe in
CPython where there is a GIL. If raw python assignments and fetches can
overlap (eg Jython I think?) I probably need a shared "read" lock around
the first "p = getattr(self, prop_name)". Any remarks?
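As a sketch only, the decorator above can be exercised on current Python with one change: `func.func_name` exists only on Python 2, so `func.__name__` is used instead. The `Page` class, its `calls` counter and the `title` property are hypothetical, added purely to show that the property body runs once even when several threads read it.

```python
import threading


def locked_property(func, lock_name="_lock", prop_name=None, unset_object=None):
    """Property computed once; the lock is taken only while it is unset."""
    if prop_name is None:
        prop_name = "_" + func.__name__  # func.func_name is Python 2 only

    def getprop(self):
        # Fast path: no lock once the value has been stored.
        p = getattr(self, prop_name)
        if p is unset_object:
            with getattr(self, lock_name):
                # Re-check under the lock: another thread may have won.
                p = getattr(self, prop_name)
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p

    return property(getprop)


class Page:
    """Hypothetical class used only for illustration."""

    def __init__(self):
        self._lock = threading.Lock()
        self._title = None  # must start out equal to unset_object
        self.calls = 0

    @locked_property
    def title(self):
        self.calls += 1  # counts how often the body actually runs
        return "Example"
```

Note the two requirements the decorator places on the class: the instance must already have the backing attribute (`_title` here) set to `unset_object`, and it must carry a lock under `lock_name`. A computed value equal to `unset_object` would be recomputed on every access.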
Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/
Ed Campbell's <ed@Tekelex.Com> pointers for long trips:
1. lay out the bare minimum of stuff that you need to take with you, then
put at least half of it back.
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2012-03-11 21:48 -0700 |
| Message-ID | <4f5d8012$0$12021$742ec2ed@news.sonic.net> |
| In reply to | #21506 |
On 3/11/2012 2:45 PM, Cameron Simpson wrote:
> On 11Mar2012 13:30, John Nagle<nagle@animats.com> wrote:
> | "html5lib" is apparently not thread safe.
> | (see "http://code.google.com/p/html5lib/issues/detail?id=189")
> | Looking at the code, I've only found about three problems.
> | They're all the usual "cached in a global without locking" bug.
> | A few locks would fix that.
> |
> | But html5lib calls the XML SAX parser. Is that thread-safe?
> | Or is there more trouble down at the bottom?
> |
> | (I run a multi-threaded web crawler, and currently use BeautifulSoup,
> | which is thread safe, although dated. I'm looking at converting to
> | html5lib.)
>
> IIRC, BeautifulSoup4 may do that for you:
>
> http://www.crummy.com/software/BeautifulSoup/bs4/doc/
>
> http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
> "Beautiful Soup 4 uses html.parser by default, but you can plug in
> lxml or html5lib and use that instead."
I want to use HTML5 standard parsing of bad HTML. (HTML5 formally
defines how to parse bad comments, for example.) I currently have
a modified version of BeautifulSoup that's more robust than the
standard one, but it doesn't handle errors the same way browsers do.
John Nagle
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2012-03-12 02:39 -0700 |
| Message-ID | <7x399e40pw.fsf@ruckus.brouhaha.com> |
| In reply to | #21503 |
John Nagle <nagle@animats.com> writes:
> But html5lib calls the XML SAX parser. Is that thread-safe?
> Or is there more trouble down at the bottom?
According to
http://xmlbench.sourceforge.net/results/features200303/index.html
libxml and expat both purport to be thread-safe. I've used the python
expat library (not from multiple threads) and it works fine, though the
python calls slow it down by worse than an order of magnitude.
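Whatever the underlying C library guarantees, a cautious way to use the standard-library SAX interface from several threads is to give every thread its own parser and handler. The sketch below assumes that approach; `TagCounter` and `count_tags` are hypothetical names, and the example demonstrates usage rather than proving thread safety.

```python
import threading
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts start tags; one instance per document, never shared."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_tags(document, results, index):
    handler = TagCounter()
    # parseString builds a fresh parser for this call, so no parser or
    # handler object is shared between threads.
    xml.sax.parseString(document, handler)
    results[index] = handler.count


docs = [b"<a><b/><c/></a>", b"<x><y/></x>", b"<r/>"]
results = [None] * len(docs)
threads = [
    threading.Thread(target=count_tags, args=(doc, results, i))
    for i, doc in enumerate(docs)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread writes only to its own slot in `results`, so the only shared state is whatever the parser implementation keeps internally, which is exactly the part the question in this thread is about.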
| From | Stefan Behnel <stefan_ml@behnel.de> |
|---|---|
| Date | 2012-03-12 11:05 +0100 |
| Message-ID | <mailman.582.1331546749.3037.python-list@python.org> |
| In reply to | #21503 |
John Nagle, 11.03.2012 21:30:
> "html5lib" is apparently not thread safe.
> (see "http://code.google.com/p/html5lib/issues/detail?id=189")
> Looking at the code, I've only found about three problems.
> They're all the usual "cached in a global without locking" bug.
> A few locks would fix that.
>
> But html5lib calls the XML SAX parser. Is that thread-safe?
> Or is there more trouble down at the bottom?
>
> (I run a multi-threaded web crawler, and currently use BeautifulSoup,
> which is thread safe, although dated. I'm looking at converting to
> html5lib.)
You may also consider moving to lxml. BeautifulSoup supports it as a parser
backend these days, so you wouldn't even have to rewrite your code to use
it. And performance-wise, well ...
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Stefan
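A rough sketch of what switching the backend looks like in Beautiful Soup 4: the parser is chosen by name in the second argument to `BeautifulSoup`, with "lxml", "html5lib" and "html.parser" being the names the bs4 documentation uses. The helpers `pick_parser` and `parse` are hypothetical, and the code assumes bs4 and the chosen backend may or may not be installed.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed backend name, else the stdlib parser."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"  # ships with Python, so always available


def parse(markup):
    """Parse markup with Beautiful Soup 4 if installed, else return None."""
    if importlib.util.find_spec("bs4") is None:
        return None
    from bs4 import BeautifulSoup  # third-party; imported only if present

    # The backend is selected purely by name; calling code stays the same.
    return BeautifulSoup(markup, pick_parser())
```

For browser-style error recovery rather than speed, the preference order could be reversed, for example `pick_parser(preferred=("html5lib",))`.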
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2012-03-12 09:07 -0700 |
| Message-ID | <4f5e1f5b$0$12023$742ec2ed@news.sonic.net> |
| In reply to | #21526 |
On 3/12/2012 3:05 AM, Stefan Behnel wrote:
> John Nagle, 11.03.2012 21:30:
>> "html5lib" is apparently not thread safe.
>> (see "http://code.google.com/p/html5lib/issues/detail?id=189")
>> Looking at the code, I've only found about three problems.
>> They're all the usual "cached in a global without locking" bug.
>> A few locks would fix that.
>>
>> But html5lib calls the XML SAX parser. Is that thread-safe?
>> Or is there more trouble down at the bottom?
>>
>> (I run a multi-threaded web crawler, and currently use BeautifulSoup,
>> which is thread safe, although dated. I'm looking at converting to
>> html5lib.)
>
> You may also consider moving to lxml. BeautifulSoup supports it as a parser
> backend these days, so you wouldn't even have to rewrite your code to use
> it. And performance-wise, well ...
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Stefan
I want to move to html5lib because it handles HTML errors as
specified by the HTML5 spec, which is what all newer browsers do.
The HTML5 spec actually specifies, in great detail, how to parse
common errors in HTML. It's amusing seeing that formalized.
Malformed comments ( <- instead of <-- ) are now handled in
a standard way, for example. So I'm trying to get html5parser
fixed for thread safety.
John Nagle