Re: html5lib not thread safe. Is the Python SAX library thread-safe?

Date 2012-03-12 08:45 +1100
From Cameron Simpson <cs@zip.com.au>
Subject Re: html5lib not thread safe. Is the Python SAX library thread-safe?
References <4f5d0b82$0$11967$742ec2ed@news.sonic.net>
Newsgroups comp.lang.python
Message-ID <mailman.574.1331502568.3037.python-list@python.org> (permalink)


On 11Mar2012 13:30, John Nagle <nagle@animats.com> wrote:
|     "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
| 
|     But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
| 
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated.  I'm looking at converting to
| html5lib.)
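
For illustration, the "cached in a global without locking" pattern and the "a few locks" fix might look like the sketch below. This is not html5lib's actual code: `_cache`, `_cache_lock`, `get_cached`, `build` and `demo` are hypothetical names invented for the example.

```python
import threading

_cache = {}                      # module-level cache shared by all threads
_cache_lock = threading.Lock()   # guards every read and write of _cache

def get_cached(key, build):
    """Return the cached value for key, calling build(key) at most once per key."""
    with _cache_lock:
        if key not in _cache:
            _cache[key] = build(key)
        return _cache[key]

def demo(n_threads=8):
    """Request the same key from several threads; report how often build ran."""
    build_calls = []
    def build(key):
        build_calls.append(key)
        return key.upper()
    threads = [threading.Thread(target=get_cached, args=("html", build))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(build_calls), _cache["html"]
```

Holding the lock while `build` runs keeps the sketch simple but serializes all cache misses; whether that matters depends on how expensive the cached values are to compute.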

IIRC, BeautifulSoup4 may do that for you:

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
    "Beautiful Soup 4 uses html.parser by default, but you can plug in
    lxml or html5lib and use that instead."

Just for interest, re locking, I wrote a little decorator the other day,
thus:

  @locked_property
  def foo(self):
    # compute foo here ...
    return foo_value

and am rolling its use out amongst my classes. Code:

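  def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    ''' A property whose access is controlled by a lock if unset.
        The instance must already have the lock attribute (default `_lock`)
        and the backing attribute (default '_' + the function name), with
        the latter initialised to `unset_object`.
    '''
    if prop_name is None:
      # func.__name__ works in Python 2 and 3; func.func_name is Python 2 only.
      prop_name = '_' + func.__name__
    def getprop(self):
      ''' Attempt lockless fetch of property first.
          Use lock if property is unset.
      '''
      p = getattr(self, prop_name)
      if p is unset_object:
        with getattr(self, lock_name):
          # Recheck under the lock: another thread may have set it meanwhile.
          p = getattr(self, prop_name)
          if p is unset_object:
            p = func(self)
            setattr(self, prop_name, p)
      return p
    return property(getprop)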

It tries to be lockless in the common case. I suspect it is only safe in
CPython, where there is a GIL. If raw Python assignments and fetches can
overlap (e.g. in Jython, I think?) I probably need a shared "read" lock around
the first "p = getattr(self, prop_name)". Any remarks?
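
A minimal usage sketch of the decorator, assuming CPython and assuming the instance sets up `_lock` and `_foo` itself. The `Page` class, its `calls` counter and `run_demo` are hypothetical and exist only to show that the wrapped function runs once even when several threads read the property:

```python
import threading

def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    """Same decorator as above, repeated so this sketch is self-contained."""
    if prop_name is None:
        prop_name = '_' + func.__name__
    def getprop(self):
        p = getattr(self, prop_name)          # lockless fast path
        if p is unset_object:
            with getattr(self, lock_name):
                p = getattr(self, prop_name)  # recheck under the lock
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p
    return property(getprop)

class Page(object):  # hypothetical example class
    def __init__(self):
        self._lock = threading.Lock()  # looked up via lock_name
        self._foo = None               # backing attribute, starts unset
        self.calls = 0                 # counts how often foo is computed

    @locked_property
    def foo(self):
        self.calls += 1                # only ever runs while _lock is held
        return 42

def run_demo(n_threads=8):
    """Read page.foo from several threads; return (compute count, results)."""
    page = Page()
    results = []
    threads = [threading.Thread(target=lambda: results.append(page.foo))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return page.calls, results
```

One consequence of the default `unset_object=None`: a function that legitimately returns None is never treated as cached and is recomputed, under the lock, on every access. Passing a private sentinel object as `unset_object` avoids that.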

Cheers,
-- 
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Ed Campbell's <ed@Tekelex.Com> pointers for long trips:
1. lay out the bare minimum of stuff that you need to take with you, then
   put at least half of it back.
