


Re: html5lib not thread safe. Is the Python SAX library thread-safe?

Date Mon, 12 Mar 2012 08:45:01 +1100
From Cameron Simpson <cs@zip.com.au>
To John Nagle <nagle@animats.com>
Subject Re: html5lib not thread safe. Is the Python SAX library thread-safe?
Newsgroups comp.lang.python
Message-ID <mailman.574.1331502568.3037.python-list@python.org> (permalink)



On 11Mar2012 13:30, John Nagle <nagle@animats.com> wrote:
|     "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
| 
|     But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
| 
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated.  I'm looking at converting to
| html5lib.)

IIRC, BeautifulSoup4 may do that for you:

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
    "Beautiful Soup 4 uses html.parser by default, but you can plug in
    lxml or html5lib and use that instead."

Just for interest, re locking, I wrote a little decorator the other day,
thus:

  @locked_property
  def foo(self):
    # compute foo here ...
    return foo_value

and am rolling its use out amongst my classes. Code:

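  def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    ''' A property whose access is controlled by a lock if unset.
        The instance must already carry the lock attribute (default
        `_lock`) and the backing attribute initialised to `unset_object`.
    '''
    if prop_name is None:
      # func.__name__ works in Python 2 and 3; func.func_name is Python 2 only.
      prop_name = '_' + func.__name__
    def getprop(self):
      ''' Attempt lockless fetch of property first.
          Use lock if property is unset.
      '''
      p = getattr(self, prop_name)
      if p is unset_object:
        with getattr(self, lock_name):
          # Recheck under the lock: another thread may have set it meanwhile.
          p = getattr(self, prop_name)
          if p is unset_object:
            p = func(self)
            setattr(self, prop_name, p)
      return p
    return property(getprop)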

It tries to be lockless in the common case. I suspect it is only safe in
CPython, where there is a GIL. If raw Python assignments and fetches can
overlap (e.g. Jython, I think?) I probably need a shared "read" lock around
the first "p = getattr(self, prop_name)". Any remarks?
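
For concreteness, here is a minimal sketch of how the decorator might be
exercised from several threads. The Page class, its title property and the
run_demo helper are hypothetical names made up for this illustration. The
decorator is repeated so the snippet stands alone, spelled with
func.__name__ so it also runs on Python 3. It only shows the intended
behaviour under CPython; it does not settle whether the lockless first
fetch is safe on other implementations.

```python
import threading


def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    """Lazy property: lockless read first, lock taken only while unset."""
    if prop_name is None:
        prop_name = '_' + func.__name__

    def getprop(self):
        p = getattr(self, prop_name)
        if p is unset_object:
            with getattr(self, lock_name):
                # Recheck under the lock in case another thread got there first.
                p = getattr(self, prop_name)
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p

    return property(getprop)


class Page(object):
    """Hypothetical example class, not part of html5lib or BeautifulSoup."""

    def __init__(self):
        self._lock = threading.Lock()  # name expected by lock_name='_lock'
        self._title = None             # None is the default "unset" marker
        self.calls = 0                 # counts how often the body really runs

    @locked_property
    def title(self):
        # Only ever entered while self._lock is held.
        self.calls += 1
        return "computed title"


def run_demo(nthreads=8):
    """Read page.title from several threads; return (call count, results)."""
    page = Page()
    results = []

    def worker():
        results.append(page.title)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return page.calls, results
```

Note that the instance has to create both _lock and _title before the
property is first read, since the getter uses getattr without a default.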

Cheers,
-- 
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Ed Campbell's <ed@Tekelex.Com> pointers for long trips:
1. lay out the bare minimum of stuff that you need to take with you, then
   put at least half of it back.



Thread

html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-11 13:30 -0700
  Re: html5lib not thread safe. Is the Python SAX library thread-safe? Cameron Simpson <cs@zip.com.au> - 2012-03-12 08:45 +1100
    Re: html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-11 21:48 -0700
  Re: html5lib not thread safe. Is the Python SAX library thread-safe? Paul Rubin <no.email@nospam.invalid> - 2012-03-12 02:39 -0700
  Re: html5lib not thread safe. Is the Python SAX library thread-safe? Stefan Behnel <stefan_ml@behnel.de> - 2012-03-12 11:05 +0100
    Re: html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-12 09:07 -0700
