Date: Mon, 12 Mar 2012 08:45:01 +1100
From: Cameron Simpson
To: John Nagle
Cc: python-list@python.org
Subject: Re: html5lib not thread safe. Is the Python SAX library thread-safe?
Newsgroups: comp.lang.python

On 11Mar2012 13:30, John Nagle wrote:
| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)

IIRC, BeautifulSoup4 may do that for you:

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
    "Beautiful Soup 4 uses html.parser by default, but you can plug in
    lxml or html5lib and use that instead."

Just for interest, re locking, I wrote a little decorator the other day,
thus:

    @locked_property
    def foo(self):
        compute foo here ...
        return foo value

and am rolling its use out amongst my classes. Code:

    def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
        ''' A property whose access is controlled by a lock if unset.
        '''
        if prop_name is None:
            # func.__name__ works in both Python 2 and 3 (func.func_name is 2 only)
            prop_name = '_' + func.__name__
        def getprop(self):
            ''' Attempt lockless fetch of property first.
                Use lock if property is unset.
            '''
            p = getattr(self, prop_name)
            if p is unset_object:
                with getattr(self, lock_name):
                    # recheck under the lock: another thread may have set it
                    p = getattr(self, prop_name)
                    if p is unset_object:
                        p = func(self)
                        setattr(self, prop_name, p)
            return p
        return property(getprop)

It tries to be lockless in the common case. I suspect it is only safe in
CPython, where there is a GIL. If raw Python assignments and fetches can
overlap (eg Jython I think?) I probably need a shared "read" lock around
the first "p = getattr(self, prop_name)".

Any remarks?

Cheers,
-- 
Cameron Simpson DoD#743
http://www.cskk.ezoshosting.com/cs/

Ed Campbell's pointers for long trips:
    1. lay out the bare minimum of stuff that you need to take with you, then
       put at least half of it back.
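
For anyone wanting to try the decorator above, here is a minimal sketch of how it might be exercised from several threads. It assumes Python 3 and CPython. The class `Page`, its `calls` counter and the helper `run_demo` are hypothetical names made up for the illustration; they are not from the post. Note that the instance has to create both the lock and the "unset" attribute itself, because the getter does a plain getattr on them.

```python
import threading


def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    '''The decorator from the post, spelled with func.__name__.'''
    if prop_name is None:
        prop_name = '_' + func.__name__

    def getprop(self):
        p = getattr(self, prop_name)              # lockless fast path
        if p is unset_object:
            with getattr(self, lock_name):
                p = getattr(self, prop_name)      # recheck under the lock
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p
    return property(getprop)


class Page:
    '''Hypothetical class, for illustration only.'''

    def __init__(self):
        self._lock = threading.Lock()   # found via lock_name='_lock'
        self._title = None              # None is the default unset_object
        self.calls = 0                  # how many times the compute step ran

    @locked_property
    def title(self):
        self.calls += 1                 # only ever runs while holding the lock
        return 'computed title'


def run_demo(nthreads=8):
    '''Read page.title from several threads; return the page and the values seen.'''
    page = Page()
    results = []

    def worker():
        results.append(page.title)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return page, results
```

Because the compute step and the recheck both happen while the lock is held, the wrapped function should run once per instance however many threads race on the first access. Whether the lockless first fetch is safe outside CPython is exactly the open question in the post, and this sketch does not settle it.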