


html5lib not thread safe. Is the Python SAX library thread-safe?

Started by: John Nagle <nagle@animats.com>
First post: 2012-03-11 13:30 -0700
Last post: 2012-03-12 09:07 -0700
Articles: 6 (4 participants)



Contents

  html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-11 13:30 -0700
    Re: html5lib not thread safe. Is the Python SAX library thread-safe? Cameron Simpson <cs@zip.com.au> - 2012-03-12 08:45 +1100
      Re: html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-11 21:48 -0700
    Re: html5lib not thread safe. Is the Python SAX library thread-safe? Paul Rubin <no.email@nospam.invalid> - 2012-03-12 02:39 -0700
    Re: html5lib not thread safe. Is the Python SAX library thread-safe? Stefan Behnel <stefan_ml@behnel.de> - 2012-03-12 11:05 +0100
      Re: html5lib not thread safe. Is the Python SAX library thread-safe? John Nagle <nagle@animats.com> - 2012-03-12 09:07 -0700

#21503 — html5lib not thread safe. Is the Python SAX library thread-safe?

From: John Nagle <nagle@animats.com>
Date: 2012-03-11 13:30 -0700
Subject: html5lib not thread safe. Is the Python SAX library thread-safe?
Message-ID: <4f5d0b82$0$11967$742ec2ed@news.sonic.net>

    "html5lib" is apparently not thread safe.
(see "http://code.google.com/p/html5lib/issues/detail?id=189")
Looking at the code, I've only found about three problems.
They're all the usual "cached in a global without locking" bug.
A few locks would fix that.
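
For illustration, the "cached in a global without locking" pattern and its fix can be sketched as below. This is a minimal sketch, not html5lib's actual code: `_cache`, `_cache_lock`, `build_table` and `get_table` are hypothetical names invented for the example.

```python
import threading

_cache = {}                      # module-level cache shared by all threads
_cache_lock = threading.Lock()   # guards every read-modify-write of _cache


def build_table(name):
    # Hypothetical stand-in for an expensive, deterministic computation.
    return {"name": name, "size": len(name)}


def get_table(name):
    """Return the cached table for name, building it at most once."""
    with _cache_lock:
        table = _cache.get(name)
        if table is None:
            table = build_table(name)
            _cache[name] = table
        return table
```

Holding the lock across the lookup and the store means two threads cannot both see a miss and build the same entry twice. The cost is that every lookup takes the lock, including cache hits.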

    But html5lib calls the XML SAX parser. Is that thread-safe?
Or is there more trouble down at the bottom?

(I run a multi-threaded web crawler, and currently use BeautifulSoup,
which is thread safe, although dated.  I'm looking at converting to
html5lib.)

				John Nagle



#21506

From: Cameron Simpson <cs@zip.com.au>
Date: 2012-03-12 08:45 +1100
Message-ID: <mailman.574.1331502568.3037.python-list@python.org>
In reply to: #21503

On 11Mar2012 13:30, John Nagle <nagle@animats.com> wrote:
|     "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
| 
|     But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
| 
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated.  I'm looking at converting to
| html5lib.)

IIRC, BeautifulSoup4 may do that for you:

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
    "Beautiful Soup 4 uses html.parser by default, but you can plug in
    lxml or html5lib and use that instead."

Just for interest, re locking, I wrote a little decorator the other day,
thus:

  @locked_property
  def foo(self):
    compute foo here ...
    return foo value

and am rolling its use out amongst my classes. Code:

  def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    ''' A property whose access is controlled by a lock if unset.
        The instance must already have the lock attribute and the
        backing attribute (initially set to unset_object).
    '''
    if prop_name is None:
      # func.__name__ works on Python 2 and 3; func.func_name is Python 2 only.
      prop_name = '_' + func.__name__
    def getprop(self):
      ''' Attempt lockless fetch of property first.
          Use lock if property is unset.
      '''
      p = getattr(self, prop_name)
      if p is unset_object:
        with getattr(self, lock_name):
          # Re-check under the lock: another thread may have set it meanwhile.
          p = getattr(self, prop_name)
          if p is unset_object:
            p = func(self)
            setattr(self, prop_name, p)
      return p
    return property(getprop)

It tries to be lockless in the common case. I suspect it is only safe in
CPython, where there is a GIL. If raw Python assignments and fetches can
overlap (e.g. Jython, I think?) I probably need a shared "read" lock around
the first "p = getattr(self, prop_name)". Any remarks?
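
A usage sketch may make the decorator's preconditions clearer. The decorator is repeated so the example is self-contained, using `func.__name__` rather than `func.func_name` so that it also runs on Python 3. The `Page` class and its `calls` counter are hypothetical, added only to show the property body running once under concurrent access.

```python
import threading


def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    # Same double-checked logic as the decorator in the post.
    if prop_name is None:
        prop_name = '_' + func.__name__

    def getprop(self):
        p = getattr(self, prop_name)
        if p is unset_object:
            with getattr(self, lock_name):
                p = getattr(self, prop_name)
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p
    return property(getprop)


class Page(object):  # hypothetical example class
    def __init__(self, text):
        self._lock = threading.Lock()  # found via lock_name='_lock'
        self._words = None             # backing slot for the 'words' property
        self.text = text
        self.calls = 0                 # counts how often the body runs

    @locked_property
    def words(self):
        self.calls += 1
        return self.text.split()


page = Page("a b c")
threads = [threading.Thread(target=lambda: page.words) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Two things follow from the code as written: the instance has to create both `_lock` and the backing attribute before first access, and a body that legitimately returns `None` (the default `unset_object`) is recomputed on every access.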

Cheers,
-- 
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Ed Campbell's <ed@Tekelex.Com> pointers for long trips:
1. lay out the bare minimum of stuff that you need to take with you, then
   put at least half of it back.



#21520

From: John Nagle <nagle@animats.com>
Date: 2012-03-11 21:48 -0700
Message-ID: <4f5d8012$0$12021$742ec2ed@news.sonic.net>
In reply to: #21506

On 3/11/2012 2:45 PM, Cameron Simpson wrote:
> On 11Mar2012 13:30, John Nagle<nagle@animats.com>  wrote:
> |     "html5lib" is apparently not thread safe.
> | (see "http://code.google.com/p/html5lib/issues/detail?id=189")
> | Looking at the code, I've only found about three problems.
> | They're all the usual "cached in a global without locking" bug.
> | A few locks would fix that.
> |
> |     But html5lib calls the XML SAX parser. Is that thread-safe?
> | Or is there more trouble down at the bottom?
> |
> | (I run a multi-threaded web crawler, and currently use BeautifulSoup,
> | which is thread safe, although dated.  I'm looking at converting to
> | html5lib.)
>
> IIRC, BeautifulSoup4 may do that for you:
>
>    http://www.crummy.com/software/BeautifulSoup/bs4/doc/
>
>    http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
>      "Beautiful Soup 4 uses html.parser by default, but you can plug in
>      lxml or html5lib and use that instead."

    I want to use HTML5 standard parsing of bad HTML.  (HTML5 formally
defines how to parse bad comments, for example.)  I currently have
a modified version of BeautifulSoup that's more robust than the
standard one, but it doesn't handle errors the same way browsers do.

				John Nagle



#21525

From: Paul Rubin <no.email@nospam.invalid>
Date: 2012-03-12 02:39 -0700
Message-ID: <7x399e40pw.fsf@ruckus.brouhaha.com>
In reply to: #21503

John Nagle <nagle@animats.com> writes:
>    But html5lib calls the XML SAX parser. Is that thread-safe?
> Or is there more trouble down at the bottom?

According to

  http://xmlbench.sourceforge.net/results/features200303/index.html

libxml and expat both purport to be thread-safe.  I've used the python
expat library (not from multiple threads) and it works fine, though the
python calls slow it down by worse than an order of magnitude.
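
Whatever the underlying library guarantees, a common way to stay out of trouble is to give each thread its own parser and handler. The sketch below uses the standard library's `xml.sax` (expat-backed by default). `TagCounter` and `count_tags` are hypothetical names for the example. It demonstrates the no-shared-state pattern only and does not by itself prove that the C library is thread-safe.

```python
import threading
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts start tags; each thread gets its own instance."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_tags(data, results, index):
    handler = TagCounter()               # private to this thread
    xml.sax.parseString(data, handler)   # builds a fresh parser on each call
    results[index] = handler.count       # each thread writes its own slot


docs = [b"<a><b/><c/></a>", b"<x><y><z/></y></x>", b"<only/>"]
results = [None] * len(docs)
threads = [threading.Thread(target=count_tags, args=(d, results, i))
           for i, d in enumerate(docs)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no parser or handler object is shared, the only cross-thread state left is whatever the library itself keeps in globals, which is the part the question is really about.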



#21526

From: Stefan Behnel <stefan_ml@behnel.de>
Date: 2012-03-12 11:05 +0100
Message-ID: <mailman.582.1331546749.3037.python-list@python.org>
In reply to: #21503

John Nagle, 11.03.2012 21:30:
>    "html5lib" is apparently not thread safe.
> (see "http://code.google.com/p/html5lib/issues/detail?id=189")
> Looking at the code, I've only found about three problems.
> They're all the usual "cached in a global without locking" bug.
> A few locks would fix that.
> 
>    But html5lib calls the XML SAX parser. Is that thread-safe?
> Or is there more trouble down at the bottom?
> 
> (I run a multi-threaded web crawler, and currently use BeautifulSoup,
> which is thread safe, although dated.  I'm looking at converting to
> html5lib.)

You may also consider moving to lxml. BeautifulSoup supports it as a parser
backend these days, so you wouldn't even have to rewrite your code to use
it. And performance-wise, well ...

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan



#21537

From: John Nagle <nagle@animats.com>
Date: 2012-03-12 09:07 -0700
Message-ID: <4f5e1f5b$0$12023$742ec2ed@news.sonic.net>
In reply to: #21526

On 3/12/2012 3:05 AM, Stefan Behnel wrote:
> John Nagle, 11.03.2012 21:30:
>>     "html5lib" is apparently not thread safe.
>> (see "http://code.google.com/p/html5lib/issues/detail?id=189")
>> Looking at the code, I've only found about three problems.
>> They're all the usual "cached in a global without locking" bug.
>> A few locks would fix that.
>>
>>     But html5lib calls the XML SAX parser. Is that thread-safe?
>> Or is there more trouble down at the bottom?
>>
>> (I run a multi-threaded web crawler, and currently use BeautifulSoup,
>> which is thread safe, although dated.  I'm looking at converting to
>> html5lib.)
>
> You may also consider moving to lxml. BeautifulSoup supports it as a parser
> backend these days, so you wouldn't even have to rewrite your code to use
> it. And performance-wise, well ...
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Stefan

    I want to move to html5lib because it handles HTML errors as
specified by the HTML5 spec, which is what all newer browsers do.
The HTML5 spec actually specifies, in great detail, how to parse
common errors in HTML.  It's amusing seeing that formalized.
Malformed comments ( <- instead of <-- ) are now handled in
a standard way, for example.  So I'm trying to get html5parser
fixed for thread safety.

                                    John Nagle
				




