Groups > comp.lang.python > #28341 > unrolled thread

tornado.web ioloop add_timeout eats CPU

Started by	Laszlo Nagy <gandalf@shopzeus.com>
First post	2012-09-03 08:31 +0200
Last post	2012-09-04 01:21 -0700
Articles	4 — 3 participants

Back to article view | Back to comp.lang.python

  tornado.web ioloop add_timeout eats CPU Laszlo Nagy <gandalf@shopzeus.com> - 2012-09-03 08:31 +0200
    Re: tornado.web ioloop add_timeout eats CPU Bryan <bryanjugglercryptographer@yahoo.com> - 2012-09-03 12:47 -0700
      Re: tornado.web ioloop add_timeout eats CPU Laszlo Nagy <gandalf@shopzeus.com> - 2012-09-04 09:30 +0200
        Re: tornado.web ioloop add_timeout eats CPU Paul Rubin <no.email@nospam.invalid> - 2012-09-04 01:21 -0700

#28341 — tornado.web ioloop add_timeout eats CPU

From	Laszlo Nagy <gandalf@shopzeus.com>
Date	2012-09-03 08:31 +0200
Subject	tornado.web ioloop add_timeout eats CPU
Message-ID	<mailman.129.1346653953.27098.python-list@python.org>

JavaScript clients (browsers) do long poll requests. Each request can 
take up to 10 seconds before the server responds. On the server side, 
every client has a queue of messages that needs to be sent to the 
client. When the long poll request comes in, the server checks if there 
are messages to be sent out. If there are no outgoing messages, then it 
does not finish the response, but calls ioloop's add_timeout method for 
doing further checks. After 10 seconds (if there are no new messages) 
the server returns 304/not modified. If there is a message, then it is 
sent back to the client as fast as possible, and the client comes back 
with another long poll immediately.

These message queues are used for UI updates and also for instant 
messaging. UI must be responsive. For this reason, any message in the 
outgoing queue should be sent out to the client within 0.1 seconds. 
Sometimes (rarely) lots of messages arrive quickly, and in those cases 
it would be good to send them out even faster. What I did is that in the 
first 0.1 seconds, I call add_timeout with 0.01 seconds. So if the 
outgoing queue is full of messages, then they are delivered quickly.  
After 0.1 seconds lapsed, add_timeout is called with 0.1 sec parameter. 
So the server load is reduced because most clients are inactive, and 
they are going to get callbacks in every 0.1 sec.

Here are the two most important methods of my request handler:

     @tornado.web.asynchronous
     def post(self):
         """Handle POST requests."""
         # Disable caching
         self.set_header("Cache-Control","no-cache, must-revalidate")
         self.set_header("Expires","Mon, 26 Jul 1997 05:00:00 GMT")
         self.poll_start = time.time()
         action = self.get_argument("action")
         if action=="poll":
             self.poll()
         elif action=="message":
             self.process_incoming(self.get_argument("message"))
         else:
             self.set_status(400)
             self.finish()

     def poll(self):
         """Handle POLL request for the browser's message loop.

         This method monitors the outgoing message queue, and sends
         new messages to the browser when they come in (or until
         self.poll_interval seconds elapsed)."""
         poll_elapsed = time.time() - self.poll_start
         if poll_elapsed<0.1:
             poll_step = 0.01
         else:
             poll_step = 0.1
         if poll_elapsed<self.poll_interval:
             if self.session.outgoing:
                 msg = self.session.outgoing.pop()
                 self.write(msg)
                 self.finish()
             else:
                 tornado.ioloop.IOLoop.instance().add_timeout(
                     time.time()+poll_step,self.poll)
         else:
             self.set_status(304)
             self.finish()

And here is my problem. If I point 5 browsers to the server, then I get 
2% CPU load (Intel i5 2.8GHz on amd64 Linux). But why? Most of the time, 
the server should be sleeping. cProfile tells this:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000  845.146  845.146 <string>:1(<module>)
   1135775  832.283    0.001  832.283    0.001 {method 'poll' of 
'select.epoll' objects}

I have copied out the two relevant rows only. As you can see, total 
runtime was 845 seconds, and 832 seconds were spent in "epoll". 
Apparently, CPU load goes up linearly as I connect more clients. It 
means that 50 connected clients would do 20% CPU load. Which is 
ridiculous, because they don't do anything but wait for messages to be 
processed. Something terribly wrong, but I cannot figure out what?

Actually I could not try this with 50 clients. If I open 15 clients, 
then the server starts dropping connections. (Tried from Firefox and 
Chrome too.) If I change the poll() method this way:

     else:
             print "No messages after %.2f seconds"%poll_elapsed
             self.set_status(304)
             self.finish()

then I see this in the log:

No messages after 10.01 seconds
ERROR:root:Uncaught exception POST /client (127.0.0.1)
HTTPRequest(protocol='http', host='127.0.0.1:8888', method='POST', 
uri='/client', version='HTTP/1.1', remote_ip='127.0.0.1', 
body='_xsrf=df157469a62142d7b28c5a4880dd8478&action=poll', 
headers={'Referer': 'http://127.0.0.1:8888/', 'Content-Length': '50', 
'Accept-Language': 'en-us;q=0.8,en;q=0.5', 'Accept-Encoding': 'gzip, 
deflate', 'Host': '127.0.0.1:8888', 'Accept': '*/*', 'User-Agent': 
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 
Firefox/15.0', 'Connection': 'keep-alive', 'X-Requested-With': 
'XMLHttpRequest', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache', 
'Cookie': 
'sid="MS1acHd5b3V1WHFOQU1BbTVmSXJEeVhkLys=|1346652787|e045d786fdb89b73220a2c77ef89572d0c16901e"; 
_xsrf=df157469a62142d7b28c5a4880dd8478; 
xsfr=df157469a62142d7b28c5a4880dd8478', 'Content-Type': 
'application/x-www-form-urlencoded; charset=UTF-8'})
Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", 
line 183, in wrapped
     callback(*args, **kwargs)
   File "/home/gandalf/Python/Projects/test/client.py", line 67, in poll
     self.finish()
   File "/usr/lib/python2.7/dist-packages/tornado/web.py", line 641, in 
finish
     self.request.finish()
   File "/usr/lib/python2.7/dist-packages/tornado/httpserver.py", line 
411, in finish
     self.connection.finish()
   File "/usr/lib/python2.7/dist-packages/tornado/httpserver.py", line 
186, in finish
     self._finish_request()
   File "/usr/lib/python2.7/dist-packages/tornado/httpserver.py", line 
213, in _finish_request
     self.stream.read_until(b("\r\n\r\n"), self._header_callback)
   File "/usr/lib/python2.7/dist-packages/tornado/iostream.py", line 
151, in read_until
     self._check_closed()
   File "/usr/lib/python2.7/dist-packages/tornado/iostream.py", line 
493, in _check_closed
     raise IOError("Stream is closed")
IOError: Stream is closed

What is even more interesting is that on the client side, Firebug tells 
me that the connection was dropped after 15 seconds. The JavaScript 
message loop is operated by a jQuery AJAX call, that has timeout=15000 
parameter given:

<snip>
     that.messageLoop = function () {
         var xsfr = that.readCookie("xsfr");
         $.ajax({
           url: '/client',
           type: "POST",
           data: {"_xsrf":xsfr,"action":"poll"},
           async: true,
           cache: false,
           timeout: 15000,
           error: function (data) {
                 setTimeout( that.messageLoop , 1000);
                 console.log(data);
           },
           success: function (data) {
             if (data) {
                 try {
                     eval(data);
                 }
                 catch(err) {
                   $().toastmessage('showErrorToast', err.message);
                 }
             }
             setTimeout( that.messageLoop , 10);
           }
         });
     };
</snip>

But on the server side, I see that the connection was dropped after 
10.01 seconds. If I start 20 clients, then about every second poll 
request gets dropped.

Thanks,

    Laszlo

[toc] | [next] | [standalone]

#28367

From	Bryan <bryanjugglercryptographer@yahoo.com>
Date	2012-09-03 12:47 -0700
Message-ID	<f06cfff2-0367-4548-a09b-736191699f29@k3g2000pbr.googlegroups.com>
In reply to	#28341

Laszlo Nagy wrote:
[...]
> And here is my problem. If I point 5 browsers to the server, then I get
> 2% CPU load (Intel i5 2.8GHz on amd64 Linux). But why? Most of the time,
> the server should be sleeping. cProfile tells this:
>
>     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>          1    0.000    0.000  845.146  845.146 <string>:1(<module>)
>    1135775  832.283    0.001  832.283    0.001 {method 'poll' of
> 'select.epoll' objects}
>
> I have copied out the two relevant rows only. As you can see, total
> runtime was 845 seconds, and 832 seconds were spent in "epoll".
> Apparently, CPU load goes up linearly as I connect more clients. It
> means that 50 connected clients would do 20% CPU load. Which is
> ridiculous, because they don't do anything but wait for messages to be
> processed. Something terribly wrong, but I cannot figure out what?

What's wrong is the 1,135,775 calls to "method 'poll' of
'select.epoll' objects".
With five browsers waiting for messages over 845 seconds, that works
out to each  waiting browser inducing 269 epolls per second.

Almost equally important is what the problem is *not*. The problem is
*not* spending the vast majority of time in epoll; that's *good* news.
The problem is *not* that CPU load goes up linearly as we connect more
clients. This is an efficiency problem, not a scaling problem.

So what's the fix? I'm not a Tornado user; I don't have a patch.
Obviously Laszlo's polling strategy is not performing, and the
solution is to adopt the event-driven approach that epoll and Tornado
do well.

-Bryan

[toc] | [prev] | [next] | [standalone]

#28393

From	Laszlo Nagy <gandalf@shopzeus.com>
Date	2012-09-04 09:30 +0200
Message-ID	<mailman.170.1346744270.27098.python-list@python.org>
In reply to	#28367

> What's wrong is the 1,135,775 calls to "method 'poll' of
> 'select.epoll' objects".
I was affraid you are going to say that. :-)
> With five browsers waiting for messages over 845 seconds, that works
> out to each  waiting browser inducing 269 epolls per second.
>
> Almost equally important is what the problem is *not*. The problem is
> *not* spending the vast majority of time in epoll; that's *good* news.
> The problem is *not* that CPU load goes up linearly as we connect more
> clients. This is an efficiency problem, not a scaling problem.
>
> So what's the fix? I'm not a Tornado user; I don't have a patch.
> Obviously Laszlo's polling strategy is not performing, and the
> solution is to adopt the event-driven approach that epoll and Tornado
> do well.
Actually, I have found a way to overcome this problem, and it seems to 
be working. Instead of calling add_timeout from every request, I save 
the request objects in a list, and operate a "message distributor" 
service in the background that routes messages to clients, and finish 
their long poll requests when needed. The main point is that the 
"message distributor" has a single entry point, and it is called back at 
given intervals. So the number of callbacks per second does not increase 
with the number of clients. Now the CPU load is about 1% with one 
client, and it is the same with 15 clients. While the response time is 
the same (50-100msec). It is efficient enough for me.

I understand that most people do a different approach: they do a fast 
poll request from the browser in every 2 seconds or so. But this is not 
good for me, because then it can take 2 seconds to send a message from 
one browser into another that is not acceptable in my case. Implementing 
long polls with a threaded server would be trivial, but a threaded 
server cannot handle 100+ simultaneous (long running) requests, because 
that would require 100+ threads to be running.

This central "message distributor" concept seems to be working. About 
1-2% CPU overhead I have to pay for being able to send messages from one 
browser into another within 100msec, which is fine.

I could have not done this without your help.

Thank you!

    Laszlo

[toc] | [prev] | [next] | [standalone]

#28396

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-09-04 01:21 -0700
Message-ID	<7xvcfui4ff.fsf@ruckus.brouhaha.com>
In reply to	#28393

Laszlo Nagy <gandalf@shopzeus.com> writes:
> but a threaded server cannot handle 100+ simultaneous (long running)
> requests, because that would require 100+ threads to be running.

On a reasonable server these days, 100 threads seems to be no big deal.
I've run several times that many.  I think things get ragged at a few
thousand threads.

[toc] | [prev] | [standalone]

csiph-web

tornado.web ioloop add_timeout eats CPU

Contents

#28341 — tornado.web ioloop add_timeout eats CPU

#28367

#28393

#28396