httpd-bugs mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 46215] New: Race condition in bybusyness algorithm
Date Sat, 15 Nov 2008 00:24:51 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=46215

           Summary: Race condition in bybusyness algorithm
           Product: Apache httpd-2
           Version: 2.2.10
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: mod_proxy_balancer
        AssignedTo: bugs@httpd.apache.org
        ReportedBy: danudey@gmail.com


Created an attachment (id=22876)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=22876)
Patch resolving the issue

In scenarios with large numbers of backend workers, a race condition can
prevent the 'busy' counter from being decremented properly, causing some
workers to be ignored completely. Typically, all workers should be at the same
busyness when idle.
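For context, the bybusyness method hands each new request to the member with the lowest 'busy' count, so a member whose counter is stuck high is simply never chosen. A simplified sketch of that selection (hypothetical names, not the actual find_best_bybusyness() code):

    /* Simplified least-busy selection; names and types here are
     * hypothetical, not the real mod_proxy_balancer structures. */
    struct member {
        const char *url;
        int busy;    /* in-flight requests; should fall back to 0 when idle */
    };

    static struct member *pick_least_busy(struct member *members, int n)
    {
        struct member *best = NULL;
        int i;

        for (i = 0; i < n; i++) {
            if (best == NULL || members[i].busy < best->busy)
                best = &members[i];
        }
        /* A member whose busy count never comes back down is not selected
         * until every other member reaches the same (inflated) level. */
        return best;
    }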

ps output showing the effect:

       deploy   27326 14.3  3.2 271728 130632 ?  Sl  23:23  1:03  mongrel_rails [8000/0/289]: idle
       deploy   27329 15.4  3.7 298428 150368 ?  Sl  23:23  1:08  mongrel_rails [8001/0/289]: idle
       deploy   27332 16.6  3.8 296292 154976 ?  Sl  23:23  1:13  mongrel_rails [8002/0/288]: idle
       deploy   27335 15.5  3.3 279404 136820 ?  Sl  23:23  1:08  mongrel_rails [8003/0/289]: idle
       deploy   27338 16.6  3.4 280396 139452 ?  Sl  23:23  1:13  mongrel_rails [8004/0/290]: idle
       deploy   27341 13.6  3.3 275600 134724 ?  Sl  23:23  1:00  mongrel_rails [8005/0/288]: idle
       deploy   27344  1.1  1.5 155708  62616 ?  Sl  23:23  0:04  mongrel_rails [8006/0/7]: idle
       deploy   27347 16.2  3.7 299976 153908 ?  Sl  23:23  1:11  mongrel_rails [8007/0/287]: idle
       deploy   27350  1.3  2.5 241708 104364 ?  Sl  23:23  0:05  mongrel_rails [8008/0/5]: idle
       deploy   27354  1.4  2.6 246368 109044 ?  Sl  23:23  0:06  mongrel_rails [8009/0/4]: idle
       deploy   27359  1.0  1.4 151124  58096 ?  Sl  23:23  0:04  mongrel_rails [8010/0/0]: idle
       deploy   27362  0.9  1.4 151140  58112 ?  Sl  23:23  0:04  mongrel_rails [8011/0/0]: idle

balancer-manager output, showing that all workers are in the 'Ok' state:

       Worker URL          Route RouteRedir Factor Set Status Elected  To    From
       http://cimbar:8000                   1      0   Ok     415      315K  22M
       http://cimbar:8001                   1      0   Ok     416      324K  22M
       http://cimbar:8002                   1      0   Ok     484      392K  27M
       http://cimbar:8003                   1      0   Ok     483      381K  26M
       http://cimbar:8004                   1      0   Ok     484      379K  26M
       http://cimbar:8005                   1      0   Ok     484      374K  25M
       http://cimbar:8006                   1      0   Ok     52       44K   2.6M
       http://cimbar:8007                   1      0   Ok     608      474K  34M
       http://cimbar:8008                   1      0   Ok     53       41K   2.6M
       http://cimbar:8009                   1      0   Ok     53       43K   2.9M
       http://cimbar:8010                   1      0   Ok     5        1.1K  6.6K
       http://cimbar:8011                   1      0   Ok     7        1.2K  62K

When we look at the debug logs, however, we see that the 'busy' counters for
ports 8006 and 8008-8011 are all stuck at '3', and that these are exactly the
workers which are not receiving requests.

       dan@waterdeep:/var/log/apache2$ for port in {8000..8011}; do fgrep "bybusyness selected worker \"http://cimbar:${port}" /tmp/logfile | tail -n1; done
       [Thu Nov 13 23:32:39 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8000" : busy 2 : lbstatus -1922
       [Thu Nov 13 23:32:45 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8001" : busy 2 : lbstatus -1910
       [Thu Nov 13 23:34:24 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8002" : busy 2 : lbstatus -2233
       [Thu Nov 13 23:34:25 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8003" : busy 2 : lbstatus -2236
       [Thu Nov 13 23:34:23 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8004" : busy 2 : lbstatus -2234
       [Thu Nov 13 23:34:24 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8005" : busy 2 : lbstatus -2236
       [Thu Nov 13 23:32:45 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8006" : busy 3 : lbstatus 2468
       [Thu Nov 13 23:34:25 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8007" : busy 1 : lbstatus -3444
       [Thu Nov 13 23:33:54 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8008" : busy 3 : lbstatus 2724
       [Thu Nov 13 23:32:43 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8009" : busy 3 : lbstatus 2459
       [Thu Nov 13 23:32:39 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8010" : busy 3 : lbstatus 2987
       [Thu Nov 13 23:32:45 2008] [debug] mod_proxy_balancer.c(1173): proxy: bybusyness selected worker "http://cimbar:8011" : busy 3 : lbstatus 2983

I've traced this effect to modules/proxy/mod_proxy_balancer.c:578-579, where the
counter is incremented. There is no mutex locking around this code, so
simultaneous increments and decrements by separate threads on separate CPUs can
race and skew the counter upward.
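The underlying problem is a classic lost update: busy++ and busy-- each expand
to a load, an arithmetic operation, and a store, and when two threads interleave
those steps one of the updates is silently lost. A standalone illustration of
the effect (plain pthreads, not the Apache code):

    /* Standalone demonstration of a lost-update race on an unlocked
     * counter; an illustration only, not the mod_proxy_balancer code.
     * Build with: gcc -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 1000000L

    static long busy = 0;    /* shared counter, no locking */

    static void *incr(void *arg)
    {
        long i;
        for (i = 0; i < ITERATIONS; i++)
            busy++;          /* load, add, store: not atomic */
        return NULL;
    }

    static void *decr(void *arg)
    {
        long i;
        for (i = 0; i < ITERATIONS; i++)
            busy--;          /* races against incr() */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, incr, NULL);
        pthread_create(&t2, NULL, decr, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Equal numbers of increments and decrements should leave the
         * counter at 0; interleaved updates usually leave it skewed. */
        printf("busy = %ld (expected 0)\n", busy);
        return 0;
    }

Guarding both the increment and the decrement with the same mutex removes the
skew, since the read-modify-write then happens as a unit.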

The skew could presumably occur downwards as well, but the decrement only runs
when busyness is greater than 0, so the counter can never go below zero. A
worker could only skew downwards from a busyness of 2 or greater, and a worker
skewed low receives more traffic, so its count quickly rises again. The upward
skew is effectively bounded by the maximum concurrent busyness the server
handles, because a worker with an inflated count is ignored until every other
worker reaches the same level; it would take a very high, sustained load to push
far beyond a busyness of 3.
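The one-sidedness follows from the guard on the decrement path, something along
these lines (a pattern illustration, not the exact httpd source):

    /* Guarded decrement: a lost update can leave the counter inflated,
     * but the guard keeps it from ever dropping below zero, so any
     * drift accumulates upward (pattern illustration only). */
    static void mark_request_done(int *busy)
    {
        if (*busy > 0)
            (*busy)--;
    }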

I've attached a patch which moves the decrement code in
proxy_balancer_post_request() up into the previously-commented-out mutex
lock/unlock code in the same function. This has so far resolved the issue on
our production system.
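The attached patch (id=22876) is not reproduced here; roughly, the described
change gives the post_request path the following shape, assuming the
PROXY_THREAD_LOCK()/PROXY_THREAD_UNLOCK() helpers from mod_proxy.h:

    /* Sketch of the described fix only, not the attached patch; the lock
     * helpers and log messages are assumptions, and the usual httpd and
     * mod_proxy includes are presumed to be in scope. */
    static int proxy_balancer_post_request(proxy_worker *worker,
                                           proxy_balancer *balancer,
                                           request_rec *r,
                                           proxy_server_conf *conf)
    {
        apr_status_t rv;

        if ((rv = PROXY_THREAD_LOCK(balancer)) != APR_SUCCESS) {
            ap_log_error(APLOG_MARK, APLOG_ERR, rv, r->server,
                         "proxy: BALANCER: (%s). Lock failed for post_request",
                         balancer->name);
            return HTTP_INTERNAL_SERVER_ERROR;
        }

        /* Decrement the worker's busy count while holding the balancer
         * mutex, so it cannot race against the increment on selection. */
        if (worker && worker->s->busy)
            worker->s->busy--;

        if ((rv = PROXY_THREAD_UNLOCK(balancer)) != APR_SUCCESS) {
            ap_log_error(APLOG_MARK, APLOG_ERR, rv, r->server,
                         "proxy: BALANCER: (%s). Unlock failed for post_request",
                         balancer->name);
        }

        return OK;
    }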


-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
