From Jess Holle
Subject Re: mod_proxy/mod_proxy_balancer bug
Date Tue, 14 Apr 2009 21:23:49 GMT
Jess Holle wrote:
> proxy_handler() calls ap_proxy_pre_request() inside a do loop over 
> balanced workers.
> This in turn calls proxy_balancer_pre_request() which does
>     (*worker)->s->busy++.
> Correspondingly proxy_balancer_post_request() does:
>         if (worker && worker->s->busy)
>             worker->s->busy--;
> Unfortunately, proxy_handler only calls proxy_run_post_request() and 
> thus proxy_balancer_post_request() outside the do loop.  Thus the 
> "busy" count of workers which currently cannot take requests (e.g. 
> those that are currently dead) increases without bound due to retries -- and 
> is never reset.
> Does anyone (i.e. who is more familiar with this code) have 
> suggestions for how this should be fixed?  If not, I can take a swing 
> at it.
> Similarly, when retrying workers in various routines in 
> mod_proxy_balancer.c those workers' lbstatus is incremented.  If the 
> retry fails, however, the lbstatus is never reset.  This issue also 
> leads to an lbstatus that increases without bound.  Just because a 
> worker was dead for 8 hours does not mean it can handle all the work 
> load now.  It needs to start fresh -- not 8 hours in the hole.  This 
> issue also creates an unduly huge impact when doing
>     mycandidate->s->lbstatus -= total_factor;
Actually, I'm off base here.  total_factor places undue emphasis on any 
worker that satisfies a request when multiple dead workers are retried.  
For instance, if there are 7 dead workers, all being retried, 2 healthy 
workers, and all with an lbfactor of 1 the worker that gets the request 
gets its lbstatus decremented by 9, whereas it really should only be 
decremented by 2 -- else the weighting gets thrown way off.  However, it 
is /not/ thrown off more due to the huge lbstatus values that build up 
in dead workers.  That only becomes an issue when dead workers come to life.
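
To make the arithmetic above concrete, here is a minimal sketch in C. It is a toy model, not the mod_proxy_balancer code: struct model_worker, model_total_factor() and model_fill_scenario() are names I made up for illustration, and I am assuming that total_factor is simply the sum of lbfactor over every worker counted in one selection pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a balancer member.
 * This is NOT the real proxy_worker structure. */
struct model_worker {
    int lbfactor;  /* configured weight */
    int healthy;   /* 1 = can actually serve requests */
    int retried;   /* 1 = dead, but currently being retried */
};

/* Sum lbfactor over the workers counted in one selection pass.
 * count_retried_dead = 1 models the behaviour described above,
 * where dead workers under retry still contribute;
 * count_retried_dead = 0 counts healthy workers only. */
int model_total_factor(const struct model_worker *w, size_t n,
                       int count_retried_dead)
{
    int total = 0;
    assert(w != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (w[i].healthy || (count_retried_dead && w[i].retried))
            total += w[i].lbfactor;
    }
    return total;
}

/* Scenario from this message: 7 dead workers being retried,
 * 2 healthy workers, all with an lbfactor of 1. */
void model_fill_scenario(struct model_worker w[9])
{
    for (size_t i = 0; i < 9; i++) {
        w[i].lbfactor = 1;
        w[i].healthy = (i >= 7);
        w[i].retried = (i < 7);
    }
}
```

In this model, model_total_factor(w, 9, 1) gives 9 and model_total_factor(w, 9, 0) gives 2, which is the gap between what the chosen worker is charged and what it arguably should be charged.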
> We're seeing the load balancing be thrown dramatically off in this case.
> Does anyone have suggestions for how this should be fixed?  If not, 
> again I can take a swing at this, e.g. resetting lbstatus to 0 in 
> ap_proxy_retry_worker().
> It *seems* like both of the issues center on handling of dead workers, 
> especially having multiple dead workers and/or workers that are dead 
> for long periods of time.
> I've not yet checked whether mod_jk (where I believe these basic 
> algorithms came from) has similar issues.
> --
> Jess Holle
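
For the "busy" count issue quoted above, a small toy model in C may help show how the counter drifts. This is only a sketch of the described call pattern, not the actual proxy_handler() code: struct model_busy, model_pre_request(), model_post_request() and model_handle_request() are hypothetical names, and the release_on_failure switch is just one possible fix direction, not something taken from the Apache sources.

```c
#include <assert.h>

/* Hypothetical stand-in for the shared worker state;
 * only the busy counter matters in this model. */
struct model_busy {
    int busy;
};

/* Models the increment described for proxy_balancer_pre_request(). */
void model_pre_request(struct model_busy *w)
{
    assert(w != 0);
    w->busy++;
}

/* Models the guarded decrement quoted from
 * proxy_balancer_post_request(). */
void model_post_request(struct model_busy *w)
{
    if (w && w->busy)
        w->busy--;
}

/* One client request: the retry loop picks the dead worker
 * dead_attempts times, then a healthy worker serves the request.
 * With release_on_failure = 0 the decrement happens only once,
 * after the loop, for the last worker chosen (the reported pattern).
 * With release_on_failure = 1 every failed attempt is released too
 * (an assumed fix, for comparison only). */
void model_handle_request(struct model_busy *dead,
                          struct model_busy *healthy,
                          int dead_attempts, int release_on_failure)
{
    for (int i = 0; i < dead_attempts; i++) {
        model_pre_request(dead);
        if (release_on_failure)
            model_post_request(dead);
    }
    model_pre_request(healthy);
    model_post_request(healthy);  /* the single call after the loop */
}
```

Running 100 requests with 3 failed attempts each leaves the dead worker's busy count at 300 in the first mode and at 0 in the second, while the healthy worker ends at 0 in both.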

