tomcat-dev mailing list archives

From Mathias Herberts <>
Subject Re: jk_lb_worker.c patch
Date Sat, 04 May 2002 20:40:23 GMT

I have included the thread that started off-list via email so the list can follow along.

The setup I use is the following: a load balancer (an Alteon) sits in front
of several Apache servers, each hosted on a machine which also hosts a
Tomcat server. Let's call those Apache servers A1, A2 and A3 and the
associated Tomcat servers T1, T2 and T3.

I have been using Paul's patch, which I modified so that the lb_value field
of fault-tolerant workers would not be changed to a value other than INF.

The basic setup is that Ai can talk to all Tj, but for requests not
associated with a session, Ti will be used unless it is unavailable.
Sessions belonging to Tk will be correctly routed to Tk. The load-balancing
worker definition is different on each of the three Ai: the lbfactor is set
to 0 for the workers connecting to Tk for all k != i, and to 1.0 for the
worker connecting to Ti.
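To make this concrete, the worker definitions on A1 would look roughly like
this (a sketch only; hostnames and worker names are illustrative, not taken
from our actual setup):

```
# workers.properties on A1 (illustrative sketch)
worker.list=lb

worker.lb.type=lb
worker.lb.balanced_workers=t1,t2,t3

# local Tomcat T1: preferred for session-less requests
worker.t1.type=ajp13
worker.t1.host=localhost
worker.t1.port=8009
worker.t1.lbfactor=1.0

# remote Tomcats T2 and T3: only used for sessions they own,
# or if T1 is unavailable
worker.t2.type=ajp13
worker.t2.host=t2.example.com
worker.t2.port=8009
worker.t2.lbfactor=0

worker.t3.type=ajp13
worker.t3.host=t3.example.com
worker.t3.port=8009
worker.t3.lbfactor=0
```

On A2 and A3 the lbfactor values are rotated so that the local Tomcat is
always the one with lbfactor 1.0.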

This setup gives us sticky sessions independently of which Apache handles
the request, which is a good thing since the Alteon cannot extract the
';jsessionid=.....' part from the URL in a way which allows dispatching
requests to the proper Ai (the cookie is dealt with correctly, though).

This works perfectly except when we roll out a new release of our webapps.
In that case it would be ideal to make the load balancer ignore one Apache
server, deploy the new version of the webapp on that server, then switch
this server back on and the other two off, so that the service interruption
is as short as possible for the customers. The immediate idea, if Ai/Ti is
to be the first server to get the new webapp, is to stop Ti so that Ai will
not be selected by the load balancer. This does not work: with Paul's patch
Ti is the preferred server, BUT if Ti fails then another Tk will be
selected by Ai, so the load balancer will never declare Ai failed and will
continue to send it requests. (We did manage to make it behave that way by
specifying a test URL which includes a jvmroute to Ti, but this uses lots
of SLB groups on the Alteon.)

Bernd's patch allows Ai to reject requests when Ti is stopped; the load
balancer will therefore quickly declare Ai inactive and stop sending it
requests. This makes rolling out the new webapp very easy: set up the new
webapp, restart Ti, restart Ai, and as soon as the load balancer sees Ai,
shut down the other two Ak. The current sessions will still be routed to
the old webapp, and new sessions will see the new version. When there are
no more sessions on the old version, shut down Tk (k != i) and deploy the
new webapp there.

My remark concerning the possible selection of recovering workers ahead of
the local worker (the one with lb_value set to 0) is about the load
balancer then being unable to declare Ai inactive.
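In pseudo-C, the selection order I am worried about looks like this. This
is a simplified sketch of the logic under discussion, not the actual
jk_lb_worker.c code; the field and function names are illustrative:

```c
#include <stddef.h>

/* Illustrative worker state, loosely modelled on the discussion above. */
typedef struct {
    double lb_value;   /* 0.0 = "always prefer for session-less requests" */
    int    in_error;   /* worker previously failed                        */
    int    retry_due;  /* recovery timeout has expired                    */
} lb_worker;

/* Pick the worker with the smallest lb_value, except that a worker whose
   recovery timeout has expired is tried first regardless of lb_value --
   mirroring the behavior described above. */
int pick_worker(const lb_worker *w, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (w[i].in_error) {
            if (w[i].retry_due)
                return i;      /* recovering worker wins immediately */
            continue;          /* still in error: skip */
        }
        if (best < 0 || w[i].lb_value < w[best].lb_value)
            best = i;
    }
    return best;
}

/* Normal case: local T1 (lb_value 0.0) beats healthy remote T2. */
int demo_local_preferred(void)
{
    lb_worker w[2] = {
        { 0.0, 0, 0 },     /* T1: local */
        { 1.0, 0, 0 },     /* T2: remote, healthy */
    };
    return pick_worker(w, 2);
}

/* Problem case: remote T2 is in error with its recovery timeout expired,
   so it is selected ahead of the local worker with lb_value 0.0. */
int demo_recovering_beats_local(void)
{
    lb_worker w[2] = {
        { 0.0, 0, 0 },     /* T1: local, lb_value 0.0 */
        { 1.0, 1, 1 },     /* T2: in error, retry due */
    };
    return pick_worker(w, 2);
}
```

If the selection instead only retried recovering workers when no healthy
worker is available, Ai with Ti stopped would keep failing its requests and
the Alteon could declare it inactive as intended.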

I hope I have been clear enough and that everybody got the point; if not,
I'd be glad to explain more thoroughly.


Paul Frieden wrote:
> Hello,
> I'm afraid that I am no longer subscribed to the devel list.  I would be
> happy to add my advice for this issue, but I don't have time to keep up
> with the entire devel list.  If there is anything I can do, please just
> mail me directly.
> I chose to use the value 0 for a worker because it used the inverse of
> the value specified.  The value 0 then resulted in essentially infinite
> preference.  I used that approach purely because it was the smallest
> change possible, and the least likely to change the expected behavior
> for anybody else.  The path of least astonishment and whatnot.  I would
> be concerned about changing the current behavior now, because people
> probably want a drop-in replacement.  If there is going to be a change
> in the algorithm and behavior, a different approach may be better.
> I would also like to make a note of how we were using this code.  In our
> environment, we have an external dedicated load balancer, and three web
> servers.  The main problem that we ran into was with AOL users.  AOL
> uses a proxy that randomizes the source IP of requests.  That means that
> you can no longer count on the source IP to tell the load balancer which
> server to send future requests to.  We used this code to allow sessions
> that arrive on the wrong web server to be redirected to the tomcat on the
> correct server.  This neatly side-steps the whole issue of changing IPs,
> because apache is able to make the decision based on the session ID.
> The reliability issue was a nice side effect for us in that it caught a
> failed server more quickly than the load balancer did, and prevented the
> user from having a connection time out or seeing an error message.
> I hope this provides some insight into why I changed the code that I
> did, and why that behavior worked well for us.
> Paul
> Costin wrote:
> >Hi Mathias,
> >
> >I think it would be better to discuss this on tomcat-dev.
> >
> >The 'error' worker will not be chosen unless the
> >timeout expires. When the timeout expires, we'll indeed
> >select it ( in preference to the default ) - this is easy to fix
> >if it creates problems, but I don't see why it would be a
> >problem.
> >
> >If it is working, next request will be served normally by
> >the default. If not, it'll go back to error state.
> >
> >In jk2 I removed that - error workers are no longer
> >selected. But for jk1 I would rather leave the old
> >behavior intact.
> >
> >Note that the reason for choosing 0 ( in jk2 ) as
> >default is that I want to switch from float to ints,
> >I'm not convinced floats are good for performance
> >( or needed ).
> >
> >Again - I'm just learning and trying; if you have
> >any ideas I would be happy to hear them, patches
> >are more than welcome.
> >
> >Costin
> >
> >On Sat, 4 May 2002, Mathias Herberts wrote:
> >
> >>Hi, I just joined the Tomcat-dev list and saw your patch to
> >>jk_lb_worker.c (making it version 1.9).
> >>
> >>If I understand your patch correctly, it offers the same behavior as
> >>Paul's patch but with an opposite semantic for an lbfactor of 0.0 in the
> >>worker's definition, i.e. a value of 0.0 now means ALWAYS USE THIS
> >>FOR REQUESTS WITH NO SESSIONS. This seems fine to me.
> >>
> >>What disturbs me is what happens when one worker is in error
> >>state and not yet recovering. In get_most_suitable_worker, such a
> >>worker will be selected whatever its lb_value, meaning a recovering
> >>worker will have priority over one with an lb_value of 0.0, and this
> >>seems to break the behavior we had achieved with your patch.
> >>
> >>Did I miss something or is this really a problem?
> >>
> >>Mathias.

