tomcat-dev mailing list archives

From Bernd Koecke
Subject Re: jk_lb_worker.c patch
Date Mon, 06 May 2002 08:28:17 GMT
Hi Costin,

thanks for committing my patch :). After thinking about it, I found the same 
problem as Mathias. It's a problem for my environment too. We have the same 
problem with shutdown and recovery here. I'm in the process of looking into jk2. The 
question for jk1 is: what do we want to do if the main worker fails because of an error?

The normal intention of lb is to switch to another worker in such a case. 
But for the special use of a main worker we don't want that (at least it is an 
error in my environment here :) ). My suggestion is to add an additional flag to 
the lb_worker struct that records that we have a main worker, 
e.g. main_worker_mode. With this flag set, we send only requests with a session 
id to one of the other workers. We could also change the behavior after an error 
of another worker and check its state only if we get a request with its session 
route. This would be easy if we put the main worker at the beginning of the 
worker list and/or use the flag. But we need the flag if we want to use more than 
one main worker.
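
A minimal sketch of the selection idea described above, in C. It is not taken from jk_lb_worker.c: the types sel_worker_t and sel_lb_t, the function sel_select and all field names are hypothetical, and only the name main_worker_mode comes from the suggestion itself. Returning NULL for a session whose worker is down, and treating an unknown route like a request without a session, are assumptions made only to keep the example small.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types; not the real jk structs. */
typedef struct {
    const char *name;  /* doubles as the session route */
    int is_main;       /* 1 if this is a main (local) worker */
    int in_error;      /* 1 if the worker is currently marked failed */
} sel_worker_t;

typedef struct {
    sel_worker_t *workers;
    size_t num_workers;
    int main_worker_mode;  /* the proposed flag */
} sel_lb_t;

/* Pick a worker for a request. session_route is NULL when the request
 * carries no session id. Returns NULL if no worker qualifies. */
sel_worker_t *sel_select(sel_lb_t *lb, const char *session_route)
{
    size_t i;

    if (session_route != NULL) {
        for (i = 0; i < lb->num_workers; i++) {
            sel_worker_t *w = &lb->workers[i];
            if (strcmp(w->name, session_route) == 0) {
                /* A worker's state is checked only when a request
                 * actually names it in its session route. */
                return w->in_error ? NULL : w;
            }
        }
        /* Assumption: an unknown route is handled like no session. */
    }

    for (i = 0; i < lb->num_workers; i++) {
        sel_worker_t *w = &lb->workers[i];
        /* In main worker mode, requests without a session never go
         * to the other workers. */
        if (lb->main_worker_mode && !w->is_main)
            continue;
        if (!w->in_error)
            return w;
    }
    return NULL;
}
```

Because each worker carries its own is_main field, the same loop would also work with more than one main worker, which is the case where the position in the worker list alone is not enough.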

But what should happen if the main worker is in error state? In my patch some 
weeks ago I added an additional flag which causes the module to reject a request 
if it comes in without a session id and the main worker is down. If this flag 
is not set, or is not set to reject, the module chooses one of the other workers. 
For our environment here rejecting the request is ok, because if a request 
without a session comes to a switched-off node, we have a problem with our 
separate load balancer. This should never happen. We could make rejecting 
the default if we have a main worker, but with a separate flag it would be 
more flexible.
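
The reject-or-fall-back decision for a request without a session id can be sketched in a few lines of C. The enum, the function name and the parameter reject_if_main_down are hypothetical names chosen for this example; the real flag in the patch may be named and stored differently.

```c
/* Possible outcomes for a request that carries no session id. */
typedef enum {
    REJ_USE_MAIN,   /* main worker is up: use it */
    REJ_USE_OTHER,  /* main worker is down: fall back to another worker */
    REJ_REJECT      /* main worker is down: refuse the request */
} rej_action_t;

/* main_ok:             nonzero if the main worker is usable
 * reject_if_main_down: hypothetical name for the extra flag */
rej_action_t rej_sessionless_action(int main_ok, int reject_if_main_down)
{
    if (main_ok)
        return REJ_USE_MAIN;
    return reject_if_main_down ? REJ_REJECT : REJ_USE_OTHER;
}
```

With the flag set, a node whose main worker is down answers requests without a session with an error, which is what lets a separate load balancer notice that the node is switched off.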

I will build a patch against cvs to make my intention clearer.

Costin wrote:
> Hi Mathias,
> I think we understand your use case, it is not very uncommon.
> In fact, as I mentioned a few times, it is the 'main' use
> case for Apache ( multi-process ) when using the JNI worker.
> In this case Apache acts as a 'natural' load-balancer, with 
> requests going to various processes ( more or less randomly ).
> As in your case, requests without a session should always go
> to the worker that is in the same process.
> The main reason for using '0' for the "local" worker is that
> in jk2 I want to switch from float to int - there is no reason
> ( AFAIK ) to do all the float computation, even a short int
> will be enough for the purpose of implementing a round-robin
> with weights.
> BTW, one extension I'm trying to make is support for multiple
> local workers - I'm still thinking about how to do that. This will
> cover the case of a few big boxes, each with several tomcat 
> instances ( if you have many G of RAM and many processors, sometimes
> it is better to run more VMs instead of a single large process ) 
> In this case you still want some remote tomcats, for failover,
> but most load should go to the local workers.
> For jk2 I already fixed the selection of the 'recovering' worker,
> after timeout the worker will go through normal selection instead
> of being automatically chosen.
> For jk1 - I'm waiting for patches :-) I wouldn't do a big change -
> the current fix seemed like a good one. 
> I agree that changing the meaning of 0 may be confusing ( is it
> documented ? my says it should never be used ).
> We can fix that by using an additional flag - and not using 
> special values.
> Another special note - Jk2 will also support 'graceful shutdown',
> that means your case ( replacing a webapp ) will be handled
> in a different way. You should be able to add/remove workers
> without restarting apache ( and I hope mostly automated ). 
> Let me know what you think - with patches if possible :-)
> Costin
>>The setup I use is the following, a load balancer (Alteon) is in front
>>of several Apache servers, each hosted on a machine which also hosts a
>>Tomcat server.
>>Let's call those Apache servers A1, A2 and A3 and the associated Tomcat
>>servers T1, T2 and T3.
>>I have been using Paul's patch which I modified so the lb_value field of
>>fault tolerant workers would not be changed to a value other than INF.
>>The basic setup is that Ai can talk to all Tj, but for requests not
>>associated with a session, Ti will be used unless it is unavailable.
>>Sessions belonging to Tk will be correctly routed. The load balancing
>>worker definition is different for all three Ai, the lbfactor is set to
>>0 for workers connecting to Tk for all k != i and set to 1.0 for the
>>worker connecting to Ti.
>>This setup allows sticky sessions independently of the Apache
>>handling the request, which is a good thing since the Alteon cannot
>>extract the ';jsessionid=.....' part from the URL in a way which allows
>>the dispatching of the requests to the proper Ai (the cookie is dealt
>>with correctly though).
>>This works perfectly except when we roll out a new release of our
>>webapps. In this case it would be ideal to be able to make the load
>>balancer ignore one Apache server, deploy the new version of the webapp
>>on this server, and switch this server back on and the other two off so
>>the service interruption would be as short as possible for the
>>customers. The immediate idea, if Ai/Ti is to be the first server to
>>have the new webapp, is to stop Ti so Ai will not be selected by the
>>load balancer. This does not work, indeed with Paul's patch Ti is the
>>preferred server BUT if Ti fails then another Tk will be selected by Ai,
>>therefore the load balancer will never declare Ai failed (even though we
>>managed to make it behave like this by specifying a test URL which
>>includes a jvmroute to Ti, but this uses lots of slb groups on the
>>alteon) and it will continue to send requests to it.
>>Bernd's patch allows Ai to reject requests if Ti is stopped, the load
>>balancer will therefore quickly declare Ai inactive and will stop sending
>>it requests, thus allowing to roll out the new webapp very easily, just
>>set up the new webapp, restart Ti, restart Ai, and as soon as the load
>>balancer sees Ai, shut down the other two Ak, the current sessions will
>>still be routed to the old webapp, and the new sessions will see the new
>>version. When there are no more sessions on the old version, shut down
>>Tk (k != i) and deploy the new webapp.
>>My remark concerning the possible selection of recovering workers prior
>>to the local worker (one with lb_value set to 0) deals with the load
>>balancer not being able in this case to declare Ai inactive.
>>I hope I have been clear enough, and that everybody got the point, if
>>not I'd be glad to explain more thoroughly.
>>Paul Frieden wrote:
>>>I'm afraid that I am no longer subscribed to the devel list.  I would be
>>>happy to add my advice for this issue, but I don't have time to keep up
>>>with the entire devel list.  If there is anything I can do, please just
>>>mail me directly.
>>>I chose to use the value 0 for a worker because it used the inverse of
>>>the value specified.  The value 0 then resulted in essentially infinite
>>>preference.  I used that approach purely because it was the smallest
>>>change possible, and the least likely to change the expected behavior
>>>for anybody else.  The path of least astonishment and whatnot.  I would
>>>be concerned about changing the current behavior now, because people
>>>probably want a drop in replacement.  If there is going to be a change
>>>in the algorithm and behavior, a different approach may be better.
>>>I would also like to make a note of how we were using this code.  In our
>>>environment, we have an external dedicated load balancer, and three web
>>>servers.  The main problem that we ran into was with AOL users.  AOL
>>>uses a proxy that randomizes the source IP of requests.  That means that
>>>you can no longer count on the source IP to tell the load balancer which
>>>server to send future requests to.  We used this code to allow sessions
>>>that arrive on the wrong web server to be redirected to the tomcat on the
>>>correct server.  This neatly side-steps the whole issue of changing IPs,
>>>because apache is able to make the decision based on the session ID.
>>>The reliability issue was a nice side effect for us in that it caught a
>>>failed server more quickly than the load balancer did, and prevented the
>>>user from having a connection time out or seeing an error message.
>>>I hope this provides some insight into why I changed the code that I
>>>did, and why that behavior worked well for us.
>>> wrote:
>>>>Hi Mathias,
>>>>I think it would be better to discuss this on tomcat-dev.
>>>>The 'error' worker will not be chosen unless the
>>>>timeout expires. When the timeout expires, we'll indeed
>>>>select it ( in preference to the default ) - this is easy to fix
>>>>if it creates problems, but I don't see why it would be a problem.
>>>>If it is working, next request will be served normally by
>>>>the default. If not, it'll go back to error state.
>>>>In jk2 I removed that - error workers are no longer
>>>>selected. But for jk1 I would rather leave the old
>>>>behavior intact.
>>>>Note that the reason for choosing 0 ( in jk2 ) as
>>>>default is that I want to switch from float to ints,
>>>>I'm not convinced floats are good for performance
>>>>( or needed ).
>>>>Again - I'm just learning and trying, if you have
>>>>any ideas I would be happy to hear them, patches
>>>>are more than welcome.
>>>>On Sat, 4 May 2002, Mathias Herberts wrote:
>>>>>Hi, I just joined the Tomcat-dev list and saw your patch to
>>>>>jk_lb_worker.c (making it version 1.9).
>>>>>If I understand your patch correctly, it offers the same behavior as Paul's
>>>>>patch but with the opposite semantics for an lbfactor of 0.0 in the
>>>>>worker's definition, i.e. a value of 0.0 now means ALWAYS USE THIS
>>>>>FOR REQUESTS WITH NO SESSIONS. This seems fine to me.
>>>>>What disturbs me is what happens when one worker is in error
>>>>>state and not yet recovering. In get_most_suitable_worker, such a
>>>>>worker will be selected whatever its lb_value, meaning a recovering
>>>>>worker will have priority over one with a lb_value of 0.0 and this
>>>>>seems to break the behavior we had achieved with your patch.
>>>>>Did I miss something or is this really a problem?
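
The selection issue raised in the quoted thread can be illustrated with a small C sketch. It is a simplified model, not the code of get_most_suitable_worker: the type rec_worker_t, the function rec_most_suitable, the retry_first switch and the 60 second wait are all assumptions made for this example. With retry_first set, a failed worker whose wait has expired is returned at once, whatever its lb_value; with it cleared, that worker goes through the normal comparison, as described for jk2.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified worker record. */
typedef struct {
    const char *name;
    double lb_value;    /* lower is preferred; 0.0 marks the local worker */
    int in_error;       /* 1 if the worker is marked failed */
    time_t error_time;  /* when it was marked failed */
} rec_worker_t;

#define REC_WAIT_SECONDS 60.0  /* assumed recovery wait */

rec_worker_t *rec_most_suitable(rec_worker_t *ws, size_t n,
                                time_t now, int retry_first)
{
    rec_worker_t *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        rec_worker_t *w = &ws[i];
        if (w->in_error) {
            /* Still inside the wait period: never selected. */
            if (difftime(now, w->error_time) <= REC_WAIT_SECONDS)
                continue;
            /* Wait expired: retried immediately, ahead of even a
             * worker with lb_value 0.0. */
            if (retry_first)
                return w;
            /* Otherwise it competes on lb_value like any other. */
        }
        if (best == NULL || w->lb_value < best->lb_value)
            best = w;
    }
    return best;
}
```

In the retry_first case a request without a session can land on a remote worker even though the local one is healthy, which is the behavior Mathias is asking about.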

Dipl.-Inform. Bernd Koecke
Schlund+Partner AG
Fon: +49-721-91374-0
