tomcat-dev mailing list archives

From Bernd Koecke
Subject Re: jk_lb_worker.c patch
Date Mon, 06 May 2002 15:34:26 GMT
Hi Costin,

at the moment I'm testing my patch and it seems to work. I'll send it soon with 
an explanation of what I did. It's not the same as what Mathias did. I used the 0 value 
and one additional config flag, which results in two flags in the lb_worker struct.

But I think your suggestion for jk2 is the better way, without magic 0 values 
etc. After we get the lb_worker stable it would be better to build the new stuff 
in jk2. With the structure of jk1 I think it is a little bit difficult to build 
the desired behavior.
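
To make the two-flag idea concrete, here is a minimal sketch in C of how the
selection could look. It is not the actual patch: the types and the names
sketch_worker_t, sketch_lb_t, reject_without_session and choose_worker are
hypothetical, and main_worker_mode is only the flag name suggested in the
quoted discussion below.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types; these are not the real jk1 structures. */
typedef struct {
    const char *name;   /* also used as the session route */
    int is_main;        /* 1 if this is a main (local) worker */
    int in_error;       /* 1 if the worker is currently in error state */
} sketch_worker_t;

typedef struct {
    sketch_worker_t *workers;
    size_t num_workers;
    int main_worker_mode;        /* assumed flag: a main worker is configured */
    int reject_without_session;  /* assumed flag: reject if main worker is down */
} sketch_lb_t;

/* Returns the worker to use, or NULL if the request should be rejected. */
static sketch_worker_t *choose_worker(sketch_lb_t *lb, const char *session_route)
{
    size_t i;

    /* Requests that carry a session route go to the matching healthy worker. */
    if (session_route != NULL) {
        for (i = 0; i < lb->num_workers; i++) {
            sketch_worker_t *w = &lb->workers[i];
            if (strcmp(w->name, session_route) == 0 && !w->in_error)
                return w;
        }
    }

    /* No usable session route: prefer a healthy main worker. */
    if (lb->main_worker_mode) {
        for (i = 0; i < lb->num_workers; i++) {
            sketch_worker_t *w = &lb->workers[i];
            if (w->is_main && !w->in_error)
                return w;
        }
        /* Main worker is down: reject so the external balancer notices. */
        if (lb->reject_without_session)
            return NULL;
    }

    /* Otherwise fall back to the first healthy worker. */
    for (i = 0; i < lb->num_workers; i++) {
        if (!lb->workers[i].in_error)
            return &lb->workers[i];
    }
    return NULL;
}
```

With both flags set, a request without a session is rejected as soon as the
main worker is in error state, while requests with a session route still reach
their worker.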

Costin wrote:
> Bernd,
> At this moment I believe we should add flags and stop using the '0' value
> in the config file.
> Internally ( in the code ) - it doesn't matter, we can keep 0 or
> use the flag ( I prefer the second ).
> I'm waiting for your patch - it seems there is another bug that must 
> be fixed before we can tag - but I hope we can finish all changes in
> the next few days.
> Costin
> On Mon, 6 May 2002, Bernd Koecke wrote:
>>thanks for committing my patch :). After thinking about it, I found the same 
>>problem as Mathias. It's a problem for my environment too. We have the same 
>>problem with shutdown and recovering here. I'm in the process of looking into jk2. The 
>>question for jk1 is, what do we want to do if the main worker fails because of an error?
>>Because the normal intention of lb is to switch to another worker in such a case. 
>>But for the special use of a main worker we don't want that (at least it is an 
>>error in my environment here :) ). My suggestion is to add an additional flag to 
>>the lb_worker struct where we hold the information that we have a main worker, 
>>e.g. main_worker_mode. Because of this flag we send only requests with a session 
>>id to one of the other workers. And we could change the behavior after an error 
>>of another worker and check its state only if we get a request with its session 
>>route. This would be easy if we set the main worker at the beginning of the 
>>worker list and/or use the flag. But we need the flag if we want to use more than 
>>one main worker.
>>But what should happen if the main worker is in error state? In my patch some 
>>weeks ago I added an additional flag which causes the module to reject a request 
>>if it comes in without a session id and the main worker is down. If this flag 
>>wasn't set or was not set to reject, the module chooses one of the other workers. 
>>For our environment here rejecting the request is ok, because if a request 
>>without a session comes to a switched off node, we have a problem with our 
>>separate load balancer. This should never happen. We could make rejecting 
>>the default if we have a main worker, but with a separate flag it would be 
>>more flexible.
>>I will build a patch against cvs to make my intention clearer.
>> wrote:
>>>Hi Mathias,
>>>I think we understand your use case, it is not very uncommon.
>>>In fact, as I mentioned a few times, it is the 'main' use
>>>case for Apache ( multi-process ) when using the JNI worker.
>>>In this case Apache acts as a 'natural' load-balancer, with 
>>>requests going to various processes ( more or less randomly ).
>>>As in your case, requests without a session should always go
>>>to the worker that is in the same process.
>>>The main reason for using '0' for the "local" worker is that
>>>in jk2 I want to switch from float to int - there is no reason
>>>( AFAIK ) to do all the float computation, even a short int
>>>will be enough for the purpose of implementing a round-robin
>>>with weights.
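
As an illustration of the point above, here is a minimal sketch in C of a
weighted round-robin that uses only integer arithmetic. It is one common way
to do this and is not taken from jk2; the type wrr_worker_t and the function
wrr_pick are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical worker record for an integer-only weighted round-robin. */
typedef struct {
    const char *name;
    int weight;   /* configured integer weight, must be > 0 */
    int current;  /* running counter, starts at 0 */
} wrr_worker_t;

/*
 * On each call every counter grows by its weight, the worker with the
 * largest counter is picked, and its counter is reduced by the sum of all
 * weights. Over time each worker is picked in proportion to its weight.
 */
static wrr_worker_t *wrr_pick(wrr_worker_t *workers, size_t n)
{
    wrr_worker_t *best = NULL;
    int total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        workers[i].current += workers[i].weight;
        total += workers[i].weight;
        if (best == NULL || workers[i].current > best->current)
            best = &workers[i];
    }
    if (best != NULL)
        best->current -= total;
    return best;
}
```

With weights 2 and 1, six consecutive picks select the first worker four times
and the second worker twice, with no floating point involved.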
>>>BTW, one extension I'm trying to make is support for multiple
>>>local workers - I'm still thinking on how to do that. This will
>>>cover the case of a few big boxes, each with several tomcat 
>>>instances ( if you have many G of RAM and many processors, sometimes
>>>it is better to run more VMs instead of a single large process ) 
>>>In this case you still want some remote tomcats, for failover,
>>>but most load should go to the local workers.
>>>For jk2 I already fixed the selection of the 'recovering' worker,
>>>after timeout the worker will go through normal selection instead
>>>of being automatically chosen.
>>>For jk1 - I'm waiting for patches :-) I wouldn't do a big change -
>>>the current fix seemed like a good one. 
>>>I agree that changing the meaning of 0 may be confusing ( is it
>>>documented ? my says it should never be used ).
>>>We can fix that by using an additional flag - and not using 
>>>special values.
>>>Another special note - Jk2 will also support 'graceful shutdown',
>>>that means your case ( replacing a webapp ) will be handled
>>>in a different way. You should be able to add/remove workers
>>>without restarting apache ( and I hope mostly automated ). 
>>>Let me know what you think - with patches if possible :-)
>>>>The setup I use is the following: a load balancer (Alteon) is in front
>>>>of several Apache servers, each hosted on a machine which also hosts a
>>>>Tomcat server.
>>>>Let's call those Apache servers A1, A2 and A3 and the associated Tomcat
>>>>servers T1, T2 and T3.
>>>>I have been using Paul's patch which I modified so the lb_value field of
>>>>fault tolerant workers would not be changed to a value other than INF.
>>>>The basic setup is that Ai can talk to all Tj, but for requests not
>>>>associated with a session, Ti will be used unless it is unavailable.
>>>>Sessions belonging to Tk will be correctly routed. The load balancing
>>>>worker definition is different for all three Ai, the lbfactor is set to
>>>>0 for workers connecting to Tk for all k != i and set to 1.0 for the
>>>>worker connecting to Ti.
>>>>This setup allows sticky sessions independently of the Apache
>>>>handling the request, which is a good thing since the Alteon cannot
>>>>extract the ';jsessionid=.....' part from the URL in a way which allows
>>>>the dispatching of the requests to the proper Ai (the cookie is dealt
>>>>with correctly though).
>>>>This works perfectly except when we roll out a new release of our
>>>>webapps. In this case it would be ideal to be able to make the load
>>>>balancer ignore one Apache server, deploy the new version of the webapp
>>>>on this server, and switch this server back on and the other two off so
>>>>the service interruption would be as short as possible for the
>>>>customers. The immediate idea, if Ai/Ti is to be the first server to
>>>>have the new webapp, is to stop Ti so Ai will not be selected by the
>>>>load balancer. This does not work, indeed with Paul's patch Ti is the
>>>>preferred server BUT if Ti fails then another Tk will be selected by Ai,
>>>>therefore the load balancer will never declare Ai failed (even though we
>>>>managed to make it behave like this by specifying a test URL which
>>>>includes a jvmroute to Ti, but this uses lots of slb groups on the
>>>>alteon) and it will continue to send requests to it.
>>>>Bernd's patch allows Ai to reject requests if Ti is stopped, the load
>>>>balancer will therefore quickly declare Ai inactive and will stop sending
>>>>it requests, thus allowing to roll out the new webapp very easily, just
>>>>set up the new webapp, restart Ti, restart Ai, and as soon as the load
>>>>balancer sees Ai, shut down the other two Ak, the current sessions will
>>>>still be routed to the old webapp, and the new sessions will see the new
>>>>version. When there are no more sessions on the old version, shut down
>>>>Tk (k != i) and deploy the new webapp.
>>>>My remark concerning the possible selection of recovering workers prior
>>>>to the local worker (one with lb_value set to 0) deals with the load
>>>>balancer not being able in this case to declare Ai inactive.
>>>>I hope I have been clear enough, and that everybody got the point, if
>>>>not I'd be glad to explain more thoroughly.
>>>>Paul Frieden wrote:
>>>>>I'm afraid that I am no longer subscribed to the devel list.  I would
>>>>>be happy to add my advice for this issue, but I don't have time to keep up
>>>>>with the entire devel list.  If there is anything I can do, please just
>>>>>mail me directly.
>>>>>I chose to use the value 0 for a worker because it used the inverse of
>>>>>the value specified.  The value 0 then resulted in essentially infinite
>>>>>preference.  I used that approach purely because it was the smallest
>>>>>change possible, and the least likely to change the expected behavior
>>>>>for anybody else.  The path of least astonishment and whatnot.  I would
>>>>>be concerned about changing the current behavior now, because people
>>>>>probably want a drop in replacement.  If there is going to be a change
>>>>>in the algorithm and behavior, a different approach may be better.
>>>>>I would also like to make a note of how we were using this code.  In our
>>>>>environment, we have an external dedicated load balancer, and three web
>>>>>servers.  The main problem that we ran into was with AOL users.  AOL
>>>>>uses a proxy that randomizes the source IP of requests.  That means that
>>>>>you can no longer count on the source IP to tell the load balancer which
>>>>>server to send future requests to.  We used this code to allow sessions
>>>>>that arrive on the wrong web server to be redirected to the tomcat on the
>>>>>correct server.  This neatly side-steps the whole issue of changing IPs,
>>>>>because apache is able to make the decision based on the session ID.
>>>>>The reliability issue was a nice side effect for us in that it caught
>>>>>failed servers more quickly than the load balancer did, and prevented the
>>>>>user from having a connection time out or seeing an error message.
>>>>>I hope this provides some insight into why I changed the code that I
>>>>>did, and why that behavior worked well for us.
>>>>> wrote:
>>>>>>Hi Mathias,
>>>>>>I think it would be better to discuss this on tomcat-dev.
>>>>>>The 'error' worker will not be chosen unless the
>>>>>>timeout expires. When the timeout expires, we'll indeed
>>>>>>select it ( in preference to the default ) - this is easy to fix
>>>>>>if it creates problems, but I don't see why it would be a
>>>>>>problem. If it is working, next request will be served normally by
>>>>>>the default. If not, it'll go back to error state.
>>>>>>In jk2 I removed that - error workers are no longer
>>>>>>selected. But for jk1 I would rather leave the old
>>>>>>behavior intact.
>>>>>>Note that the reason for choosing 0 ( in jk2 ) as
>>>>>>default is that I want to switch from float to ints,
>>>>>>I'm not convinced floats are good for performance
>>>>>>( or needed ).
>>>>>>Again - I'm just learning and trying, if you have
>>>>>>any ideas I would be happy to hear them, patches
>>>>>>are more than welcome.
>>>>>>On Sat, 4 May 2002, Mathias Herberts wrote:
>>>>>>>Hi, I just joined the Tomcat-dev list and saw your patch
>>>>>>>jk_lb_worker.c (making it version 1.9).
>>>>>>>If I understand well your patch it offers the same behaviors as
>>>>>>>patch but with an opposite semantic for a lbfactor of 0.0 in the
>>>>>>>worker's definition, i.e. a value of 0.0 now means ALWAYS
>>>>>>>FOR REQUESTS WITH NO SESSIONS. This seems fine to me.
>>>>>>>What disturbs me is what is happening when one worker is in error
>>>>>>>state and not yet recovering. In get_most_suitable_worker, such a
>>>>>>>worker will be selected whatever its lb_value, meaning a recovering
>>>>>>>worker will have priority over one with a lb_value of 0.0 and this
>>>>>>>seems to break the behavior we had achieved with your patch.
>>>>>>>Did I miss something or is this really a problem?
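
The concern raised above can be illustrated with a simplified model in C. It
is not the code from jk_lb_worker.c: the type model_worker_t, the function
pick_jk1_like and the RECOVER_WAIT value are assumptions made for this sketch,
and only lb_value is a name taken from the thread.

```c
#include <stddef.h>

/* Hypothetical, simplified worker record for this model. */
typedef struct {
    const char *name;
    double lb_value;   /* lower value means higher preference */
    int in_error;      /* 1 if the worker is in error state */
    long error_time;   /* time it entered error state, in seconds */
} model_worker_t;

#define RECOVER_WAIT 60L  /* assumed recovery timeout in seconds */

/*
 * Models the selection order described above: a worker whose error timeout
 * has expired is returned immediately, before any lb_value comparison, so it
 * wins even against a worker with an lb_value of 0.0.
 */
static model_worker_t *pick_jk1_like(model_worker_t *ws, size_t n, long now)
{
    model_worker_t *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (ws[i].in_error) {
            if (now - ws[i].error_time > RECOVER_WAIT)
                return &ws[i];  /* recovering worker wins regardless of lb_value */
            continue;           /* still in error: skip it */
        }
        if (best == NULL || ws[i].lb_value < best->lb_value)
            best = &ws[i];
    }
    return best;
}
```

In this model, a local worker with lb_value 0.0 is chosen while a failed
remote worker is still inside its timeout, but loses to that remote worker on
the first request after the timeout expires.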
>>>To unsubscribe, e-mail:   <>
>>>For additional commands, e-mail: <>

Dipl.-Inform. Bernd Koecke
Schlund+Partner AG
Fon: +49-721-91374-0
