tomcat-users mailing list archives

From Mark Thomas
Subject Re: Clustering / High Availability edge cases?
Date Thu, 15 Sep 2011 07:16:29 GMT
On 13/09/2011 10:51, John Bass wrote:
> Hi all,
> I'm relatively new to clustering with Tomcat and I'm trying to understand
> the edge cases.  If I'd like to guarantee continuous availability, what are
> the caveats?
> As I understand it, Tomcat clustering will ensure that session information
> is persisted in the event of a failure.  That's fine, however, what about
> long running I/O operations?  What if my node dies in the middle of serving
> an HTTP response?  In the event of a node failure, I'm assuming that there's
> no way to recover from that and the failure will be visible to a client
> application.

Wrong. Recovery options depend on the exact failure mode and the
load-balancer configuration.

The typical sequence of events is:
- load-balancer sends request to Tomcat
- request fails
- load-balancer detects failure (either by return code or lack of response)
- load-balancer replays request to a different Tomcat node
- Tomcat generates response
- load-balancer returns response to the client
- client is unaware of the failure, although the request may appear slow,
particularly if the failure was detected via a timeout

The load-balancer configuration will control the exact circumstances
under which a request will be replayed.
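
As a rough illustration of the sequence above, here is a toy model in Java.
It is not how any particular load-balancer is implemented: the Node interface
and the dispatch method are hypothetical names invented for this sketch, and a
thrown exception stands in for whatever error code or timeout a real
load-balancer would detect.

```java
import java.util.List;
import java.util.Optional;

/** Toy model of a load-balancer replaying a failed request on another node. */
class FailoverSketch {

    /** Hypothetical backend node: returns a response or throws on failure. */
    interface Node {
        String handle(String request) throws Exception;
    }

    /**
     * Tries each node in order and returns the first successful response.
     * Returns empty if every node failed; only then would the client see an error.
     */
    static Optional<String> dispatch(List<Node> nodes, String request) {
        for (Node node : nodes) {
            try {
                return Optional.of(node.handle(request));
            } catch (Exception e) {
                // Failure detected: fall through and replay on the next node.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Node dead = req -> { throw new java.io.IOException("node down"); };
        Node healthy = req -> "200 OK for " + req;
        // The first node fails, the second one answers; the caller only sees the answer.
        System.out.println(dispatch(List.of(dead, healthy), "GET /report"));
    }
}
```

Note that replaying is only safe if the request can be repeated without side
effects, which is one reason the replay conditions are configurable.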

> Similarly, if a node fails during a long running calculation, I'm assuming
> that there's no way to persist that execution state.

Out of the box, no. You'd need to code that within the app.
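
One way to do that in the app is to checkpoint progress somewhere that
survives the node. The sketch below is only an assumption about how that might
look: the in-memory map stands in for durable or replicated storage (for
example a database), and the class, method and job names are made up for
illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of application-level checkpointing for a long-running calculation. */
class CheckpointSketch {

    /** Stand-in for durable storage (hypothetical); maps job id to {nextIndex, partialSum}. */
    static final Map<String, long[]> store = new HashMap<>();

    /**
     * Sums 1..n, saving a checkpoint after every step and resuming from the
     * last checkpoint if one exists. failAt simulates a node failure just
     * before that index is processed (0 means never fail).
     */
    static long sumTo(String jobId, long n, long failAt) {
        long[] state = store.getOrDefault(jobId, new long[] {1, 0});
        long i = state[0];
        long sum = state[1];
        for (; i <= n; i++) {
            if (i == failAt) {
                throw new IllegalStateException("simulated node failure");
            }
            sum += i;
            store.put(jobId, new long[] {i + 1, sum});
        }
        return sum;
    }

    public static void main(String[] args) {
        try {
            sumTo("job-1", 10, 5); // the "node" dies part-way through
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
        // A second attempt (e.g. on another node) resumes from the checkpoint.
        System.out.println(sumTo("job-1", 10, 0)); // prints 55
    }
}
```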

> Are those assumptions correct?  If anyone has any other comments on further
> scenarios where clustering and session persistence will not be useful in an
> HA context, I'd love to hear them.

Another failure mode to consider is node failure after the request has
been processed but before the updated session data has been replicated
to other nodes in the cluster.

If you use synchronous replication (the replication happens before the
response is completed) then this can't happen but your responses are
delayed until the replication completes.

If you use asynchronous replication then there is the possibility of
node failure before the data is replicated. Also, you must use sticky
sessions in this case since you don't want the next request being
directed to a different node before the updated session data has been
replicated.
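
To make the trade-off between the two modes concrete, here is a toy model in
Java. It does not use Tomcat's clustering classes; the two maps, the pending
queue and all the names are assumptions made purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of synchronous vs asynchronous session replication. */
class ReplicationSketch {
    final Map<String, String> primary = new HashMap<>();
    final Map<String, String> backup = new HashMap<>();
    final Queue<String[]> pending = new ArrayDeque<>();
    final boolean synchronous;

    ReplicationSketch(boolean synchronous) {
        this.synchronous = synchronous;
    }

    /** Handles a request that updates one session attribute. */
    void handleRequest(String key, String value) {
        primary.put(key, value);
        if (synchronous) {
            backup.put(key, value);                 // replicated before the response completes
        } else {
            pending.add(new String[] {key, value}); // replicated some time after the response
        }
    }

    /** Background replication step used in asynchronous mode. */
    void flush() {
        String[] kv;
        while ((kv = pending.poll()) != null) {
            backup.put(kv[0], kv[1]);
        }
    }

    /** The primary dies; returns what another node would see for the key. */
    String readAfterPrimaryCrash(String key) {
        primary.clear();
        pending.clear(); // anything not yet replicated is gone
        return backup.get(key);
    }

    public static void main(String[] args) {
        ReplicationSketch sync = new ReplicationSketch(true);
        sync.handleRequest("cart", "3 items");
        System.out.println(sync.readAfterPrimaryCrash("cart"));  // prints 3 items

        ReplicationSketch async = new ReplicationSketch(false);
        async.handleRequest("cart", "3 items");
        System.out.println(async.readAfterPrimaryCrash("cart")); // prints null
    }
}
```

In the asynchronous case the update is lost only because the failure happens
before flush() runs; the synchronous case pays for its safety inside
handleRequest.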

Finally, if using the BackupManager, multiple node failures in quick
succession will cause the loss of session data. With this manager, each
node distributes the backup copies of the session data (each primary
session has a single backup) around the other nodes in the cluster. So,
for example, in a four node cluster if node A has 30 primary sessions 10
of those will be backed up on node B, 10 on node C and 10 on node D.

If node A fails, the remaining nodes will detect this, make themselves
the primary node for the sessions they are backing up and start the
process of creating new backups on one of the remaining nodes. If a
second node fails before this is complete there is the possibility of
session loss.
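
The four node example can be worked through with a small Java sketch. It
assumes backups are spread round-robin over the other nodes, which reproduces
the 10/10/10 split above but may not match the manager's actual placement
policy; the class and method names are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one-backup-per-session placement across a cluster. */
class BackupSketch {

    /**
     * Assigns a backup node to each of the owner's primary sessions,
     * cycling round-robin through the other nodes (an assumption).
     * Returns a map of session id to backup node.
     */
    static Map<String, String> backupsFor(String owner, List<String> nodes, int count) {
        List<String> others = new ArrayList<>(nodes);
        others.remove(owner);
        Map<String, String> backups = new HashMap<>();
        for (int i = 0; i < count; i++) {
            backups.put(owner + "-session-" + i, others.get(i % others.size()));
        }
        return backups;
    }

    /** Counts how many of the sessions have their single backup on the given node. */
    static long countBackedUpOn(Map<String, String> backups, String node) {
        return backups.values().stream().filter(node::equals).count();
    }

    public static void main(String[] args) {
        Map<String, String> backups = backupsFor("A", List.of("A", "B", "C", "D"), 30);
        for (String node : List.of("B", "C", "D")) {
            System.out.println(node + " backs up " + countBackedUpOn(backups, node));
        }
        // If A fails and B then fails before it has created new backups,
        // the sessions that only B was holding are the ones at risk.
        System.out.println("at risk if A then B fail: " + countBackedUpOn(backups, "B"));
    }
}
```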

