Message-ID: <48E8CE60.30100@kippdata.de>
Date: Sun, 05 Oct 2008 16:25:36 +0200
From: Rainer Jung <rainer.jung@kippdata.de>
To: Tomcat Developers List <dev@tomcat.apache.org>
Subject: Retries in mod_jk

Hi,

since we committed some changes to retry handling in mod_jk, I have been
trying to sort out the various aspects of retries and how best to do them.

1) Retries when retrieving an endpoint for an AJP 13 worker
===========================================================

Getting an endpoint might fail if there are more threads than the size
of our connection pool. This is especially likely in the IIS and
Netscape case, but could also happen if one has a lot of threads in an
httpd process (as on Windows) but doesn't want to allow the same
number of connections to the backend (e.g. httpd also handles other tasks).

At the moment we handle retries for get_endpoint only when the AJP 13
worker is a member of a load balancer. This restriction should not be
needed, because retrying get_endpoint is not the same as request
failover: at this point we haven't yet read any request body, so there
is no need to buffer anything.

a) Refactor
-----------

I would suggest we move the retry handling of get_endpoint into
ajp_get_endpoint() and derive any configuration options of the retry
behaviour from the AJP 13 worker, not the LB worker.

Furthermore we already have in 1.2.27 an AJP13 busy counter, so we can
handle the increment/decrement inside get_endpoint, also add a BUSY
state and can therefore fail fast in get_endpoint.
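
The fail-fast idea could look roughly like the following sketch. It is
written in Python for brevity, not in mod_jk's C, and the names
EndpointPool, busy and PoolBusy are hypothetical; they do not correspond
to the actual mod_jk structures.

```python
import threading


class PoolBusy(Exception):
    """Hypothetical BUSY condition: every pooled connection is in use."""


class EndpointPool:
    def __init__(self, size):
        self.size = size          # maximum number of pooled connections
        self.busy = 0             # busy counter, maintained inside the pool
        self._lock = threading.Lock()

    def get_endpoint(self):
        with self._lock:
            if self.busy >= self.size:
                # Fail fast instead of sleeping and retrying.
                raise PoolBusy("all %d endpoints in use" % self.size)
            self.busy += 1
            return object()       # stand-in for a real endpoint

    def put_endpoint(self):
        with self._lock:
            if self.busy > 0:
                self.busy -= 1
```

Keeping the increment and decrement inside the pool means the caller
never has to touch the counter itself.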

b) Configuration
----------------

In this case we don't have a communication problem; it is simply that
more threads are already connected to the backend than we expected.

So we could either

- Produce additional connections, which get immediately destroyed after
use. Problem: there might be no backend threads available to handle
those connections.

- Wait and retry a couple of times. That's what we do at the moment. In
the assumed case that more web server threads try to talk to the
backend than we allow connections in the pool, it is very likely that
most of those additional threads will not succeed in any of those retries.

- Block for a limited time on the pool, getting notified when a
connection is returned to the pool. That seems to be the best option,
but we don't yet have the infrastructure for a blocking wait with timeout.
Mladen: would that be easy to do?
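
To illustrate the third option, here is a minimal sketch of a blocking
wait with timeout built on a condition variable. It is in Python, not
mod_jk code; BlockingPool, acquire and release are hypothetical names.
In C on POSIX platforms the analogous primitive would be
pthread_cond_timedwait, but whether that fits all supported web servers
is exactly the open question.

```python
import threading
import time


class BlockingPool:
    def __init__(self, connections):
        self._free = list(connections)       # currently idle connections
        self._cond = threading.Condition()

    def acquire(self, timeout):
        """Return a connection, or None if none is returned within timeout."""
        deadline = time.monotonic() + timeout
        with self._cond:
            while not self._free:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None              # timed out, caller may fail over
                self._cond.wait(remaining)   # woken by release() or timeout
            return self._free.pop()

    def release(self, conn):
        with self._cond:
            self._free.append(conn)
            self._cond.notify()              # wake one waiting thread
```

A waiting thread is woken as soon as a connection comes back, instead of
sleeping for a fixed interval and possibly missing it.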

c) Load-Balancer reaction
-------------------------

How should a load balancer react to an error in get_endpoint of one of
its members? Should it put the member into an error state or not? We
mark it as being BUSY, and fail over the individual request if possible
and allowed by configuration. When a connection is returned to the pool,
we clear BUSY.
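
The BUSY variant described above could be sketched as follows. This is
an illustration in Python under simplifying assumptions (no locking, no
error state); Member, pick_member and failover_allowed are hypothetical
names, not mod_jk identifiers.

```python
class Member:
    def __init__(self, name, pool_size):
        self.name = name
        self.pool_size = pool_size
        self.in_use = 0
        self.state = "OK"

    def get_endpoint(self):
        if self.in_use >= self.pool_size:
            self.state = "BUSY"      # pool exhausted, not an error
            return False
        self.in_use += 1
        return True

    def put_endpoint(self):
        self.in_use -= 1
        self.state = "OK"            # a connection is free again: clear BUSY


def pick_member(members, failover_allowed=True):
    """Return the first member that hands out an endpoint, else None."""
    for member in members:
        if member.get_endpoint():
            return member
        if not failover_allowed:
            break                    # configuration forbids failover
    return None
```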


2) Retries in AJP 13 service
============================

In this case we ran into a problem when sending the request or receiving
the response, typically a communication problem detected by some sort of
timeout. Broken connection reconnects are already handled transparently.

In the case where the request is recoverable and a retry is allowed by
configuration, we retry the request on the same backend.

We allow a defined number of retries (2 attempts = 1 retry) with a fixed
sleep time in between. This seems appropriate.
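
The attempts-versus-retries counting can be sketched like this. It is a
Python illustration, not the mod_jk implementation; the function and
parameter names are hypothetical, and ConnectionError stands in for
whatever counts as a recoverable error.

```python
import time


def service_with_retries(send_request, attempts=2, sleep_seconds=0.1):
    """Call send_request up to `attempts` times (2 attempts = 1 retry)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        if attempt > 0:
            time.sleep(sleep_seconds)    # fixed sleep between attempts
        try:
            return send_request()
        except ConnectionError as err:   # stand-in for a recoverable error
            last_error = err
    raise last_error                     # all attempts failed
```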

Depending on the specific type of processing error returned by the AJP
13 service, a load balancer puts the member into an error state or keeps
it in service. If allowed and possible, it fails over the request.


3) Retries in the Load-Balancer
===============================

If no member of a load balancer works, we also allow for retries, so we
start the whole search for a working member from the beginning.

Actually I think this is not really helpful. If all members are in
error, it is very unlikely that sleep and retry will yield a better
result. Such a complete failure will last for a longer time, and it is
better to free the web server from the request than to insist on retries.

Comments?

Regards,

Rainer
