Message-ID: <48E8CE60.30100@kippdata.de>
Date: Sun, 05 Oct 2008 16:25:36 +0200
From: Rainer Jung
To: Tomcat Developers List
Subject: Retries in mod_jk

Hi,

since we committed some changes to the retry handling in mod_jk, I am
trying to sort out the various aspects of retries and how best to
handle them.

1) Retries when retrieving an endpoint for an AJP 13 worker
===========================================================

Getting an endpoint might fail if there are more threads than
connections in our connection pool. This is especially likely with IIS
and Netscape, but it can also happen if one has a lot of threads in an
httpd process (as on Windows) but doesn't want to allow the same number
of connections to the backend (e.g. because httpd also handles other
tasks).

At the moment we handle retries for get_endpoint only when the AJP 13
worker is a member of a load balancer. This restriction to load
balancer members should not be needed, because retrying get_endpoint is
not the same as request failover: at this point we haven't yet read any
request body, so there is no need to buffer anything.

a) Refactor
-----------

I suggest we move the retry handling of get_endpoint into
ajp_get_endpoint() and derive any configuration options for the retry
behaviour from the AJP 13 worker, not the LB worker. Furthermore,
1.2.27 already has an AJP13 busy counter, so we can handle the
increment/decrement inside get_endpoint, add a BUSY state as well, and
therefore fail fast in get_endpoint.

b) Configuration
----------------

In this case we don't have a communication problem; it is simply that
more threads are already connected to the backend than we expected. So
we could either

- Produce additional connections, which get destroyed immediately
  after use. Problem: there might be no backend threads available to
  handle those connections.

- Wait and retry a couple of times. That's what we do at the moment.
  In the assumed case that more web server threads try to talk to the
  backend than we allow connections in the pool, it is very likely
  that most of those additional threads will not succeed in any of
  those retries.

- Block for a limited time on the pool, getting notified when a
  connection is returned to the pool.
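
For illustration, the third option (a blocking wait with timeout on the
pool) could look roughly like the following. This is only a minimal
sketch in Python rather than mod_jk's C, and EndpointPool, acquire()
and release() are hypothetical names, not existing mod_jk functions.
Integer ids stand in for real AJP connections.

```python
import threading
import time


class EndpointPool:
    """Hypothetical fixed-size pool; acquire() blocks for at most `timeout` seconds."""

    def __init__(self, size):
        # Integer ids stand in for real backend connections (assumption).
        self._free = list(range(size))
        self._cond = threading.Condition()

    def acquire(self, timeout):
        """Return a free endpoint id, or None if none is released in time."""
        deadline = time.monotonic() + timeout
        with self._cond:
            # Loop to guard against spurious wakeups and lost races.
            while not self._free:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None  # caller could report BUSY here
                self._cond.wait(remaining)
            return self._free.pop()

    def release(self, endpoint):
        """Return an endpoint to the pool and wake one waiting thread."""
        with self._cond:
            self._free.append(endpoint)
            self._cond.notify()
```

A thread that finds the pool empty sleeps until release() notifies it
or the deadline passes, so no fixed sleep-and-retry loop is needed. A
None result after the timeout is the point where a worker could be
reported as BUSY.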
This last option, blocking on the pool, seems to be the best one, but
we don't yet have the infrastructure for a blocking wait with timeout.
Mladen: would that be easy to do?

c) Load-Balancer reaction
-------------------------

How should a load balancer react to an error in get_endpoint of one of
its members? Should it put the member into an error state or not? We
mark it as being BUSY, and fail over the individual request if possible
and allowed by configuration. When a connection is returned to the
pool, we clear BUSY.

2) Retries in AJP 13 service
============================

In this case we ran into a problem when sending the request or
receiving the response, typically a communication problem detected by
some sort of timeout. Reconnects of broken connections are already
handled transparently.

In the case where the request is recoverable and a retry is allowed by
configuration, we retry the request on the same backend. We allow a
defined number of retries (2 attempts = 1 retry) with a fixed sleep
time in between. This seems appropriate.

Depending on the specific type of processing error returned by the
AJP 13 service, a load balancer puts the member into error or keeps it
in service. If allowed and possible, it fails over the request.

3) Retries in the Load-Balancer
===============================

If no member of a load balancer works, we also allow for retries, i.e.
we start the whole search for a working member from the beginning.

Actually I think this is not really helpful. If all members are in
error, it is very unlikely that sleeping and retrying will yield a
better result. Such a complete failure will last for some time, and it
is better to free the web server from the request than to insist on
retries.

Comments?

Regards,

Rainer