Subject: Re: TrafficServer 6, keep-alive, connection retries, and 502 Server Hangups
From: James Peach
Date: Tue, 6 Oct 2015 15:33:07 -0700
To: users@trafficserver.apache.org

> On Oct 4, 2015, at 9:16 AM, Nick Muerdter wrote:
>
> Hi,
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with
> connection retrying and outgoing keep-alive connections.
> I believe the changes in behavior might be related to this issue:
> https://issues.apache.org/jira/browse/TS-3440 However, I wasn't sure if
> the new behavior (specifically around keep-alive handling) was
> intentional or not, so I thought I'd ping the mailing list.
>
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend
> keep-alive connections already opened, but then one of the keep-alive
> connections is closed, the next request to TrafficServer may generate a
> 502 Server Hangup response when attempting to reuse that connection.
> Previously, I think TrafficServer was retrying when it encountered a
> closed keep-alive connection, but that is no longer the case. So if you
> have a backend that might unexpectedly close its open keep-alive
> connections, the only way I've found to completely prevent these 502
> errors in 6.0.0 is to disable outgoing keepalive
> (proxy.config.http.keep_alive_enabled_out and
> proxy.config.http.keep_alive_post_out settings).
>
> For a slightly more concrete example of what can trigger this, this is
> fairly easy to reproduce with the following setup:
>
> - TrafficServer is proxying to nginx with outgoing keep-alive
>   connections enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a
>   regular stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from
>   TrafficServer among your stream of requests.
>
> SIGHUPs in nginx should result in zero downtime for new requests, but I
> think what's happening is that TrafficServer may fail when an old
> keep-alived connection is reused (it's not common, so it depends on the
> timing of things and if the connection is from an old nginx worker that
> has since been shut down). In TrafficServer 5.3.1 these connection
> failures were retried, but in 6.0.0, no retries occur in this case.
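The reproduction steps quoted above could be scripted roughly as follows. This is a minimal sketch and not part of the original thread: the function names, the URL, and the nginx master PID are hypothetical placeholders, and it assumes TrafficServer is reachable directly at the given URL and that the script has permission to signal nginx.

```python
import collections
import os
import signal
import threading
import urllib.error
import urllib.request

# Talk to the given URL directly, ignoring any http_proxy environment settings.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_status(url, timeout=5.0):
    """Send one GET and return the status code, including error statuses such as 502."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        err.close()
        return err.code


def reload_periodically(pid, interval, stop, send_signal=os.kill):
    """Send SIGHUP to `pid` every `interval` seconds until `stop` is set."""
    while not stop.wait(interval):
        send_signal(pid, signal.SIGHUP)


def run_load(url, nginx_pid, total=1000, reload_interval=0.5, send_signal=os.kill):
    """Issue `total` sequential GETs while nginx is being reloaded; tally status codes."""
    stop = threading.Event()
    reloader = threading.Thread(
        target=reload_periodically,
        args=(nginx_pid, reload_interval, stop, send_signal),
        daemon=True,
    )
    reloader.start()
    counts = collections.Counter()
    try:
        for _ in range(total):
            counts[fetch_status(url)] += 1
    finally:
        # Always stop the reloader, even if a request raised.
        stop.set()
        reloader.join()
    return counts
```

For example, `run_load("http://127.0.0.1:8080/", nginx_pid=1234)` (both values hypothetical) returns a `Counter` keyed by status code; a non-zero count under `502` would correspond to the Server Hangup responses described in the quoted message.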
>
> Here are some debug logs that show the difference in behavior between
> 6.0.0 and 5.3.1. Note that the differences seem to stem from how each
> version eventually handles the "VC_EVENT_EOS" event following
> "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
>
> 5.3.1:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
>
> Interestingly, if I'm understanding the log files correctly, it looks like
> TrafficServer is reporting an odd empty response from these connections
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I
> can tell from TCP dumps on the system, nginx is not actually sending any
> form of response.
>
> So my basic question is whether the new behavior in 6.0.0 is correct or
> not. Based on the discussion in
> https://issues.apache.org/jira/browse/TS-3440 I'm unsure whether 5.3.1
> retrying on these closed keep-alive connections was actually safe or
> not. In these example cases the backend server isn't sending back any
> data (at least as far as I can tell), so from what I understand, it
> should be safe to retry. However, I'm not totally sure that this
> situation with dead keep-alive connections can properly be distinguished
> from other types of hangups or connection errors, so perhaps it isn't
> safe.
>
> If the 6.0.0 behavior is correct, is disabling outgoing keep-alive
> connections the best option if I'm worried about backend services
> unexpectedly killing off old keep-alive connections? Or is this a bug
> with 6.0.0, and should TrafficServer retries technically be possible in
> these cases?

Hi Nick,

This sounds like a 6.0 regression to me. Can you file the above
information in Jira?

thanks,
James
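
The question of when a retry is safe can be framed as a small decision function. The sketch below is a hypothetical policy written for illustration, not TrafficServer's actual logic in either 5.3.1 or 6.0.0: the `FailedAttempt` type, its fields, and `may_retry` are all invented names. It assumes a retry is acceptable only when the failure happened on a reused keep-alive connection, no response bytes arrived, the method is idempotent, and the retry budget is not exhausted.

```python
from dataclasses import dataclass

# Methods defined as idempotent by RFC 7231, section 4.2.2.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"})


@dataclass(frozen=True)
class FailedAttempt:
    """Hypothetical record of one failed origin request."""
    method: str              # HTTP method of the client request
    reused_connection: bool  # True if the origin connection came from the keep-alive pool
    response_bytes: int      # bytes of response received before the connection closed
    retries_done: int        # retries already performed for this request


def may_retry(attempt, max_retries=1):
    """Return True if this illustrative policy would retry on a fresh connection."""
    if attempt.retries_done >= max_retries:
        return False  # retry budget exhausted
    if not attempt.reused_connection:
        return False  # a hangup on a fresh connection is a real origin failure
    if attempt.response_bytes > 0:
        return False  # the origin started answering, so it may have acted on the request
    # Only idempotent methods are safe to send a second time.
    return attempt.method.upper() in IDEMPOTENT_METHODS
```

Under this policy, a GET that hits EOS on a pooled connection with zero response bytes is retried once, while a POST in the same situation is not, because the origin may have processed the request before closing.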