Mailing-List: contact users-help@trafficserver.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@trafficserver.apache.org
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 9.0 \(3095\))
Subject: Re: TrafficServer 6, keep-alive, connection retries,
 and 502 Server Hangups
From: James Peach <jpeach@apache.org>
In-Reply-To: 
 <1443975393.1364867.400869481.2BFF6EEF@webmail.messagingengine.com>
Date: Tue, 6 Oct 2015 15:33:07 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <BA85D5A2-8B29-44A9-ACDC-E7FA8D21FC69@apache.org>
References: 
 <1443975393.1364867.400869481.2BFF6EEF@webmail.messagingengine.com>
To: users@trafficserver.apache.org


> On Oct 4, 2015, at 9:16 AM, Nick Muerdter <stuff@nickm.org> wrote:
>
> Hi,
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with
> connection retrying and outgoing keep-alive connections. I believe the
> changes in behavior might be related to this issue:
> https://issues.apache.org/jira/browse/TS-3440 However, I wasn't sure if
> the new behavior (specifically around keep-alive handling) was
> intentional or not, so I thought I'd ping the mailing list.
>
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend
> keep-alive connections already opened, but then one of the keep-alive
> connections is closed, the next request to TrafficServer may generate a
> 502 Server Hangup response when attempting to reuse that connection.
> Previously, I think TrafficServer was retrying when it encountered a
> closed keep-alive connection, but that is no longer the case. So if you
> have a backend that might unexpectedly close its open keep-alive
> connections, the only way I've found to completely prevent these 502
> errors in 6.0.0 is to disable outgoing keep-alive
> (proxy.config.http.keep_alive_enabled_out and
> proxy.config.http.keep_alive_post_out settings).
>
> For a slightly more concrete example of what can trigger this, this is
> fairly easy to reproduce with the following setup:
>
> - TrafficServer is proxying to nginx with outgoing keep-alive
> connections enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a
> regular stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from
> TrafficServer among your stream of requests.
>
> SIGHUPs in nginx should result in zero downtime for new requests, but I
> think what's happening is that TrafficServer may fail when an old
> keep-alive connection is reused (it's not common, so it depends on the
> timing of things and if the connection is from an old nginx worker that
> has since been shut down). In TrafficServer 5.3.1 these connection
> failures were retried, but in 6.0.0, no retries occur in this case.
>
> Here are some debug logs that show the difference in behavior between
> 6.0.0 and 5.3.1. Note that the differences seem to stem from how each
> version eventually handles the "VC_EVENT_EOS" event following
> "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
>
> 5.3.1:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
>
> Interestingly, if I'm understanding the log files correctly, it looks like
> TrafficServer is reporting an odd empty response from these connections
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I
> can tell from TCP dumps on the system, nginx is not actually sending any
> form of response.
>
> So my basic question is whether the new behavior in 6.0.0 is correct or
> not. Based on the discussion in
> https://issues.apache.org/jira/browse/TS-3440 I'm unsure whether 5.3.1
> retrying on these closed keep-alive connections was actually safe or
> not. In these example cases the backend server isn't sending back any
> data (at least as far as I can tell), so from what I understand, it
> should be safe to retry. However, I'm not totally sure that this
> situation with dead keep-alive connections can properly be distinguished
> from other types of hangups or connection errors, so perhaps it isn't
> safe.
>
> If the 6.0.0 behavior is correct, is disabling outgoing keep-alive
> connections the best option if I'm worried about backend services
> unexpectedly killing off old keep-alive connections? Or is this a bug
> with 6.0.0, and should TrafficServer retries technically be possible in
> these cases?

Hi Nick,

This sounds like a 6.0 regression to me. Can you file the above
information in Jira?

thanks,
James
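
Editor's sketch: the retry-safety question in the quoted message (when is
it safe to retry after a reused keep-alive connection turns out to be
dead?) can be illustrated with a small decision function. This is only a
sketch under stated assumptions, not how Traffic Server decides. Every
name here (OriginAttempt, should_retry, max_retries) is hypothetical, and
the rule it encodes is the conservative one from the discussion: retry
only when the connection was a reused keep-alive connection, no response
bytes came back, and the request method is idempotent.

```python
from dataclasses import dataclass

# Methods HTTP defines as idempotent; retrying them should not change
# the outcome on the origin. POST and PATCH are deliberately excluded.
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE", "TRACE"}


@dataclass
class OriginAttempt:
    """Hypothetical record of one failed attempt to reach the origin."""
    method: str                   # request method, e.g. "GET"
    reused_keep_alive: bool       # True if an idle pooled connection was reused
    response_bytes_received: int  # bytes read from the origin before the hangup
    attempts_so_far: int          # attempts made so far, including this one


def should_retry(attempt: OriginAttempt, max_retries: int = 1) -> bool:
    """Return True if retrying on a fresh connection looks safe.

    Assumption: a hangup on a reused keep-alive connection with zero
    response bytes most likely means the origin closed the idle
    connection before it saw the request.
    """
    if attempt.attempts_so_far > max_retries:
        return False  # retry budget exhausted
    if not attempt.reused_keep_alive:
        return False  # a fresh connection failed; not the stale-connection case
    if attempt.response_bytes_received > 0:
        return False  # origin started answering, so it may have acted on the request
    return attempt.method.upper() in IDEMPOTENT_METHODS
```

Under these assumptions, a GET that hits a dead pooled connection is
retried once, while a POST in the same situation is not, since the proxy
cannot tell whether the origin processed the request before closing.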