tomcat-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Cleary <>
Subject Re: 21 second pause that randomly happens
Date Mon, 16 Jul 2018 18:48:02 GMT
On 16.07.2018 16:35, David Cleary wrote:
> 2018-07-16 15:55 GMT+03:00 David Cleary <>:
>>> We have a customer who is experiencing a random, 21 second pause when using out
> based application server. We believe this may be during a TCP connect and timeout. Logging
> indicates the pause happens before the request makes it to our back end.

> Logging where then ?

Sorry for any formatting issues. I have a digest subscription which doesn't lend well to interactivity.

Clients are running on Windows machines. Server is running on AWS and Linux. There is a cloud
firewall in between (pfSense). Do not have the details if they are running the cloud version
available on AWS.

Client logging shows this:

[18/05/16@12:12:48.822+1000] P-006760 T-002372 1 4GL REV            Trying Connection
[18/05/16@12:13:09.925+1000] P-006760 T-002372 1 4GL REV            Connect Complete
[18/05/16@12:13:09.925+1000] P-006760 T-002372 1 4GL REV            WARNING: LONG
[18/05/16@12:13:09.925+1000] P-006760 T-002372 1 4GL REV            A4DC513EA548E24508E1E90AA9EA61DD9386DDB475AD.clintons
connected 21102

Localhost access log shows this

localhost_access_log.2018-05-16.txt: - - [16/May/2018:12:13:16 +1000] "POST /apsv?CONNHDL=A4DC513EA548E24508E1E90AA9EA61DD9386DDB475AD.clintons
HTT P/1.1" 200 253 1

And our back end agent log shows this:

clintons.agent.log:[18/05/16@12:13:16.294+1000] P-019364 T-2819262208 2 AS-19 AS Application
Server connected with connection id: A4DC513EA548E24508E1E90AA9EA61 DD9386DDB475AD.clintons.
clintons.agent.log:[18/05/16@12:13:16.299+1000] P-019364 T-3688318720 2 AS-19 AS Application
Server disconnected with connection id: A4DC513EA548E24508E1E90AA9E A61DD9386DDB475AD.clintons.

Customer had some weird reconnection logic that was part of their application. After removing
the code so the logical connection would be kept open, we saw this pause happen on a standard
request. I do not know how long this logical connection was idle before running. I also do
not know if Tomcat closed the underlying socket either due to resources or a keep-alive timeout.
I was hoping logging could tell me when Tomcat binds to an incoming socket and releases it.
I was hoping to show in the above example, as far as Tomcat is concerned, the 21 second delay
happened outside of the server. Scouring the source code and trying some experimentation,
it does not appear there is logging available at the socket level.

  It mostly happens
> when we create an initial logical connection, but we have also seen it elsewhere where
> believe the TCP Keep alive was expired and a new socket had to be established. However,
> do not know this and am hoping there is some logging I can turn on in the NIO connector
> collect more data. I tried turning on logging in the Endpoint class, but that did not
> anything useful.

> If the connection request does not even reach the Tomcat back-end, that is also unlikely

> to provide much information. (Not being facetious here, just stating a fact).
> Can you do a "netstat" command on your Tomcat server when this happens ?
> If yes, maybe some part of the output would provide some information from the TCP level

> (such as a high number of connections, to the Tomcat NIO port, in some specific TCP state


Customer did some probing with Wireshark and said they were seeing a TCP retransmission (sorry,
I do not have many more details). In investigating this, we discovered this info on TCP timeouts:

" There's probably a million reasons why the client may never receive a SYN-ACK. The one I've
seen more often is packet loss, which in turn can have lots of reasons, for example a malfunctioning
or misconfigured network switch.
However, you can immediately spot if your timeout/hang problems are caused by TCP retransmission
because they happen to cause response times that are unusually frequently distributed around
3, 9 and 21 seconds (and on, of course).
In fact, the TCP retransmission timeout starts at 3 seconds, but if the client tries to resend
after a timeout and still receives no answer, it doubles the wait to 6 s, so the total response
time will be 9 seconds, assuming that the client now finally receives the SYN-ACK. Otherwise,
3 + 6 + 12 = 21, then 3 + 6 + 12 + 24 = 45 s and so on and so forth."

This is why we are focusing on the TCP layer.

>  There is a NAT firewall between the client and server, so I'm looking for
> some TCP level logging that could point me in the proper direction.
>> Tomcat version = ?
> Sorry. Tomcat 8.5.27.

> And on which kind of O.S. is this happening ?

> Also maybe another question : is this happening on a Tomcat server which is dedicated
> that particular customer ? or is the Tomcat server shared between different customers,

> only that particular customer experiences these delays ?

We sell an application server that customers create their own applications on. This particular
customer has many customers themselves. The customer's application do not exhibit this running
on our older, non-Tomcat based AppServer. Since this happens randomly, only a couple of times
a day, it is difficult to diagnose. Since the customer doesn't see this issue, running the
same exact client, on our older AppServer, they believe it is the new one. However, the older
one isn't HTTP-based, and they had a bunch of hacks related to connections where they would
recycle a connection after 5 minutes of inactivity, or after a 30 minute lifespan. This says
to me the issue clearly isn't our appserver, but I can't prove it at this point. The firewall
is my likely culprit, but without logging at the Tomcat endpoint, I can't definitively say
where the pause is.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message