tomcat-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rainer Jung <rainer.j...@kippdata.de>
Subject Re: IIS 6.0 / JK1.2.25 / Tomcat 5.5.20 - "Service temporary unavailable"
Date Wed, 02 Jul 2008 19:29:27 GMT
Hi Jesse,

Jesse Klaasse schrieb:
> I have migrated a customer's server system to the following
> configuration:
>  
> MS IIS 6.0 (port 80)
> Unlimited connections
> Connection timeout: 120s keep-alive
>  
> JK 1.2.25 AJP/1.3 Connector
> isapi_redirect.properties:
> uri_select=parsed
>  
> workers.properties:
> connect_timeout=10000
> prepost_timeout=10000
> socket_timeout=10
> connection_pool_timeout=60
> connection_pool_size=350
> 
> Apache Tomcat 5.5.20 (ports 8080 / 8009)
> Apache Portable Runtime tcnative-1.dll 1.1.12 installed
> Java options: -XX:MaxPermSize=512m -Xloggc:D:\logs\gc\tomcat-gc.log
> -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+UseTLAB -XX:+CMSClassUnloadingEnabled
> -XX:+CMSPermGenSweepingEnabled -Xms4096m -Xmx5120m
> 
> server.xml connector 8009:
> acceptCount="100" maxThreads="400" minSpareThreads="25"
> maxSpareThreads="75" connectionTimeout="60000"
> 
> This is a dual-core Xeon machine running Windows Server 2003 x64 with 6
> GB RAM and Sun Java JDK 1.5.
> 
> I have to reboot Tomcat every night, and even then I have to restart
> Tomcat manually almost every day because of a "Service temporary
> unavailable" message. When Tomcat is hanging, JK Status Manager shows a
> Bad/Stopped worker. A reset of the worker doesn't fix the problem,
> neither does a restart of IIS. Only restarting Tomcat fixes the problem.

Double check: The worker is a member of a load balancer. the member is 
*not* in state STOP (because that is a configuration state) but in ERROR 
(which is a runtime detected state).

If this state doesn't go away, we have a persistent comminucation 
problem with Tomcat (which is what your log files confirm).

In order to analyze, why this is, we can check the state of Tomcat 
itself, and we can think about the state of the communication between 
IIS and Tomcat.

First: you don't use a reply_timeout? At this stage you shouldn't, just 
want to make sure.

Regarding error codes: the two digit error codes XX (like 60 or 53) 
corrspond to winsock error codes 100XX. Those are listed e.g. on

http://support.microsoft.com/kb/819124

53=10053=WSAECONNABORTED
60=10060=WSAETIMEDOUT

Now what are possible reasons, and how to analyze?

1) Check the internal status of Tomcat, when the problem shows up. For 
this make a couple of thread dumps, e.g. three thread dumps each about 3 
seconds apart from each other. Thread dumps contain the information, in 
which method stack each Tomcat thread is. So we can see, how many Tomcat 
threads actually are working on a request, and often we can only see, 
what type of request, if they are waiting for database responses etc.

With this test we can check whether there is congestion inside Tomcat, 
or if the problem lies in the communication layer between IIS and Tomcat.

How to do thread dumps: if Tomcat is running from a DOS box, you can use 
CTRL-Break on the keyboard (and the dumps go directly to the DOS box), 
if it is running as a service, there is an entry in the context menue of 
the tomcat monitor icon (system tray), and the dumps go to the service 
log file.

You can post the Dumps here, maybe it's better to open an issue, because 
then you can attach the files to Bugzilla.

2) Assume, that Tomcat seems to be mostly idle, once the problem shows up.

Then we would need to understand the communication layer:

Use "netstat -an" on the IIS system and the Tomcat system (if they are 
not the same) to produce a list of TCP connections and their state.

If possible use wireshark to produce a full packet dump of the 
communications between the two for a minute or so, namely long enough, 
that the cited log message occur a few times.

3) General suggestion ("shot in the dark")

You could do a simple test to quickly check:

- remove the socket_timeout
and
- remove the APR connector (tcnative)

If this solves the problem, check, if removing only of of them suffices.
If this quick test indicates APR connector as problematic, upgrade to 
1.1.13 (or the soon to appear 1.1.14).


> The isapi_redirect log contains a lot of these lines:
> [Wed Jul 02 16:19:16.908 2008] [11604:13424] [info]
> ajp_send_request::jk_ajp_common.c (1265): (ajp13w) all endpoints are
> disconnected
> [Wed Jul 02 16:19:16.908 2008] [11604:13424] [info]
> ajp_service::jk_ajp_common.c (2085): (ajp13w) sending request to tomcat
> failed,  recoverable operation attempt=1

Those are only higher level messages, not indicating the root cause.

> I have also tried the 1.2.26 connector. It seemed like it needed a
> restart even sooner with that one. These lines were in

The log information in 1.2.26 should be more precise though. At least 
for me ;)

> isapi_redirect.log a lot of times with the 1.2.26 version:
> [Wed Jul 02 15:44:51.040 2008] [11604:12396] [error]
> ajp_connection_tcp_get_message::jk_ajp_common.c (966): (ajp13w) can't
> receive the response message from tomcat, network problems or tomcat
> (127.0.0.1:8009) is down (errno=60)

The line number tells me, that is still 1.2.25, but I assume you get the 
same message text for 1.2.26.

Here I guess: since there was no reply_timeout set, the socket_timeout 
fires after 10 seconds, aborts the wait and resets the connection. If 
you can log response times with IIS, you could check, if they are above 
10 seconds. You can also log response times with an appropriate 
JkRequestLogFormat.

You could set a reply timeout to a huge value, like eg. 60 seconds, if 
you think that even under load *all* requests should return in less than 
60 seconds. We can optimize this setting later (with max_reply_timeouts 
in 1.2.26).

You could try TCP tuning like in

http://support.microsoft.com/kb/191143

but I doubt, that this will resolve the root consequence.

> [Wed Jul 02 15:44:51.040 2008] [11604:12396] [error]
> ajp_get_reply::jk_ajp_common.c (1658): (ajp13w) Tomcat is down or
> refused connection. No response has been sent to the client (yet)
> [Wed Jul 02 15:44:51.040 2008] [11604:12396] [info]
> ajp_service::jk_ajp_common.c (2046): (ajp13w) receiving from tomcat
> failed, recoverable operation attempt=0
> [Wed Jul 02 15:44:51.040 2008] [11604:12396] [info]
> ajp_service::jk_ajp_common.c (2085): (ajp13w) sending request to tomcat
> failed,  recoverable operation attempt=1

Those are followup messages. Next line is for another thread ID, so it 
comes from handling another request. It can't even connect, so I expect 
all threads to be busy (see thread dump).

> [Wed Jul 02 15:44:51.946 2008] [11604:13156] [error]
> ajp_service::jk_ajp_common.c (2097): (ajp13w) Connecting to tomcat
> failed. Tomcat is probably not started or is listening on the wrong port
> [Wed Jul 02 15:44:51.962 2008] [11604:13156] [info]
> service::jk_lb_worker.c (1128): service failed, worker ajp13w is in
> error state
> [Wed Jul 02 15:44:51.962 2008] [11604:13156] [info]
> service::jk_lb_worker.c (1191): All tomcat instances are busy or in
> error state
> [Wed Jul 02 15:44:51.962 2008] [11604:13156] [error]
> HttpExtensionProc::jk_isapi_plugin.c (1537): service() failed with http
> error 503

Included from your other post:

> Additional information:
> When using version 1.2.26 of the connector, these messages appear in
> isapi_redirect.log before the messages I quoted in my previous mail:
> 
> [Wed Jul 02 15:39:34.690 2008] [11800:10604] [info] service::jk_lb_worker.c
> (1221): Forcing recovery once for 1 workers
> [Wed Jul 02 15:39:34.690 2008] [11800:10604] [error]
> ajp_connection_tcp_send_message::jk_ajp_common.c (965): sendfull returned
> -53 (errno=53)

Aha, if this is really coming before the error "60", then you should aso 
look at:

http://support.microsoft.com/kb/931319/

Unfortunately the description is not very technically precise.

> [Wed Jul 02 15:39:34.690 2008] [11800:10604] [info]
> ajp_handle_cping_cpong::jk_ajp_common.c (803): can't send cping query
> [Wed Jul 02 15:39:34.690 2008] [11800:10604] [error]
> ajp_connect_to_endpoint::jk_ajp_common.c (904): (ajp13w) cping/cpong after
> connecting to the backend server failed (errno=53)
> [Wed Jul 02 15:39:34.690 2008] [11800:10604] [error]
> ajp_send_request::jk_ajp_common.c (1359): (ajp13w) connecting to backend
> failed. Tomcat is probably not started or is listening on the wrong port
> (errno=53)
> [Wed Jul 02 15:39:34.690 2008] [11800:10604] [info]
> ajp_service::jk_ajp_common.c (2186): (ajp13w) sending request to tomcat
> failed (recoverable), because of error during request sending (attempt=1)
...

> Nothing important seems to be in the stdout logs right before the
> problems occur. Only when I was using 1.2.26 of the connector, lots of
> these messages show up in the stdout:
> 
> Jul 2, 2008 1:14:17 PM org.apache.jk.common.ChannelSocket
> processConnection
> WARNING: processCallbacks status 2
> Jul 2, 2008 1:14:17 PM org.apache.jk.core.MsgContext action
> WARNING: Error sending end packet
> java.net.SocketException: Software caused connection abort: socket write
> error
> 	at java.net.SocketOutputStream.socketWrite0(Native Method)
> 	at
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> 	at
> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> 	at
> org.apache.jk.common.ChannelSocket.send(ChannelSocket.java:531)
> 	at
> org.apache.jk.common.JkInputStream.endMessage(JkInputStream.java:112)
> 	at org.apache.jk.core.MsgContext.action(MsgContext.java:293)
> 	at org.apache.coyote.Response.action(Response.java:182)
> 	at org.apache.coyote.Response.finish(Response.java:304)
> 	at
> org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:281)
> 	at
> org.apache.catalina.connector.Response.finishResponse(Response.java:476)
> 	at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:1
> 51)
> 	at
> org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:199)
> 	at
> org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:282)
> 	at
> org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:767)
> 	at
> org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:
> 697)
> 	at
> org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.
> java:889)
> 	at
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool
> .java:684)
> 	at java.lang.Thread.run(Thread.java:595)

If I'm right above, and the socket_timeout closes the waiting for 
response, then Tomcat can't send back the response, once the Timeout fired.

> I'm starting to be pretty desperate, after almost two weeks of daily
> reboots and an angry customer.
> A suggestion, anyone? Thanks in advance.

Maybe too many suggestions and not a straight solution, but if you are 
able to collect more information, we should be able to sort this out.

Do others have the same issue on Windows? Did they find a solution?

Regards,

Rainer

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Mime
View raw message