tomcat-users mailing list archives

From Rainer Jung <rainer.j...@kippdata.de>
Subject Re: mod_jk Problems - - worker went to error state and dont recover
Date Wed, 20 Feb 2008 14:56:42 GMT
samk@twinix.com wrote:
> See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on behalf of a User
> 
> Hello to all. After long, unsuccessful research I hope someone can
> give me a hint about the following problems.
> 
> Our Apache-mod_jk-Tomcat infrastructure ran without problems
> for about a year; then, for the past two months, mod_jk errors have occurred.
> We upgraded the mod_jk version and made improvements to
> worker.properties. The problems changed and became less frequent, but
> they still appear from time to time.
> 
> It seems that the mod_jk workers lose the connection to their
> Tomcat backend servers; there are messages in the mod_jk log files which
> point in this direction. Normally this does not seem to be a big problem,
> but under certain conditions (which?) a worker goes into an error state
> and cannot recover by itself; recovery must be done manually.
> 
> Problem 1: The Tomcats are reachable - unknown why the workers think the server is dead?
> Problem 2: I have no idea why the worker goes to an error state and cannot recover.

2 is a consequence of 1

> Problem 3: I am missing explanations of the logged messages. I read the messages but cannot match
> them to the situation. When does a worker post these messages?

1 is a consequence of these messages

> [Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi
> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
> [Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.

The second line tells us that your configured reply_timeout fired.
You set it to 120000 ms (2 minutes), so there are requests that take
longer than 2 minutes on the backend before the first response packet
comes back.

With your configuration, mod_jk then stops waiting for the reply
*and puts the backend into error state*.

Up until version 1.2.25, if you use a reply_timeout, you need to set it 
to a number high enough to justify the reasoning "if it takes that long, 
then something is wrong with the backend".

Reality shows: there is no such number. Often there are a few requests 
that take unacceptably long on the backend *although* the backend is 
still working.

So in 1.2.25 we added max_reply_timeouts. With this set in addition to 
reply_timeout, mod_jk will abort waiting for a reply after 
reply_timeout, but allow some timeouts before actually deciding to put 
the backend into error.

Unfortunately the implementation of max_reply_timeouts in 1.2.25 was 
wrong, so you need to go to 1.2.26 to get it working right.

See:

http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
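
To make the idea behind max_reply_timeouts easier to follow, here is a 
toy model in Python. It is only a sketch of the behaviour as I read it 
in the workers.properties documentation, not mod_jk's actual code: the 
class and method names are made up, and the assumption is that each 
reply timeout increments a per-member counter, the member goes into 
error only when the counter exceeds max_reply_timeouts, and the counter 
is halved whenever the load balancer runs its maintenance (by default 
every 60 seconds).

```python
class MemberModel:
    """Toy model of one balanced worker; NOT mod_jk's real implementation."""

    def __init__(self, max_reply_timeouts):
        self.max_reply_timeouts = max_reply_timeouts
        self.timeout_count = 0
        self.in_error = False

    def on_reply_timeout(self):
        """A request hit reply_timeout; return True if the member is now in error."""
        self.timeout_count += 1
        # Assumed rule: only *more than* max_reply_timeouts timeouts trigger error.
        if self.timeout_count > self.max_reply_timeouts:
            self.in_error = True
        return self.in_error

    def on_maintenance(self):
        """Assumed rule: the counter is halved on every maintenance run."""
        self.timeout_count //= 2


if __name__ == "__main__":
    member = MemberModel(max_reply_timeouts=6)
    for _ in range(6):
        member.on_reply_timeout()
    print("in error after 6 timeouts:", member.in_error)
    member.on_reply_timeout()
    print("in error after 7 timeouts:", member.in_error)
```

With max_reply_timeouts=0 the model puts the member into error on the 
first timeout, which matches the behaviour described above for a plain 
reply_timeout.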

Caution: this does *not* explain why the backends are not automatically 
recovered after a minute in error state. Maybe there are times when 
you get too many of those reply timeouts (see the log file), and although 
we recover after a minute, the backend almost immediately goes back into 
error state.

> -> Which timeout? How does mod_jk decide that Tomcat is down? Where can I find details
> on errno=110?...

reply_timeout, see above and also

http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html

errno: a standard unix feature. The numbers are platform dependent. I 
would assume in your case

ETIMEDOUT       110     /* Connection timed out */

so no surprise: that's exactly what we expect (and it doesn't tell us the 
reason, i.e. what's wrong on the *backend* that makes the response take that long).
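
If you want to check what a given errno value means on a particular 
machine, a small Python sketch like the following can help. Python is 
just my choice for illustration and describe_errno is a made-up helper 
name; the result depends on the platform the script runs on, so run it 
on the host that wrote the log.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 110 is the value from the mod_jk log line; its meaning depends on the OS.
    name, message = describe_errno(110)
    print(f"errno 110 on this platform: {name} ({message})")
    print(f"ETIMEDOUT on this platform has the value {errno.ETIMEDOUT}")
```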

> -> receiving reply from tomcat failed with out recovery in send loop attempt=0  -
> ? with out recovery in send loop - means?

It means that your configuration doesn't allow us to send the request to 
another backend. recovery_options=7 includes: if mod_jk was able to send 
the request to a backend, do not try to send it to another backend in 
case of an error during response handling. Even if you allowed 
sending to another backend, it would not prevent the worker from being 
put into error state. More likely, you would put all workers into error 
state, because all of them might run into the same timeout, one after 
the other.

> -> unrecoverable error 504 - details to this error ?

That's simply the HTTP status code with which we report the situation back to the client (browser).

> 
> OK, I turned the logging level to debug. The course of events became
> clearer, but more questions appeared. There are socket numbers:
> which sockets, and what are these numbers, e.g. "will be shutting down socket
> 35 for worker INETP1021"? What are the sockets used for? How many are
> there per worker? Can I configure them?

That should not be the problem here. For Apache httpd, if you do *not* 
configure anything, we automatically use the number of httpd threads 
as the maximum number of connections. No need to change anything here.
> 
> => Generally: how can I solve such problems? I tried to look into the
> mod_jk code, searching for error codes and error messages, but could not
> find any relevant information. I am studying the log files but
> cannot find out what really happens.

Post to the list. Improve our docs.

The error message contains the words "timeout" and "reply" and you have a 
"reply_timeout".

Long-running requests are a frequent problem. If you want to get rid of 
them, start by adding response times (%D) to your httpd and your Tomcat 
access log formats. Then look at which URLs produce the long-running 
requests, at what time of day they happen, etc. This might give you a 
clue about the reasons.
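
Once %D is in the access log, a short script can list the slow URLs per 
hour of day. The following Python sketch assumes the common log format 
with %D appended as the last field (the sample line in the comment is 
invented); slow_requests, summarize and the units_per_second parameter 
are names I made up for illustration. httpd writes %D in microseconds, 
while the Tomcat AccessLogValve of this generation writes milliseconds, 
so pass the unit that matches the log you are reading.

```python
import re
import sys
from collections import defaultdict

# Assumed layout: common log format plus %D as the last field, e.g.
# 10.0.0.1 - - [20/Feb/2008:10:04:39 +0100] "GET /swl/x HTTP/1.1" 200 512 125000000
LINE_RE = re.compile(
    r'\[(?P<day>[^:]+):(?P<hour>\d{2}):[^\]]*\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<duration>\d+)\s*$'
)


def slow_requests(lines, threshold_seconds, units_per_second=1_000_000):
    """Yield (hour, url_path, seconds) for requests slower than the threshold.

    units_per_second is 1_000_000 for microseconds and 1_000 for milliseconds.
    """
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        seconds = int(match.group("duration")) / units_per_second
        if seconds > threshold_seconds:
            path = match.group("url").split("?")[0]  # drop the query string
            yield match.group("hour"), path, seconds


def summarize(lines, threshold_seconds, units_per_second=1_000_000):
    """Count slow requests per (hour of day, URL path)."""
    counts = defaultdict(int)
    for hour, path, _ in slow_requests(lines, threshold_seconds, units_per_second):
        counts[(hour, path)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Usage: python slow_requests.py access_log [threshold_seconds]
    threshold = float(sys.argv[2]) if len(sys.argv) > 2 else 120.0
    with open(sys.argv[1]) as log:
        for (hour, path), count in sorted(summarize(log, threshold).items()):
            print(f"{hour}h {path}: {count} slow request(s)")
```

A threshold equal to your reply_timeout (120 seconds here) shows exactly 
the requests that would have tripped it; a lower threshold shows the 
ones that are getting close.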

And if they are very frequent: do Java Thread Dumps of your backends and 
analyze them.

> So maybe someone has an idea why the worker thinks that the
> corresponding Tomcat is dead, and why it will not recover by itself.

Tomcat is dead: from the point of view of mod_jk this simply means that 
we didn't get an answer when we expected one. Details depend on the 
additional log lines (could not connect, reply timeout etc.).

> And I am also looking for tips on how I can help myself, and where to
> find something about the error codes and messages in mod_jk.
> 
> thanks for your attention
> Best
> ahmed musa (writing from vienna)
>

Regards,

Rainer

> Current infrastructure
> We have 3 Apache web servers (2.2.6) based on CentOS release 4.3 / kernel version 2.6.9-34.
> In front of the web servers there are two (two locations) hardware load balancers (but they
> play no role in this story).
> The web servers are hosted at our ISP.
>  
> The web servers balance the requests via mod_jk (version 1.2.25) for
> approx. 10 webapps across 18 backend Tomcat servers (blade servers; because of
> underlying application parts the OS is Windows 2003 Server, a long
> story not worth explaining :-) ). The Tomcat servers get their data via
> requests against DB2 servers/DB2 databases on the mainframe. The
> Tomcat servers are in-house and are rebooted nightly because of automated
> deployment processes.
> 
> Between the web servers and the Tomcat servers there is a Checkpoint firewall.
> All webapps are deployed on all Tomcats; only mod_jk directs the
> requests to certain Tomcat instances.
> (On each blade server there are two identical Tomcat instances
> running.)
> 
> Versions: Tomcat - 5.5.17_11, JDK 1.5.0_11-b03. The requests against
> the public website(s) are normal, short-lived requests, and not many. Most
> webapps (portals) require a login and have a strong focus on business
> logic, so the instances are big (many MBs in RAM), the sessions are
> sticky and the session timeout is 20 minutes. But these also receive few
> requests. In addition to the user requests there are monitoring requests from our ISP.
> The problems appear at servers/portals which have very few user requests.
> 
> worker.properties
> worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus
> 
> worker.template.type=ajp13
> worker.template.lbfactor=5
> worker.template.socket_keepalive=1
> worker.template.connect_timeout=7000
> worker.template.prepost_timeout=5000
> worker.template.reply_timeout=120000
> worker.template.retries=6
> worker.template.activation=Active
> worker.template.recovery_options=7
> 
> worker.lbtemplate.type=lb
> worker.lbtemplate.max_reply_timeouts=6
> worker.lbtemplate.method=Session
> 
> #Produktions Worker
> # AS-INETP101 - 106 - 6/6 GGI
> worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
> worker.INETP1011.port=65001
> worker.INETP1011.reference=worker.template
> 
> ....many more of the same
> 
> then
> 
> worker.ajp_ad.reference=worker.lbtemplate
> worker.ajp_ad.balance_workers=INETP1032,INETP1062
> 
> .... many more portals
> 
> and finally jkstatus
> 
> The JKMount is very simple
> JkMount /* ajp_ad    --- for the other portals mostly the same
> 
> The Portals are Virtual Hosts on the Apache.
> 
> Tomcat - server.xml
> example
> <Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
>     <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
> ......
> <Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
> autoDeploy="false" deployOnStartup="false" xmlValidation="false"
> xmlNamespaceAware="false">
>         <Alias>www.slfinsol.com</Alias>
>         <Alias>web1.slfinsol.com</Alias>
>         ...
>         <Alias>testweb.slfinsol.com</Alias>
>         .....
>         <Valve className="org.apache.catalina.valves.AccessLogValve"
> directory="logs" prefix="swl_access_log." suffix=".txt" pattern="common"
> resolveHosts="false" />
>         <Valve
> className="at.allianz.tomcat.valve.RequestTimeValve"/>
>         <Valve
> className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
>         <Context path="" docBase="swl" />
>         <Context path="/monitor5" docBase="monitor" />
>         <Context path="/swl" docBase="swl" />
>       </Host>    

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org

