tomcat-users mailing list archives

From <luke.wal...@bt.com>
Subject RE: mod_jk Problems - - worker went to error state and dont recover
Date Thu, 21 Feb 2008 09:27:31 GMT
All

Apologies, this is unrelated. How do I unsubscribe from this mailing
list? I thought it would be useful and small, but it's overwhelming my
inbox.

Thanks in Advance.

Luke Walshe
BT Operate, HGIPCC Technical Specialist
Telephone: +44 (0)1314483482, Email: Luke.Walshe@bt.com 

-----Original Message-----
From: Ahmed Musa [mailto:donald1090@gmx.at] 
Sent: 21 February 2008 09:25
To: Tomcat Users List
Subject: Re: mod_jk Problems - - worker went to error state and dont recover

Hello Rainer,
Thanks for your information - the situation is clearer now.
I will reread some of the docs, follow your links, and run further
tests with the improved logging.
Thanks a lot for your time.
With best regards,
ahmed

-------- Original Message --------
> Date: Wed, 20 Feb 2008 18:59:01 +0100
> From: Rainer Jung <rainer.jung@kippdata.de>
> To: Tomcat Users List <users@tomcat.apache.org>
> Subject: Re: mod_jk Problems - - worker went to error state and dont recover

> Ahmed Musa wrote:
> > Hello,
> > Wow - thank you very much, Rainer, for your very quick and informative
> > answer.
> > I will go to 1.2.26 and think about some "smoother" values for
> > reply_timeout and max_reply_timeouts.
> > I will search for the requests that cause the problems, because I
> > still log the response time in the way you mentioned, but I am not
> > sure that the user requests are responsible for the situation.
> 
> One note: for Apache httpd 2.x %D is microseconds (there is no format
> for milliseconds), for Tomcat %D is milliseconds. As long as you are
> searching for the root cause, it might make sense to have both access
> logs active to check for duration differences.
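
A minimal sketch of the comparison suggested above, assuming both access logs record the request duration (httpd in microseconds, Tomcat in milliseconds). The function names and the sample values are hypothetical and not part of mod_jk or Tomcat:

```python
def httpd_duration_ms(httpd_us: int) -> float:
    """Convert an httpd access-log duration (microseconds) to milliseconds."""
    return httpd_us / 1000.0


def duration_gap_ms(httpd_us: int, tomcat_ms: int) -> float:
    """Time a request spent outside Tomcat (mod_jk, network, firewall), in ms."""
    return httpd_duration_ms(httpd_us) - tomcat_ms


# Hypothetical sample: httpd logged 2,500,000 us, Tomcat logged 2,400 ms.
print(duration_gap_ms(2_500_000, 2_400))
```

A large gap for the same request would point at the path between httpd and Tomcat rather than at the webapp itself.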
> 
> > So one further question: does mod_jk itself check whether the backend
> > is reachable, without user requests?
> 
> No. Everything only works on top of user requests.
> 
> > When there are connections to the backend, are they closed after the
> > response, or are they held open for further requests?
> 
> In general they are held open. There are parameters for how long they
> are held open without more requests before they get shut down, and also
> how many might be kept open even when no requests are coming in. Those
> are the connection pool parameters, which you will find at
> 
> http://tomcat.apache.org/connectors-doc/reference/workers.html
> 
> Tomcat also has a connectionTimeout on the connector, which will shut
> down a connection from the Tomcat side if it is idle for too long.
> 
> If you don't want to reuse connections at all, there's also a setting
> (a JkOption in Apache).
> 
> > Is it possible that the Checkpoint firewall in between is
> > responsible for the connectivity problem?
> 
> It can cut a connection that's idle for too long. Since you have
> cping/cpong active via connect_timeout and prepost_timeout, you should
> get a cping error message if the connection was dropped by the firewall
> during idle times and mod_jk tries to use it again. The reply timeout in
> the error log indicates that the backend isn't answering. Of course, if
> it takes *very* long to answer, it might be that the firewall dropped
> the connection in between, but then the root cause would still be the
> long response time of the backend.
> 
> > Another point is the "not recovering" of the worker. Yes, you are
> > right: in this situation I have many reply_timeouts, but they happen
> > within a period of time, for example 30 minutes, and the worker is
> > still dead even when there are no more reply_timeouts. It remains dead.
> > It was necessary to restart it manually via jkstatus.
> 
> I assume you are using stickiness, so when a session is started on a
> node, it will stay there. So when a worker is in error for a long time,
> all new sessions will start on other nodes. Once the worker is ready for
> recovery, it needs a request that doesn't carry a session in order to
> get probed with this request.
> 
> In jkstatus, the status of an error worker should switch to REC when
> mod_jk decides that it could send a non-sticky request there (to probe),
> to PRB during the time this request is on the node, and finally
> either to OK or back to ERR depending on the result of the request.
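
The state changes described above can be sketched as a small transition table. This is only a model of the explanation in this mail, not mod_jk code; the event names are made up for illustration:

```python
from enum import Enum


class WorkerState(Enum):
    OK = "OK"
    ERR = "ERR"
    REC = "REC"  # ready for recovery, waiting for a non-sticky request
    PRB = "PRB"  # probe request is currently on the node


# Event names are illustrative only; they are not mod_jk identifiers.
TRANSITIONS = {
    (WorkerState.ERR, "recovery_allowed"): WorkerState.REC,
    (WorkerState.REC, "probe_sent"): WorkerState.PRB,
    (WorkerState.PRB, "probe_ok"): WorkerState.OK,
    (WorkerState.PRB, "probe_failed"): WorkerState.ERR,
}


def next_state(state: WorkerState, event: str) -> WorkerState:
    """Return the follow-up state, or the unchanged state for other events."""
    return TRANSITIONS.get((state, event), state)
```

In this model a worker in REC stays there until a request without a session arrives, which is why a portal with very few new sessions can look "dead" for a long time.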
> 
> You can log the number of errors (and accesses) that happened on the
> node in the httpd access log. If you think that the node simply stays in
> error for a long time, then the error count (and access count) should
> stay constant. I would expect that they do not.
> 
> Have a look at how LogFormat in Apache httpd works, and then add some of
> the notes documented in
> 
> http://tomcat.apache.org/connectors-doc/reference/apache.html
> 
> like:
> 
> JK_LB_LAST_NAME
> JK_LB_LAST_ACCESSED
> JK_LB_LAST_ERRORS
> JK_LB_LAST_BUSY
> JK_LB_LAST_STATE
> 
> using the syntax %{JK_LB_LAST_STATE}n etc.
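
A rough sketch of how such a log could be checked, assuming a hypothetical LogFormat whose last three fields are %{JK_LB_LAST_NAME}n, %{JK_LB_LAST_STATE}n and %{JK_LB_LAST_ERRORS}n, in that order. The field layout and the helper names are assumptions for illustration:

```python
from collections import defaultdict


def error_counts_by_worker(lines):
    """Collect the JK_LB_LAST_ERRORS values seen per worker, in log order."""
    seen = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        # Assumed layout: ... <worker name> <state> <error count>
        name, _state, errors = fields[-3:]
        if errors.isdigit():  # skip entries without a numeric error count
            seen[name].append(int(errors))
    return dict(seen)


def error_count_is_constant(counts):
    """True if the error count never changed over the observed requests."""
    return len(set(counts)) <= 1
```

If the count for a worker keeps growing while it is reported as being in error, the worker is being probed and failing again, rather than simply staying untouched.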
> 
> > 
> > Another point is the learning. I read the docs, i.e. the info on the
> > Apache website. I don't find other ones (are there any?), and they do
> > not go into depth. If you read the spec and watch the logs, it is (for
> > me) very hard to match things up. I understand the many possibilities
> > mod_jk has to check whether there is a connection to the backend, but
> > checking the reality in an error situation is very hard. By matching I
> > mean "which part of the communication sequence failed, why, and which
> > error message does it cause".
> > But I will try, and also study the mailing list.
> 
> It's hard for us too (sometimes).
> 
> > Thank you for your time. Tomorrow we will have the new version and
> > will see what happens.
> > 
> > best
> > ahmed
> 
> 
> Regards,
> 
> Rainer
> 
> > -------- Original Message --------
> >> Date: Wed, 20 Feb 2008 15:56:42 +0100
> >> From: Rainer Jung <rainer.jung@kippdata.de>
> >> To: Tomcat Users List <users@tomcat.apache.org>
> >> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
> > 
> >> samk@twinix.com wrote:
> >>> See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on
> >>> behalf of a User
> >>> Hello to all. After long, unsuccessful research I hope someone can
> >>> give me a hint on the following problems.
> >>>
> >>> Our Apache/mod_jk/Tomcat infrastructure ran without problems for
> >>> about one year; then, two months ago, mod_jk errors started to occur.
> >>> We upgraded the mod_jk version and made improvements in
> >>> worker.properties. The problems changed and became less frequent, but
> >>> they still appear sometimes.
> >>>
> >>> It seems that the mod_jk workers lose the connection to their Tomcat
> >>> backend servers; there are messages in the mod_jk log files which
> >>> point in this direction. Normally this does not seem to be a big
> >>> problem, but under certain conditions (which?) the worker goes into an
> >>> error state and cannot recover by itself; this must be done manually.
> >>>
> >>> Problem 1: The Tomcats are reachable. It is unknown why the workers
> >>> think the server is dead.
> >>> Problem 2: I have no idea why the worker goes into an error state and
> >>> cannot recover.
> >>
> >> 2 is a consequence of 1
> >>
> >>> Problem 3: I miss explanations of the logged messages. I read the
> >>> messages but cannot match them to the situation. When does a worker
> >>> post these messages?
> >>
> >> 1 is a consequence of these messages
> >>
> >>> [Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi
> >>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
> >>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
> >>> [Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.
> >>
> >> The second line tells us that your configured reply_timeout fired.
> >> You set it to 120000 (2 minutes), so there are requests taking longer
> >> than 2 minutes on the backend before the first response packet comes
> >> back from the backend.
> >>
> >> With your configuration, mod_jk then doesn't wait any longer for the
> >> reply *and puts the backend into error mode*.
> >>
> >> Up until version 1.2.25, if you use a reply_timeout, you need to set it
> >> to a high number which justifies the reasoning "if it takes that long,
> >> then something is wrong with the backend".
> >>
> >> Reality shows: there is no such number. Often there are a few requests
> >> that take unacceptably long on the backend *although* the backend is
> >> still working.
> >>
> >> So in 1.2.25 we added max_reply_timeouts. With this set in addition to
> >> reply_timeout, mod_jk will abort waiting for a reply after
> >> reply_timeout, but allow some timeouts before actually deciding to put
> >> the backend into error.
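
To make the difference concrete, here is a deliberately simplified model of "tolerate some reply timeouts before going into error". It is not mod_jk's implementation: the class name is hypothetical, the exact threshold comparison is an assumption, and any ageing or resetting of the counter over time is left out:

```python
class ReplyTimeoutTracker:
    """Simplified model: tolerate some reply timeouts before marking an error.

    This is NOT mod_jk's implementation. In particular, how mod_jk ages
    or resets its counter over time is not modelled here.
    """

    def __init__(self, max_reply_timeouts: int):
        self.max_reply_timeouts = max_reply_timeouts
        self.timeouts = 0
        self.in_error = False

    def on_reply_timeout(self) -> bool:
        """Record one reply timeout; return True if the worker is now in error."""
        self.timeouts += 1
        # Assumption: error only once the tolerated number is exceeded.
        if self.timeouts > self.max_reply_timeouts:
            self.in_error = True
        return self.in_error
```

With a tolerance of 0 the first reply timeout already puts the worker into error, which corresponds to the behaviour described for a plain reply_timeout.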
> >>
> >> Unfortunately the implementation of max_reply_timeouts in 1.2.25 was
> >> wrong, so you need to go to 1.2.26 to get it working right.
> >>
> >> See:
> >>
> >> http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
> >>
> >> Caution: this does *not* explain why the backends are not automatically
> >> recovered after a minute of error condition. Maybe you have times where
> >> you get too many of those reply_timeouts (see log file), and although we
> >> recover after a minute, the backend almost immediately goes back into
> >> error status.
> >>
> >>> -> Which timeout? How does mod_jk decide that Tomcat is down? Where can
> >>> I find details on errno=110?
> >>
> >> reply_timeout, see above and also
> >>
> >> http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
> >>
> >> errno: a standard Unix feature. The numbers are platform dependent. I
> >> would assume in your case
> >>
> >> ETIMEDOUT       110     /* Connection timed out */
> >>
> >> so no wonder, that's exactly what we expect (and it doesn't tell us the
> >> reason, i.e. what's wrong on the *backend* that it takes that long for a
> >> response).
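
Since the numbers are platform dependent, one way to look up an errno value is to ask the system itself. A small sketch using Python's standard errno and os modules; the helper name is made up, and the script has to run on the same kind of platform as the httpd server for the numbers to match:

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


# The numeric value of ETIMEDOUT depends on the platform running this script.
print(describe_errno(errno.ETIMEDOUT))
```

On the web server one would call describe_errno(110) directly to see what the logged value stands for there.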
> >>
> >>> -> receiving reply from tomcat failed with out recovery in send loop
> >>> attempt=0 - what does "with out recovery in send loop" mean?
> >>
> >> That your configuration doesn't allow us to send the request to another
> >> backend. recovery_options 7 includes: if mod_jk was able to send the
> >> request to a backend, do not try to send it to another backend in case
> >> of an error during the response handling. Even if you allowed
> >> sending to another backend, it would not help with *not* putting the
> >> worker into error state. More likely, you would put all workers into
> >> error state, because all of them might run into the same timeout, one
> >> after the other.
> >>
> >>> -> unrecoverable error 504 - details on this error?
> >> That's simply how we return the situation back to the client (browser).
> >>
> >>> OK, I turned the logging level to debug. The course of events gets
> >>> clearer, but more questions appear too. There are socket numbers:
> >>> which sockets are these, and what are these numbers, e.g. "will be
> >>> shutting down socket 35 for worker INETP1021"? What are the sockets
> >>> good for? How many are there per worker? Can I configure them?
> >> Should not be the problem here. For Apache httpd, if you do *not*
> >> configure anything, we automatically choose the number of httpd threads
> >> as the maximum number of connections. No need to change anything here.
> >>> => Generally: how can I solve such problems? I tried to look into the
> >>> mod_jk code, searching for error codes and error messages, but cannot
> >>> find any relevant information. I am studying the log files but
> >>> don't find out what really happens.
> >> Post to the list. Improve our docs.
> >>
> >> The error message contains the words "timeout" and "reply", and you have
> >> a "reply_timeout".
> >>
> >> Long-running requests are a frequent problem. If you want to get rid of
> >> them, start by adding response times to your httpd and your Tomcat
> >> access log format (%D). Then have a look at which URLs are producing
> >> long-running requests, at what time of day they happen, etc. This
> >> might give you a clue about the reasons.
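
A minimal sketch of such an analysis, assuming an access log in common log format with the duration appended as the last field (for httpd, for example, a LogFormat ending in %D). The regular expression, the function name and the threshold are assumptions for illustration and need adjusting to the real log layout:

```python
import re

# Assumed layout: ... [time] "METHOD URL PROTOCOL" status bytes duration
LINE_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<duration>\d+)$'
)


def slow_requests(lines, threshold):
    """Return (time, url, duration) for requests at or above the threshold,
    slowest first. Duration and threshold use the log's own unit."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match and int(match.group("duration")) >= threshold:
            hits.append(
                (match.group("time"), match.group("url"), int(match.group("duration")))
            )
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

For an httpd log the threshold would be given in microseconds, for a Tomcat log in milliseconds. Grouping the hits by URL or by hour is a natural next step.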
> >>
> >> And if they are very frequent: do Java thread dumps of your backends and
> >> analyze them.
> >>
> >>> So, maybe someone has an idea why the worker thinks that the
> >>> corresponding Tomcat is dead, and why it will not recover by itself!
> >> Tomcat is dead: from the point of view of mod_jk it simply means: we
> >> didn't get an answer when we expected one. Details depend on the
> >> additional log lines (could not connect, reply timeout etc.).
> >>
> >>> I am also searching for tips on how I can help myself, and where to
> >>> find something about the error codes, messages, ... in mod_jk.
> >>>
> >>> thanks for your attention
> >>> Best
> >>> ahmed musa (writing from vienna)
> >>>
> >> Regards,
> >>
> >> Rainer
> >>
> >>> Current infrastructure
> >>> We have 3 Apache web servers (2.2.6) based on CentOS release 4.3,
> >>> kernel version 2.6.9-34.
> >>> In front of the web servers there are two hardware load balancers (at
> >>> two locations), but they play no role in this story.
> >>> The web servers are hosted at our ISP.
> >>>
> >>> The web servers balance the requests via mod_jk (version 1.2.25) for
> >>> approx. 10 webapps to 18 backend Tomcat servers (blade servers; because
> >>> of underlying application parts the OS is Windows 2003 Server, a long
> >>> story not worth explaining :-) ). The Tomcat servers get their data via
> >>> requests against DB2 servers/DB2 databases on the mainframe. The
> >>> Tomcat servers are in-house and are rebooted nightly because of
> >>> automated deployment processes.
> >>>
> >>> Between the web servers and the Tomcat servers is a Checkpoint firewall.
> >>> All webapps are deployed on all Tomcats; only mod_jk directs the
> >>> requests to certain Tomcat instances.
> >>> (On one blade server there are two identical Tomcat instances
> >>> running.)
> >>>
> >>> Versions: Tomcat - 5.5.17_11, JDK 1.5.0_11-b03. The requests against
> >>> the public website(s) are normal short-lived requests, and not many.
> >>> Most webapps (portals) need a login and have a strong focus on business
> >>> logic, so the instances are big (many MBs in RAM), the sessions are
> >>> sticky, and the session timeout is 20 minutes. But there are also fewer
> >>> requests. In addition to the user requests, there are monitoring
> >>> requests from our ISP.
> >>> The problems appear on servers/portals which have very few user
> >>> requests.
> >>> worker.properties
> >>> worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus
> >>>
> >>> worker.template.type=ajp13
> >>> worker.template.lbfactor=5
> >>> worker.template.socket_keepalive=1
> >>> worker.template.connect_timeout=7000
> >>> worker.template.prepost_timeout=5000
> >>> worker.template.reply_timeout=120000
> >>> worker.template.retries=6
> >>> worker.template.activation=Active
> >>> worker.template.recovery_options=7
> >>>
> >>> worker.lbtemplate.type=lb
> >>> worker.lbtemplate.max_reply_timeouts=6
> >>> worker.lbtemplate.method=Session
> >>>
> >>> # Production workers
> >>> # AS-INETP101 - 106 - 6/6 GGI
> >>> worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
> >>> worker.INETP1011.port=65001
> >>> worker.INETP1011.reference=worker.template
> >>>
> >>> ....many more of the same
> >>>
> >>> then
> >>>
> >>> worker.ajp_ad.reference=worker.lbtemplate
> >>> worker.ajp_ad.balance_workers=INETP1032,INETP1062
> >>>
> >>> .... many more portals
> >>>
> >>> and finally jkstatus
> >>>
> >>> The JKMount is very simple
> >>> JkMount /* ajp_ad    --- for the other portals mostly the same
> >>>
> >>> The Portals are Virtual Hosts on the Apache.
> >>>
> >>> Tomcat - server.xml
> >>> example
> >>> <Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
> >>>     <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
> >>> ......
> >>> <Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
> >>> autoDeploy="false" deployOnStartup="false" xmlValidation="false"
> >>> xmlNamespaceAware="false">
> >>>         <Alias>www.slfinsol.com</Alias>
> >>>         <Alias>web1.slfinsol.com</Alias>
> >>>         ...
> >>>         <Alias>testweb.slfinsol.com</Alias>
> >>>         .....
> >>>         <Valve className="org.apache.catalina.valves.AccessLogValve"
> >>> directory="logs" prefix="swl_access_log." suffix=".txt" pattern="common"
> >>> resolveHosts="false" />
> >>>         <Valve className="at.allianz.tomcat.valve.RequestTimeValve"/>
> >>>         <Valve className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
> >>>         <Context path="" docBase="swl" />
> >>>         <Context path="/monitor5" docBase="monitor" />
> >>>         <Context path="/swl" docBase="swl" />
> >>>       </Host>    
> 
> ---------------------------------------------------------------------
> To start a new topic, e-mail: users@tomcat.apache.org
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
