tomcat-users mailing list archives

From "Ahmed Musa" <donald1...@gmx.at>
Subject Re: RE: mod_jk Problems - - worker went to error state and dont recover
Date Thu, 21 Feb 2008 09:32:31 GMT
Hello Luke,

Here is the information from tomcat.apache.org:

Unsubscription: Send a blank email to users-unsubscribe@tomcat.apache.org
Digest unsubscription: Send a blank email to users-digest-unsubscribe@tomcat.apache.org

best ahmed

-------- Original Message --------
> Date: Thu, 21 Feb 2008 09:27:31 -0000
> From: luke.walshe@bt.com
> To: users@tomcat.apache.org
> Subject: RE: mod_jk Problems - - worker went to error state and dont recover

> All
> 
> Apologies, this is unrelated. How do I unsubscribe from this mailing
> list? I thought it would be useful and small, but it's overwhelming my
> inbox.
> 
> Thanks in Advance.
> 
> Luke Walshe
> BT Operate, HGIPCC Technical Specialist
> Telephone: +44 (0)1314483482, Email: Luke.Walshe@bt.com 
> 
> -----Original Message-----
> From: Ahmed Musa [mailto:donald1090@gmx.at] 
> Sent: 21 February 2008 09:25
> To: Tomcat Users List
> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
> 
> Hello Rainer,
> Thanks for the information; the situation is clearer now.
> I will reread some of the docs, following your links, and will run further
> tests, also with the improved logging.
> Thanks a lot for your time.
> With best regards,
> ahmed
> 
> -------- Original Message --------
> > Date: Wed, 20 Feb 2008 18:59:01 +0100
> > From: Rainer Jung <rainer.jung@kippdata.de>
> > To: Tomcat Users List <users@tomcat.apache.org>
> > Subject: Re: mod_jk Problems - - worker went to error state and dont recover
> 
> > Ahmed Musa wrote:
> > > Hello,
> > > Wow, thank you very much, Rainer, for your very quick and informative
> > > answer.
> > > I will go to 1.2.26 and think about some "smoother" values for
> > > reply_timeout and max_reply_timeouts.
> > > I will search for the requests which cause the problems, because I am
> > > still logging the response time in the way you mentioned, but I am not
> > > sure that the user requests are responsible for the situation.
> > 
> > One note: for Apache httpd 2.x %D is microseconds (there is no format 
> > for milliseconds), for Tomcat %D is milliseconds. As long as you are 
> > searching for the root cause, it might make sense to have both access 
> > logs active to check for duration differences.
> > 
> > > So one further question: does mod_jk itself check whether the backend is
> > > reachable, without user requests?
> > 
> > No. Everything only works on top of user requests.
> > 
> > > When there are connections to the backend, are they closed after the
> > > response, or are they held open for further requests?
> > 
> > In general they are held open. There are parameters for how long they are
> > held open without more requests before they get shut down, and also for
> > how many might be kept open even when no requests are coming in. Those
> > are the connection pool parameters, which you will find at
> > 
> > http://tomcat.apache.org/connectors-doc/reference/workers.html
> > 
> > Tomcat also has a connectionTimeout on the connector, which will shut 
> > down a connection from the Tomcat side if it is idle for too long.
> > 
> > If you don't want to reuse connections at all, there's also a setting
> > for that (a JkOption in Apache).
> > 
> > > Is it possible that the Checkpoint firewall in between is
> > > responsible for the connectivity problem?
> > 
> > It can cut a connection that's idle for too long. Since you have 
> > cping/cpong active via connect_timeout and prepost_timeout, you should
> > get a cping error message if the connection was dropped by the firewall
> > during idle times and mod_jk tries to use it again. The reply timeout in
> > the error log indicates that the backend isn't answering. Of course, if
> > it takes *very* long to answer, it might be that the firewall dropped 
> > the connection in between, but then the root cause would still be the 
> > long response time of the backend.
> > 
> > > Another point is the "not recovering" of the worker. Yes, you are right:
> > > in this situation I have many reply_timeouts, but these happen within a
> > > period of time, for example 30 minutes, and the worker is still dead even
> > > when there are no more reply_timeouts. It remains dead.
> > > It was necessary to restart it manually via jkstatus.
> > 
> > I assume you are using stickiness, so when a session has started on a node,
> > it will stay there. So when a worker is in error for a long time, all 
> > new sessions will start on other nodes. Once the worker is ready for 
> > recovery, it needs a request that doesn't carry a session, so that it can
> > be probed with this request.
> > 
> > In jkstatus, the status of a worker in error should switch to REC when 
> > mod_jk decides that it could send a non-sticky request there (to probe),
> > then to PRB while this request is on the node, and finally 
> > either to OK or back to ERR, depending on the result of the request.
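
The state sequence described above can be sketched as a small transition table. This is only a toy model for reading jkstatus output, not mod_jk's implementation; the event names (`recovery_time_reached`, `non_sticky_request_sent`, and so on) are hypothetical labels invented for the illustration.

```python
from enum import Enum


class State(Enum):
    OK = "OK"
    ERR = "ERR"
    REC = "REC"
    PRB = "PRB"


# Hypothetical event names; only the state names come from jkstatus.
TRANSITIONS = {
    (State.OK, "request_failed"): State.ERR,
    (State.ERR, "recovery_time_reached"): State.REC,
    (State.REC, "non_sticky_request_sent"): State.PRB,
    (State.PRB, "probe_succeeded"): State.OK,
    (State.PRB, "probe_failed"): State.ERR,
}


def next_state(state, event):
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def replay(events, start=State.OK):
    """Apply the events in order and return every state visited, including the start."""
    state = start
    history = [state]
    for event in events:
        state = next_state(state, event)
        history.append(state)
    return history
```

In this model a worker that has reached REC stays there until a request without a session arrives, which matches the point about sticky sessions: on a portal with very little traffic the probe may simply never happen.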
> > 
> > You can log the number of errors (and accesses) that happened on the 
> > node in the httpd access log. If you think that the node simply stays in
> > error for a long time, then the error count (and access count) should 
> > stay constant. I would expect that they do not.
> > 
> > Have a look at how LogFormat in Apache httpd works, and then add some of
> > the notes documented in
> > 
> > http://tomcat.apache.org/connectors-doc/reference/apache.html
> > 
> > like:
> > 
> > JK_LB_LAST_NAME
> > JK_LB_LAST_ACCESSED
> > JK_LB_LAST_ERRORS
> > JK_LB_LAST_BUSY
> > JK_LB_LAST_STATE
> > 
> > using the syntax %{JK_LB_LAST_STATE}n etc.
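
A minimal sketch of the "does the error count stay constant?" check, assuming the httpd LogFormat was extended so that every access log line ends with `%{JK_LB_LAST_NAME}n %{JK_LB_LAST_STATE}n %{JK_LB_LAST_ERRORS}n %{JK_LB_LAST_ACCESSED}n`. That field order is an assumption made for this example; adapt the parsing to whatever format is actually configured.

```python
from collections import defaultdict


def error_counts_by_worker(lines):
    """Collect the logged JK_LB_LAST_ERRORS values per worker, in log order.

    Assumes each line ends with four space-separated fields:
    worker name, state, error count, access count (an assumed layout).
    """
    counts = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        name, _state, errors, _accessed = fields[-4:]
        if not errors.isdigit():
            continue  # httpd logs "-" when a note is not set for the request
        counts[name].append(int(errors))
    return dict(counts)


def workers_with_changing_errors(lines):
    """Return the names of workers whose logged error count did not stay constant."""
    return sorted(
        name
        for name, values in error_counts_by_worker(lines).items()
        if len(set(values)) > 1
    )
```

Feeding the function the lines of an access log (for example `workers_with_changing_errors(open(path))`, with `path` pointing at the log file) lists the workers that kept collecting errors during the period in question.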
> > 
> > > 
> > > Another point is the learning. I read the docs, meaning the information
> > > on the Apache website (I didn't find any others; are there others?), and
> > > they do not go into depth. If you read the spec and watch the logs, it is
> > > (for me) very hard to match things up. I understand the many ways mod_jk
> > > has of checking whether there is a connection to the backend, but checking
> > > the reality in an error situation is very hard. By matching I mean: "Which
> > > part of the communication sequence failed, why, and which error message
> > > does it cause?"
> > > But I will try, and I will also study the mailing list.
> > 
> > It's hard for us too (sometimes).
> > 
> > > Thank you for your time. Tomorrow we will have the new version and will
> > > see what happens.
> > > 
> > > best
> > > ahmed
> > 
> > 
> > Regards,
> > 
> > Rainer
> > 
> > > -------- Original Message --------
> > >> Date: Wed, 20 Feb 2008 15:56:42 +0100
> > >> From: Rainer Jung <rainer.jung@kippdata.de>
> > >> To: Tomcat Users List <users@tomcat.apache.org>
> > >> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
> > > 
> > >> samk@twinix.com wrote:
> > >>> See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on
> > >>> behalf of a User
> > >>> Hello to all. After long, unsuccessful research, I hope someone can
> > >>> give me a hint about the following problems.
> > >>>
> > >>> Our Apache/mod_jk/Tomcat infrastructure ran without problems
> > >>> for about one year; then, two months ago, mod_jk errors started to occur.
> > >>> We upgraded the mod_jk version and made improvements in
> > >>> worker.properties. The problems changed and became fewer, but sometimes
> > >>> they still appear.
> > >>>
> > >>> It seems that the mod_jk workers lose the connection to their
> > >>> Tomcat backend servers; there are messages in the mod_jk log files which
> > >>> point in this direction. Normally this does not seem to be a big problem,
> > >>> but under certain conditions (which?) the worker goes into an error state
> > >>> and cannot recover by itself; recovery must be done manually.
> > >>>
> > >>> Problem 1: The Tomcats are reachable. It is unknown why the workers think
> > >>> the server is dead.
> > >>> Problem 2: I have no idea why the worker goes into an error state and
> > >>> cannot recover.
> > >>
> > >> 2 is a consequence of 1
> > >>
> > >>> Problem 3: I miss explanations of the logged messages. I read the messages
> > >>> but cannot match them to the situation. When does a worker post these
> > >>> messages?
> > >>
> > >> 1 is a consequence of these messages
> > >>
> > >>> [Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi
> > >>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
> > >>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
> > >>> [Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.
> > >>
> > >> The second line tells us that your configured reply_timeout fired.
> > >> You set it to 120000 (2 minutes), so there are requests taking longer
> > >> than 2 minutes on the backend before the first response packet comes
> > >> back from the backend.
> > >>
> > >> With your configuration, mod_jk then doesn't wait any longer for the reply
> > >> *and puts the backend into error mode*.
> > >>
> > >> Up until version 1.2.25, if you use a reply_timeout, you need to set it
> > >> to a high number which justifies the reasoning "if it takes that long,
> > >> then something is wrong with the backend".
> > >>
> > >> Reality shows: there is no such number. Often there are a few requests
> > >> that take unacceptably long on the backend *although* the backend is
> > >> still working.
> > >>
> > >> So in 1.2.25 we added max_reply_timeouts. With this set in addition to
> > >> reply_timeout, mod_jk will abort waiting for a reply after 
> > >> reply_timeout, but allow some timeouts before actually deciding to put
> > >> the backend into error.
> > >>
> > >> Unfortunately the implementation of max_reply_timeouts in 1.2.25 was
> > >> wrong, so you need to go to 1.2.26 to get it working right.
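
The idea behind max_reply_timeouts can be illustrated with a small counter. This is a toy model, not mod_jk's code: the class and method names are invented, and the decay step (halving the counter about once a minute during maintenance) is an assumption based on a reading of the workers.properties reference that should be checked against the documentation for the installed version.

```python
class ReplyTimeoutCounter:
    """Toy model: tolerate a few reply timeouts before marking a worker as failed."""

    def __init__(self, max_reply_timeouts):
        self.max_reply_timeouts = max_reply_timeouts
        self.count = 0

    def record_timeout(self):
        """Register one reply timeout; return True if the worker should go into error."""
        self.count += 1
        return self.count > self.max_reply_timeouts

    def maintain(self):
        """Periodic decay step (assumed behaviour: halve the counter)."""
        self.count //= 2
```

With `max_reply_timeouts=6` as in the posted configuration, the model only reports an error on the seventh timeout in a row, and occasional timeouts spread over a long period never add up to an error because the counter decays in between.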
> > >>
> > >> See:
> > >>
> > >> http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
> > >>
> > >> Caution: this does *not* explain why the backends are not automatically
> > >> recovered after a minute of error condition. Maybe you have times where
> > >> you get too many of those reply_timeouts (see log file), and although we
> > >> recover after a minute, the backend almost immediately goes back into
> > >> error status.
> > >>
> > >>> -> Which timeout? How does mod_jk conclude that Tomcat is down? Where can
> > >>> I find details on errno=110?
> > >>
> > >> reply_timeout, see above and also
> > >>
> > >> http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
> > >>
> > >> errno: a standard Unix feature. The numbers are platform dependent. I
> > >> would assume in your case
> > >>
> > >> ETIMEDOUT       110     /* Connection timed out */
> > >>
> > >> so no wonder, that's exactly what we expect (and it doesn't tell us the
> > >> reason, i.e. what's wrong on the *backend* that takes that long for a
> > >> response).
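
Because errno numbers are platform dependent, a quick way to look one up is to ask the platform itself. A minimal sketch using Python's standard `errno` and `os` modules; it has to run on the same kind of system as the httpd server (here Linux) for the number to mean the same thing. The helper name `describe_errno` is made up for this example.

```python
import errno
import os


def describe_errno(number):
    """Return (symbolic name, message) for an errno value on the current platform."""
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


# Look up the value from the log line; on Linux 110 is expected to be ETIMEDOUT.
print(describe_errno(110))
# The reverse direction: which number does this platform use for ETIMEDOUT?
print(errno.ETIMEDOUT)
```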
> > >>
> > >>> -> receiving reply from tomcat failed with out recovery in send loop
> > >>> attempt=0: what does "with out recovery in send loop" mean?
> > >>
> > >> That your configuration doesn't allow us to send the request to another
> > >> backend. recovery_options 7 includes: if mod_jk was able to send the
> > >> request to a backend, do not try to send it to another backend in case
> > >> of an error during the response handling. Even if you allowed
> > >> sending to another backend, it would not help with *not* putting the
> > >> worker into error state. More likely, you would put all
> > >> workers into error state, because all of them might run into the same
> > >> timeout, one after the other.
> > >>
> > >>> -> unrecoverable error 504: details on this error?
> > >> That's simply how we report the situation back to the client (browser).
> > >>
> > >>> OK, I turned the logging level to debug. The course of events gets
> > >>> clearer, but more questions also appear. There are socket numbers:
> > >>> which sockets are these, and what do the numbers mean, e.g. "will be
> > >>> shutting down socket 35 for worker INETP1021"? What are the sockets
> > >>> good for? How many are there per worker? Can I configure them?
> > >> Should not be the problem here. For Apache httpd, if you do *not*
> > >> configure anything, we automatically choose the number of httpd threads
> > >> as the maximum number of connections. No need to change anything here.
> > >>> => Generally: how can I solve such problems? I tried to look into
> > >>> the mod_jk code, searching for error codes and error messages, but could
> > >>> not find any relevant information. I am studying the log files but
> > >>> can't find out what really happens.
> > >> Post to the list. Improve our docs.
> > >>
> > >> The error message contains the words "timeout" and "reply", and you have
> > >> a "reply_timeout".
> > >>
> > >> Long-running requests are a frequent problem. If you want to get rid of
> > >> them, start by adding response times to your httpd and your Tomcat
> > >> access log format (%D). Then have a look at which URLs are producing
> > >> long-running requests, at what time of day they are happening, etc. This
> > >> might give you a clue about the reasons.
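
A minimal sketch of such a search through an httpd access log, assuming a LogFormat like `%h %l %u %t \"%r\" %>s %b %D`, i.e. the common log format with the %D duration (microseconds in httpd) appended as the last field. The format, the regular expression, and the 10-second threshold are assumptions for illustration; for a Tomcat access log the last field would be in milliseconds and the conversion would need to change.

```python
import re
from collections import Counter

# Expected shape (assumed):
# host - - [20/Feb/2008:10:02:39 +0100] "GET /path HTTP/1.1" 200 123 120000456
LINE_RE = re.compile(
    r'\[[^:\]]+:(?P<hour>\d{2}):[^\]]*\] '
    r'"\S+ (?P<url>\S+)[^"]*" '
    r'\d{3} \S+ (?P<micros>\d+)$'
)


def slow_requests(lines, threshold_seconds=10.0):
    """Count requests at or above the threshold, keyed by (hour of day, URL path)."""
    slow = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # line does not follow the assumed format
        seconds = int(match.group("micros")) / 1_000_000  # httpd %D is microseconds
        if seconds >= threshold_seconds:
            path = match.group("url").split("?", 1)[0]  # group by path, drop query string
            slow[(match.group("hour"), path)] += 1
    return slow
```

Calling `slow_requests(open(path)).most_common(20)`, with `path` pointing at the access log, gives a first ranking of which URLs are slow and at what hour of the day.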
> > >>
> > >> And if they are very frequent: do Java thread dumps of your backends and
> > >> analyze them.
> > >>
> > >>> So maybe someone has an idea why the worker thinks that the
> > >>> corresponding Tomcat is dead, and why it will not recover by itself.
> > >> Tomcat is dead: from the point of view of mod_jk it simply means that we
> > >> didn't get an answer when we expected one. Details depend on the 
> > >> additional log lines (could not connect, reply timeout etc.).
> > >>
> > >>> And I am also searching for tips on how I can help myself, and where to
> > >>> find something about the error codes, messages, etc. in mod_jk.
> > >>>
> > >>> thanks for your attention
> > >>> Best
> > >>> ahmed musa (writing from vienna)
> > >>>
> > >> Regards,
> > >>
> > >> Rainer
> > >>
> > >>> Current infrastructure
> > >>> We have 3 Apache web servers (2.2.6) based on CentOS release 4.3,
> > >>> kernel version 2.6.9-34.
> > >>> In front of the web servers there are two hardware load balancers (at two
> > >>> locations), but they play no role in this story.
> > >>> The web servers are hosted at our ISP.
> > >>>
> > >>> The web servers balance the requests via mod_jk (version 1.2.25) for
> > >>> approx. 10 webapps to 18 backend Tomcat servers (blade servers; because
> > >>> of underlying application parts the OS is Windows 2003 Server, a long
> > >>> story not worth explaining :-) ). The Tomcat servers get their data via
> > >>> requests against DB2 servers/DB2 databases on the mainframe. The
> > >>> Tomcat servers are in-house and are rebooted nightly because of automated
> > >>> deployment processes.
> > >>>
> > >>> Between the web servers and the Tomcat servers is a Checkpoint firewall.
> > >>> All webapps are deployed on all Tomcats; only mod_jk directs the
> > >>> requests to certain Tomcat instances.
> > >>> (On one blade server there are two identical Tomcat instances
> > >>> running.)
> > >>>
> > >>> Versions: Tomcat 5.5.17_11, JDK 1.5.0_11-b03. The requests against
> > >>> the public website(s) are normal short-lived requests, and not many.
> > >>> Most webapps (portals) need a login and have a strong focus on business
> > >>> logic, so the instances are big (many MBs in RAM), the sessions are
> > >>> sticky, and the session timeout is 20 minutes. But there are also fewer
> > >>> requests. In addition to the user requests, there are monitoring requests
> > >>> from our ISP.
> > >>> The problems appear on servers/portals which have very few user requests.
> > >>> worker.properties
> > >>> worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus
> > >>>
> > >>> worker.template.type=ajp13
> > >>> worker.template.lbfactor=5
> > >>> worker.template.socket_keepalive=1
> > >>> worker.template.connect_timeout=7000
> > >>> worker.template.prepost_timeout=5000
> > >>> worker.template.reply_timeout=120000
> > >>> worker.template.retries=6
> > >>> worker.template.activation=Active
> > >>> worker.template.recovery_options=7
> > >>>
> > >>> worker.lbtemplate.type=lb
> > >>> worker.lbtemplate.max_reply_timeouts=6
> > >>> worker.lbtemplate.method=Session
> > >>>
> > >>> #Produktions Worker
> > >>> # AS-INETP101 - 106 - 6/6 GGI
> > >>> worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
> > >>> worker.INETP1011.port=65001
> > >>> worker.INETP1011.reference=worker.template
> > >>>
> > >>> ....many more of the same
> > >>>
> > >>> then
> > >>>
> > >>> worker.ajp_ad.reference=worker.lbtemplate
> > >>> worker.ajp_ad.balance_workers=INETP1032,INETP1062
> > >>>
> > >>> .... many more portals
> > >>>
> > >>> and lastly jkstatus
> > >>>
> > >>> The JkMount is very simple:
> > >>> JkMount /* ajp_ad    --- mostly the same for the other portals
> > >>>
> > >>> The portals are virtual hosts on the Apache.
> > >>>
> > >>> Tomcat - server.xml
> > >>> example
> > >>> <Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
> > >>>     <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
> > >>> ......
> > >>> <Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
> > >>> autoDeploy="false" deployOnStartup="false" xmlValidation="false"
> > >>> xmlNamespaceAware="false">
> > >>>         <Alias>www.slfinsol.com</Alias>
> > >>>         <Alias>web1.slfinsol.com</Alias>
> > >>>         ...
> > >>>         <Alias>testweb.slfinsol.com</Alias>
> > >>>         .....
> > >>>         <Valve className="org.apache.catalina.valves.AccessLogValve"
> > >>>                directory="logs" prefix="swl_access_log." suffix=".txt"
> > >>>                pattern="common" resolveHosts="false" />
> > >>>         <Valve className="at.allianz.tomcat.valve.RequestTimeValve"/>
> > >>>         <Valve className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
> > >>>         <Context path="" docBase="swl" />
> > >>>         <Context path="/monitor5" docBase="monitor" />
> > >>>         <Context path="/swl" docBase="swl" />
> > >>>       </Host>    
> > 
> > ---------------------------------------------------------------------
> > To start a new topic, e-mail: users@tomcat.apache.org
> > To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> > For additional commands, e-mail: users-help@tomcat.apache.org
> 
