tomcat-users mailing list archives

From Peter Rossbach <...@objektpark.de>
Subject Re: TCPCluster problem on heavily loaded webapps
Date Mon, 25 Apr 2005 18:20:17 GMT
Hello Francois,

I have implemented a new restart algorithm in the current Tomcat 5.5 CVS HEAD.
Please compile and test it. I hope this strange failure is then gone. :-)

   <Cluster className="org.apache.catalina.cluster.tcp.SimpleTcpCluster"
            managerClassName="org.apache.catalina.cluster.session.DeltaManager"
            expireSessionsOnShutdown="false"
            notifyListenersOnReplication="false"
            notifySessionListenersOnReplication="false"
            sendAllSessions="false"
            sendAllSessionsSize="500"
            sendAllSessionsWaitTime="20"
            doClusterLog="true"
            clusterLogName="clusterlog"
            stateTransferTimeout="60">
 
       <Membership
           className="org.apache.catalina.cluster.mcast.McastService"
           mcastAddr="228.0.0.4"
           mcastPort="45564"
           mcastFrequency="500"
           mcastDropTime="3000"/>
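       <!-- Heartbeat timing (editorial comment): each node multicasts a
            heartbeat every mcastFrequency (500) ms and is dropped from
            the membership after mcastDropTime (3000) ms of silence,
            i.e. after about six missed heartbeats. -->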

       <Receiver
           className="org.apache.catalina.cluster.tcp.ReplicationListener"
           tcpListenAddress="auto"
           tcpListenPort="9015"
           tcpSelectorTimeout="100"
           tcpThreadCount="6"/>
                
       <Sender
           className="org.apache.catalina.cluster.tcp.ReplicationTransmitter"
           replicationMode="fastasyncqueue"
           doTransmitterProcessingStats="true"
           doProcessingStats="true"
           doWaitAckStats="true"
           queueTimeWait="true"
           queueDoStats="true"
           queueCheckLock="true"
           ackTimeout="15000"
           waitForAck="true"
           autoConnect="false"
           keepAliveTimeout="80000"
           keepAliveMaxRequestCount="-1"/>
 
       <Valve
           className="org.apache.catalina.cluster.tcp.ReplicationValve"
           filter=".*\.gif;.*\.js;.*\.css;.*\.png;.*\.jpeg;.*\.jpg;.*\.htm;.*\.html;.*\.txt;"
           primaryIndicator="true"/>
 
       <ClusterListener
           className="org.apache.catalina.cluster.session.ClusterSessionListener"/>
   </Cluster>

One of the known problems is that the wait-for-ack timeout is too short for
large session replication messages. This is the reason behind

                 sendAllSessions="false"
                 sendAllSessionsSize="500"
                 sendAllSessionsWaitTime="20"

With these settings the session state message can be split into several blocks.
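
For example (a bit of arithmetic, assuming the settings above): a node holding
about 3000 sessions, as in your case, would transfer its state as six blocks
of 500 sessions each, with a short wait (sendAllSessionsWaitTime="20") between
blocks, instead of one huge message that can easily outlive the ack timeout.

The ack timeout itself is just a blocking read on the replication socket. A
minimal sketch of the waitForAck pattern (illustrative only, not the actual
Tomcat code; the class and method names are made up):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;

    // Sketch: send the whole message, then block on a single read until
    // the receiver acknowledges it. If deserializing a very large
    // session-state message keeps the receiver busy for longer than
    // ackTimeout (15000 ms in the config above), the read times out and
    // the sender disconnects and retries, which is the "Wasn't able to
    // read acknowledgement" message in the log below.
    public class WaitForAckSketch {
        static void sendWithAck(Socket socket, byte[] data, int ackTimeoutMs)
                throws IOException {
            socket.setSoTimeout(ackTimeoutMs);   // read timeout = ackTimeout
            OutputStream out = socket.getOutputStream();
            out.write(data);
            out.flush();
            InputStream in = socket.getInputStream();
            int ack = in.read();  // throws SocketTimeoutException on timeout
            if (ack == -1) {
                throw new IOException("peer closed connection before ack");
            }
        }
    }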

Peter


Francois JEANMOUGIN wrote:

>Hi,
>
>I have several webapps using the TCP session cluster. It works well but
>fails for one application. If I try to restart a Tomcat (one node), I have a
>lot of problems at startup.
>
>We are using jakarta-tomcat-5.0.28 and jdk-1.5.0_02 on linux.
>
>First, it fails to receive the heartbeat:
>INFO: Replication member
>added:org.apache.catalina.cluster.mcast.McastMember[tcp://xxx.xxx.xxx.193:413
>8,xxx.xxx.xxx.193,4138, alive=26084032]
>Creating ClusterManager for context  using class
>org.apache.catalina.cluster.session.DeltaManager
>- Starting clustering manager...:
>- Wasn't able to read acknowledgement from
>server[xxx.xxx.xxx.193/xxx.xxx.xxx.193:4138] in 15000 ms. Disconnecting
>socket, and trying again.
>
>Second, it seems to fail receiving session states:
>Apr 24, 2005 9:27:28 PM
>org.apache.catalina.cluster.tcp.ReplicationTransmitter sendMessageData
>WARNING: Unable to send replicated message, is server down?
>[Here is a stack trace about a connect timeout]
>- Manager[], requesting session state from
>org.apache.catalina.cluster.mcast.McastMember[tcp://xxx.xxx.xxx.193:413
>8,xxx.xxx.xxx.193,4138, alive=26116195]. This operation will timeout if no
>session state has been received within 60 seconds
>- Manager[], No session state received, timing out.
>
>If I do a netstat, I see a lot of connections from xxx.xxx.xxx.193 (the
>active node) to xxx.xxx.xxx.191:4138 (the one which is restarting). There are
>at least 900 active sessions on the active node, sometimes up to 3000.
>
>After these errors, the application goes wild. Here is the GC activity over a
>10 s (ten seconds!) window:
>[ParNew 303700K->287850K(1048512K), 0.0119170 secs]
>[ParNew 302774K->294405K(1048512K), 0.0155040 secs]
>[ParNew 309868K->303795K(1048512K), 0.0163760 secs]
>[ParNew 318870K->311786K(1048512K), 0.0185220 secs]
>[ParNew 333775K->321247K(1048512K), 0.0131610 secs]
>[ParNew 336638K->330812K(1048512K), 0.0189300 secs]
>[GC 334574K(1048512K), 0.0054240 secs]
>[ParNew 345925K->338824K(1048512K), 0.0206210 secs]
>[ParNew 354860K->348259K(1048512K), 0.0208230 secs]
>[ParNew 367204K->354891K(1048512K), 0.0165160 secs]
>[ParNew 371126K->361510K(1048512K), 0.0195450 secs]
>[ParNew 377575K->369064K(1048512K), 0.0210910 secs]
>[ParNew 384706K->375691K(1048512K), 0.0207330 secs]
>[ParNew 390910K->382326K(1048512K), 0.0220530 secs]
>[ParNew 398898K->387537K(1048512K), 0.0192510 secs]
>[ParNew 403202K->397029K(1048512K), 0.0248840 secs]
>[ParNew 413248K->403687K(1048512K), 0.0232250 secs]
>[ParNew 419890K->410353K(1048512K), 0.0251610 secs]
>[ParNew 426571K->415591K(1048512K), 0.0213720 secs]
>[ParNew 431847K->425121K(1048512K), 0.0265630 secs]
>[ParNew 440990K->432771K(1048512K), 0.0268060 secs]
>[ParNew 449594K->439468K(1048512K), 0.0249090 secs]
>[ParNew 453366K->447615K(1048512K), 0.0283870 secs]
>[ParNew 463476K->455729K(1048512K), 0.0270220 secs]
>[ParNew 470179K->463891K(1048512K), 0.0293800 secs]
>[ParNew 479288K->473471K(1048512K), 0.0306060 secs]
>[ParNew 489550K->481662K(1048512K), 0.0296620 secs]
>[ParNew 497119K->491253K(1048512K), 0.0311730 secs]
>[ParNew 507243K->500859K(1048512K), 0.0331790 secs]
>[ParNew 516452K->510515K(1048512K), 0.0334260 secs]
>[ParNew 526102K->520132K(1048512K), 0.0340760 secs]
>[ParNew 535196K->526900K(1048512K), 0.0319520 secs]
>[ParNew 542902K->532676K(1048512K), 0.0329050 secs]
>[ParNew 548181K->540914K(1048512K), 0.0343770 secs]
>[ParNew 556897K->550561K(1048512K), 0.0363350 secs]
>[ParNew 566628K->561217K(1048512K), 0.0384490 secs]
>[ParNew 577272K->567013K(1048512K), 0.0333970 secs]
>[ParNew 582211K->573818K(1048512K), 0.0354440 secs]
>[ParNew 590032K->580630K(1048512K), 0.0357870 secs]
>[ParNew 594320K->588455K(1048512K), 0.0377730 secs]
>[ParNew 604091K->598148K(1048512K), 0.0395440 secs]
>[ParNew 614317K->607850K(1048512K), 0.0390720 secs]
>[ParNew 622154K->616157K(1048512K), 0.0396040 secs]
>[ParNew 632373K->624406K(1048512K), 0.0394080 secs]
>[ParNew 640147K->631259K(1048512K), 0.0395380 secs]
>[ParNew 646991K->640992K(1048512K), 0.0432430 secs]
>[ParNew 657221K->649246K(1048512K), 0.0416180 secs]
>[ParNew 664557K->657134K(1048512K), 0.0431080 secs]
>[ParNew 673390K->666974K(1048512K), 0.0448190 secs]
>[ParNew 683230K->673894K(1048512K), 0.0416870 secs]
>[ParNew 689214K->682172K(1048512K), 0.0438750 secs]
>[ParNew 698170K->690556K(1048512K), 0.0431870 secs]
>[ParNew 706236K->695977K(1048512K), 0.0403150 secs]
>[ParNew 711914K->705771K(1048512K), 0.0471480 secs]
>[ParNew 721819K->714181K(1048512K), 0.0472770 secs]
>[ParNew 730118K->723989K(1048512K), 0.0489510 secs]
>[ParNew 739856K->728033K(1048512K), 0.0427410 secs]
>[ParNew 744981K->737939K(1048512K), 0.0475430 secs]
>[ParNew 753970K->744992K(1048512K), 0.0492700 secs]
>[ParNew 760265K->751943K(1048512K), 0.0487280 secs]
>[ParNew 767510K->760407K(1048512K), 0.0492930 secs]
>[ParNew 774921K->767371K(1048512K), 0.0497330 secs]
>[ParNew 781767K->777358K(1048512K), 0.0515280 secs]
>[ParNew 796187K->785857K(1048512K), 0.0489430 secs]
>[ParNew 802012K->791018K(1048512K), 0.0492380 secs]
>[ParNew 806717K->798023K(1048512K), 0.0508110 secs]
>[ParNew 813756K->805026K(1048512K), 0.0502660 secs]
>[ParNew 820300K->813410K(1048512K), 0.0523530 secs]
>[ParNew 828669K->820421K(1048512K), 0.0531150 secs]
>[ParNew 835723K->828524K(1048512K), 0.0555550 secs]
>[ParNew 843832K->837071K(1048512K), 0.0562570 secs]
>[ParNew 853327K->847000K(1048512K), 0.0570410 secs]
>[ParNew 866156K->856950K(1048512K), 0.0548350 secs]
>[ParNew 872116K->864444K(1048512K), 0.0551520 secs]
>[ParNew 880555K->874405K(1048512K), 0.0588060 secs]
>[ParNew 891761K->884374K(1048512K), 0.0614500 secs]
>[ParNew 901149K->892649K(1048512K), 0.0604340 secs]
>[ParNew 909993K->899791K(1048512K), 0.0571610 secs]
>[ParNew 915463K->905058K(1048512K), 0.0550040 secs]
>[ParNew 921114K->915056K(1048512K), 0.0633450 secs]
>[ParNew 930810K->928141K(1048512K), 0.0671490 secs]
>[ParNew 943358K->932331K(1048512K), 0.0593550 secs]
>[ParNew 947614K->936528K(1048512K), 0.0604400 secs]
>[ParNew 952766K->946550K(1048512K), 0.0666420 secs]
>[ParNew 962141K->955223K(1048512K), 0.0655340 secs]
>[ParNew 971115K->962350K(1048512K), 0.0617730 secs]
>[ParNew 978155K->969481K(1048512K), 0.0618740 secs]
>[ParNew 985539K->979532K(1048512K), 0.0669180 secs]
>[ParNew 995150K->983764K(1048512K), 0.0599040 secs]
>[ParNew 998815K->990921K(1048512K), 0.0688710 secs]
>[ParNew 1004588K->996963K(1048512K), 0.0645710 secs]
>[ParNew 1015805K->1008164K(1048512K), 0.0683150 secs]
>[ParNew 1023895K->1015340K(1048512K), 0.0679410 secs]
>[ParNew 1031291K->1026561K(1048512K), 0.0752900 secs]
>[Full GC 1042324K->261308K(1048512K), 1.7138170 secs]
>Again and again.
>
>Sometimes, after a while, the cluster falls back into order. Then I don't
>touch it anymore and go to church... The other replicated applications (on
>the same Tomcat instance) deploy fine, but only after waiting for this one to
>fail (deployment is not a parallel process, I see)...
>
>Well, I circumvented the problem by using an asynchronous replication mode
>for this site. So, what I understand is that there is something wrong in the
>way the heartbeat is managed. If the active node is busy trying to replicate
>the sessions (pooled mode), it doesn't answer the heartbeat. On the other
>side, the starting node, which is receiving about 15 connections on its
>replication port, does not deduce that the other node is alive.
>
>Using asynchronous replication lets the active node answer the heartbeat
>requests, and everything goes well.
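
The difference described here comes down to which thread pays for the network
wait. A minimal sketch of a queue-based asynchronous sender (illustrative
only, not the Tomcat implementation; all names are made up):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch: the caller only enqueues the message; a background thread
    // performs the slow network send and the ack wait. The node therefore
    // stays responsive (e.g. to heartbeats) even while replication is
    // slow, which matches the observation that asynchronous mode avoids
    // the hang.
    public class AsyncSenderSketch {
        private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

        public AsyncSenderSketch() {
            Thread sender = new Thread(() -> {
                try {
                    while (true) {
                        byte[] msg = queue.take(); // blocks only this thread
                        transmit(msg);             // network send + ack wait
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            sender.setDaemon(true);
            sender.start();
        }

        // Called from the request thread: returns immediately.
        public void send(byte[] msg) {
            queue.offer(msg);
        }

        private void transmit(byte[] msg) {
            // the actual socket I/O would go here
        }
    }

With a synchronous (pooled) sender the request thread itself blocks until the
ack arrives, which is exactly the window in which, per the hypothesis above,
heartbeats go unanswered.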
>
>So, I am open to any suggestions (including migration from 5.0.28 to 5.5.x)
>that would correct this bug (or misfeature).
>
>Here is the configuration we are using:
>
><Cluster className="org.apache.catalina.cluster.tcp.SimpleTcpCluster"
>         managerClassName="org.apache.catalina.cluster.session.DeltaManager"
>         debug="10"
>         expireSessionsOnShutdown="false"
>         useDirtyFlag="true">
>    <Membership
>        className="org.apache.catalina.cluster.mcast.McastService"
>        mcastAddr="228.0.0.137"
>        mcastPort="45601"
>        mcastFrequency="500"
>        mcastDropTime="3000"/>
>    <Receiver
>        className="org.apache.catalina.cluster.tcp.ReplicationListener"
>        tcpListenAddress="auto"
>        tcpListenPort="4138"
>        tcpSelectorTimeout="100"
>        tcpThreadCount="50"/>
>    <Sender
>        className="org.apache.catalina.cluster.tcp.ReplicationTransmitter"
>        replicationMode="asynchronous"/>
>    <Valve className="org.apache.catalina.cluster.tcp.ReplicationValve"
>           filter=".*\.gif;.*\.js;.*\.jpg;.*\.htm;.*\.html;.*\.txt;.*\.css;.*\.swf;"/>
></Cluster>
>
>Any help appreciated.
>
>François.
>




---------------------------------------------------------------------
To unsubscribe, e-mail: tomcat-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: tomcat-user-help@jakarta.apache.org

