tomcat-users mailing list archives

From "Francois JEANMOUGIN" <Francois.JEANMOU...@123multimedia.com>
Subject TCPCluster problem on heavily loaded webapps
Date Mon, 25 Apr 2005 12:44:01 GMT
Hi,

I have several webapps using the TCP session cluster. It works well for all
but one application. If I try to restart Tomcat on one node, I get a lot of
problems at startup.

We are using jakarta-tomcat-5.0.28 and jdk-1.5.0_02 on linux.

First, it fails to receive HeartBeat.
INFO: Replication member
added:org.apache.catalina.cluster.mcast.McastMember[tcp://xxx.xxx.xxx.193:4138,
xxx.xxx.xxx.193,4138, alive=26084032]
Creating ClusterManager for context  using class
org.apache.catalina.cluster.session.DeltaManager
- Starting clustering manager...:
- Wasn't able to read acknowledgement from
server[xxx.xxx.xxx.193/xxx.xxx.xxx.193:4138] in 15000 ms. Disconnecting
socket, and trying again.

Second, it seems to fail to receive session state:
Apr 24, 2005 9:27:28 PM
org.apache.catalina.cluster.tcp.ReplicationTransmitter sendMessageData
WARNING: Unable to send replicated message, is server down?
[Here is a stack trace about a connect timeout]
- Manager[], requesting session state from
org.apache.catalina.cluster.mcast.McastMember[tcp://xxx.xxx.xxx.193:4138,
xxx.xxx.xxx.193,4138, alive=26116195]. This operation will timeout if no
session state has been received within 60 seconds
- Manager[], No session state received, timing out.

If I run netstat, I see a lot of connections from xxx.xxx.xxx.193 (the
active node) to xxx.xxx.xxx.191:4138 (the one which is restarting). There are
at least 900 active sessions on the active node, and up to 3000.

After these errors, the application goes wild. Here is 10 s (ten seconds!)
of GC activity:
[ParNew 303700K->287850K(1048512K), 0.0119170 secs]
[ParNew 302774K->294405K(1048512K), 0.0155040 secs]
[ParNew 309868K->303795K(1048512K), 0.0163760 secs]
[ParNew 318870K->311786K(1048512K), 0.0185220 secs]
[ParNew 333775K->321247K(1048512K), 0.0131610 secs]
[ParNew 336638K->330812K(1048512K), 0.0189300 secs]
[GC 334574K(1048512K), 0.0054240 secs]
[ParNew 345925K->338824K(1048512K), 0.0206210 secs]
[ParNew 354860K->348259K(1048512K), 0.0208230 secs]
[ParNew 367204K->354891K(1048512K), 0.0165160 secs]
[ParNew 371126K->361510K(1048512K), 0.0195450 secs]
[ParNew 377575K->369064K(1048512K), 0.0210910 secs]
[ParNew 384706K->375691K(1048512K), 0.0207330 secs]
[ParNew 390910K->382326K(1048512K), 0.0220530 secs]
[ParNew 398898K->387537K(1048512K), 0.0192510 secs]
[ParNew 403202K->397029K(1048512K), 0.0248840 secs]
[ParNew 413248K->403687K(1048512K), 0.0232250 secs]
[ParNew 419890K->410353K(1048512K), 0.0251610 secs]
[ParNew 426571K->415591K(1048512K), 0.0213720 secs]
[ParNew 431847K->425121K(1048512K), 0.0265630 secs]
[ParNew 440990K->432771K(1048512K), 0.0268060 secs]
[ParNew 449594K->439468K(1048512K), 0.0249090 secs]
[ParNew 453366K->447615K(1048512K), 0.0283870 secs]
[ParNew 463476K->455729K(1048512K), 0.0270220 secs]
[ParNew 470179K->463891K(1048512K), 0.0293800 secs]
[ParNew 479288K->473471K(1048512K), 0.0306060 secs]
[ParNew 489550K->481662K(1048512K), 0.0296620 secs]
[ParNew 497119K->491253K(1048512K), 0.0311730 secs]
[ParNew 507243K->500859K(1048512K), 0.0331790 secs]
[ParNew 516452K->510515K(1048512K), 0.0334260 secs]
[ParNew 526102K->520132K(1048512K), 0.0340760 secs]
[ParNew 535196K->526900K(1048512K), 0.0319520 secs]
[ParNew 542902K->532676K(1048512K), 0.0329050 secs]
[ParNew 548181K->540914K(1048512K), 0.0343770 secs]
[ParNew 556897K->550561K(1048512K), 0.0363350 secs]
[ParNew 566628K->561217K(1048512K), 0.0384490 secs]
[ParNew 577272K->567013K(1048512K), 0.0333970 secs]
[ParNew 582211K->573818K(1048512K), 0.0354440 secs]
[ParNew 590032K->580630K(1048512K), 0.0357870 secs]
[ParNew 594320K->588455K(1048512K), 0.0377730 secs]
[ParNew 604091K->598148K(1048512K), 0.0395440 secs]
[ParNew 614317K->607850K(1048512K), 0.0390720 secs]
[ParNew 622154K->616157K(1048512K), 0.0396040 secs]
[ParNew 632373K->624406K(1048512K), 0.0394080 secs]
[ParNew 640147K->631259K(1048512K), 0.0395380 secs]
[ParNew 646991K->640992K(1048512K), 0.0432430 secs]
[ParNew 657221K->649246K(1048512K), 0.0416180 secs]
[ParNew 664557K->657134K(1048512K), 0.0431080 secs]
[ParNew 673390K->666974K(1048512K), 0.0448190 secs]
[ParNew 683230K->673894K(1048512K), 0.0416870 secs]
[ParNew 689214K->682172K(1048512K), 0.0438750 secs]
[ParNew 698170K->690556K(1048512K), 0.0431870 secs]
[ParNew 706236K->695977K(1048512K), 0.0403150 secs]
[ParNew 711914K->705771K(1048512K), 0.0471480 secs]
[ParNew 721819K->714181K(1048512K), 0.0472770 secs]
[ParNew 730118K->723989K(1048512K), 0.0489510 secs]
[ParNew 739856K->728033K(1048512K), 0.0427410 secs]
[ParNew 744981K->737939K(1048512K), 0.0475430 secs]
[ParNew 753970K->744992K(1048512K), 0.0492700 secs]
[ParNew 760265K->751943K(1048512K), 0.0487280 secs]
[ParNew 767510K->760407K(1048512K), 0.0492930 secs]
[ParNew 774921K->767371K(1048512K), 0.0497330 secs]
[ParNew 781767K->777358K(1048512K), 0.0515280 secs]
[ParNew 796187K->785857K(1048512K), 0.0489430 secs]
[ParNew 802012K->791018K(1048512K), 0.0492380 secs]
[ParNew 806717K->798023K(1048512K), 0.0508110 secs]
[ParNew 813756K->805026K(1048512K), 0.0502660 secs]
[ParNew 820300K->813410K(1048512K), 0.0523530 secs]
[ParNew 828669K->820421K(1048512K), 0.0531150 secs]
[ParNew 835723K->828524K(1048512K), 0.0555550 secs]
[ParNew 843832K->837071K(1048512K), 0.0562570 secs]
[ParNew 853327K->847000K(1048512K), 0.0570410 secs]
[ParNew 866156K->856950K(1048512K), 0.0548350 secs]
[ParNew 872116K->864444K(1048512K), 0.0551520 secs]
[ParNew 880555K->874405K(1048512K), 0.0588060 secs]
[ParNew 891761K->884374K(1048512K), 0.0614500 secs]
[ParNew 901149K->892649K(1048512K), 0.0604340 secs]
[ParNew 909993K->899791K(1048512K), 0.0571610 secs]
[ParNew 915463K->905058K(1048512K), 0.0550040 secs]
[ParNew 921114K->915056K(1048512K), 0.0633450 secs]
[ParNew 930810K->928141K(1048512K), 0.0671490 secs]
[ParNew 943358K->932331K(1048512K), 0.0593550 secs]
[ParNew 947614K->936528K(1048512K), 0.0604400 secs]
[ParNew 952766K->946550K(1048512K), 0.0666420 secs]
[ParNew 962141K->955223K(1048512K), 0.0655340 secs]
[ParNew 971115K->962350K(1048512K), 0.0617730 secs]
[ParNew 978155K->969481K(1048512K), 0.0618740 secs]
[ParNew 985539K->979532K(1048512K), 0.0669180 secs]
[ParNew 995150K->983764K(1048512K), 0.0599040 secs]
[ParNew 998815K->990921K(1048512K), 0.0688710 secs]
[ParNew 1004588K->996963K(1048512K), 0.0645710 secs]
[ParNew 1015805K->1008164K(1048512K), 0.0683150 secs]
[ParNew 1023895K->1015340K(1048512K), 0.0679410 secs]
[ParNew 1031291K->1026561K(1048512K), 0.0752900 secs]
[Full GC 1042324K->261308K(1048512K), 1.7138170 secs]
Again and again.
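
To put a number on the log above, here is a minimal Java sketch that reads the heap occupancy left after each ParNew collection and reports how much it grew between the first and last entries. The class and method names (GcLogGrowth, occupancyAfterKb, retainedGrowthKb) are hypothetical, and the regular expression only assumes the line shape shown in this post. Lines of other shapes, such as the "[GC 334574K(1048512K), ...]" entry or the Full GC line, are skipped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tomcat or the JDK.
class GcLogGrowth {
    // Matches lines such as "[ParNew 303700K->287850K(1048512K), 0.0119170 secs]".
    private static final Pattern PARNEW = Pattern.compile(
            "\\[ParNew (\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs\\]");

    /** Heap occupancy in KB after a ParNew collection, or -1 if the line has another shape. */
    static long occupancyAfterKb(String line) {
        Matcher m = PARNEW.matcher(line.trim());
        return m.matches() ? Long.parseLong(m.group(2)) : -1L;
    }

    /** Growth in KB of post-collection occupancy between the first and last ParNew lines. */
    static long retainedGrowthKb(List<String> lines) {
        long first = -1L;
        long last = -1L;
        for (String line : lines) {
            long after = occupancyAfterKb(line);
            if (after < 0) {
                continue; // not a ParNew line, ignore it
            }
            if (first < 0) {
                first = after;
            }
            last = after;
        }
        return first < 0 ? 0L : last - first;
    }

    public static void main(String[] args) {
        // First and last ParNew entries copied from the log above, plus one non-matching line.
        List<String> sample = Arrays.asList(
                "[ParNew 303700K->287850K(1048512K), 0.0119170 secs]",
                "[GC 334574K(1048512K), 0.0054240 secs]",
                "[ParNew 1031291K->1026561K(1048512K), 0.0752900 secs]");
        System.out.println("Retained growth (KB): " + retainedGrowthKb(sample));
    }
}
```

On the first and last ParNew entries above this gives 738711 KB, so about 720 MB of data survived young collections in those ten seconds, until the Full GC brought the heap back to 261308K.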

Sometimes, after a while, the cluster settles down. Then I don't touch it
anymore, and go to church... The other replicated applications (on the
same Tomcat instance) deploy fine, but only after waiting for this one to
fail, so startup is apparently not a parallel process.

Well, I circumvented the problem by using an asynchronous replication method
for this site. So, what I understand is that there is something wrong in the
way the heartbeat is managed. If the active node is busy trying to replicate
the sessions (pooled method), it doesn't answer the heartbeat. On the
other side, the starting node, which is receiving about 15 connections on its
replication port, does not deduce that the other node is alive.

Using asynchronous replication lets the active node answer the heartbeat
request, and everything goes well.

So, I am open to any suggestions (including migration from 5.0.28 to 5.5.x)
that would correct this bug (or misfeature).

Here is the configuration we are using:

<Cluster className="org.apache.catalina.cluster.tcp.SimpleTcpCluster"
         managerClassName="org.apache.catalina.cluster.session.DeltaManager"
         debug="10"
         expireSessionsOnShutdown="false"
         useDirtyFlag="true">
    <Membership
        className="org.apache.catalina.cluster.mcast.McastService"
        mcastAddr="228.0.0.137"
        mcastPort="45601"
        mcastFrequency="500"
        mcastDropTime="3000"/>
    <Receiver
        className="org.apache.catalina.cluster.tcp.ReplicationListener"
        tcpListenAddress="auto"
        tcpListenPort="4138"
        tcpSelectorTimeout="100"
        tcpThreadCount="50"/>
    <Sender
        className="org.apache.catalina.cluster.tcp.ReplicationTransmitter"
        replicationMode="asynchronous"/>
    <Valve className="org.apache.catalina.cluster.tcp.ReplicationValve"
           filter=".*\.gif;.*\.js;.*\.jpg;.*\.htm;.*\.html;.*\.txt; .*\.css; .*\.swf;"/>
</Cluster>
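
One detail in the Valve filter may be worth checking: the last two entries are preceded by a space. The Java sketch below assumes the filter string is split on ";" and each piece is compiled as a regular expression that must match the whole request URI, without trimming. That is an assumption about how ReplicationValve handles the attribute, not something verified against the 5.0.28 source. FilterCheck, compile and isFiltered are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical helper that mimics an ASSUMED split-on-";" filter parser.
class FilterCheck {
    /** Splits the filter on ';' and compiles each entry, optionally trimming whitespace. */
    static List<Pattern> compile(String filter, boolean trim) {
        List<Pattern> patterns = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(filter, ";");
        while (tokens.hasMoreTokens()) {
            String entry = tokens.nextToken();
            if (trim) {
                entry = entry.trim();
            }
            if (!entry.isEmpty()) {
                patterns.add(Pattern.compile(entry));
            }
        }
        return patterns;
    }

    /** True if any pattern matches the whole URI. */
    static boolean isFiltered(List<Pattern> patterns, String uri) {
        for (Pattern p : patterns) {
            if (p.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Filter value copied from the configuration above.
        String filter = ".*\\.gif;.*\\.js;.*\\.jpg;.*\\.htm;.*\\.html;.*\\.txt; .*\\.css; .*\\.swf;";
        String uri = "/style.css"; // example URI, chosen for illustration
        System.out.println("untrimmed: " + isFiltered(compile(filter, false), uri));
        System.out.println("trimmed:   " + isFiltered(compile(filter, true), uri));
    }
}
```

Under that assumption, " .*\.css" only matches URIs that begin with a space, so requests for .css and .swf files would not be excluded from replication. Removing the spaces from the attribute would avoid the question either way.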

Any help appreciated.

François.

---------------------------------------------------------------------
To unsubscribe, e-mail: tomcat-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: tomcat-user-help@jakarta.apache.org