From: "Francois JEANMOUGIN"
To: "Tomcat Users List"
Subject: TCPCluster problem on heavily loaded webapps
Date: Mon, 25 Apr 2005 14:44:01 +0200
Message-ID: <4B05B471CA57554C8E0361B442971086B4F431@devserv04.tls.123multimedia.com>

Hi,

I have several webapps using the TCP session cluster. It works well for all of them except one. If I restart Tomcat on one node, I get a lot of problems at startup. We are using jakarta-tomcat-5.0.28 and jdk-1.5.0_02 on Linux.

First, it fails to receive the heartbeat:

INFO: Replication member added:org.apache.catalina.cluster.mcast.McastMember[tcp://xxx.xxx.xxx.193:4138,xxx.xxx.xxx.193,4138, alive=26084032]
Creating ClusterManager for context using class org.apache.catalina.cluster.session.DeltaManager
- Starting clustering manager...:
- Wasn't able to read acknowledgement from server[xxx.xxx.xxx.193/xxx.xxx.xxx.193:4138] in 15000 ms. Disconnecting socket, and trying again.

Second, it seems to fail to receive the session state:

Apr 24, 2005 9:27:28 PM org.apache.catalina.cluster.tcp.ReplicationTransmitter sendMessageData
WARNING: Unable to send replicated message, is server down?
[Here is a stack trace about a connect timeout]
- Manager[], requesting session state from org.apache.catalina.cluster.mcast.McastMember[tcp://xxx.xxx.xxx.193:4138,xxx.xxx.xxx.193,4138, alive=26116195]. This operation will timeout if no session state has been received within 60 seconds
- Manager[], No session state received, timing out.

If I run netstat, I see a lot of connections from xxx.xxx.xxx.193 (the active node) to xxx.xxx.xxx.191:4138 (the one which is restarting). There are at least 900 active sessions on the active node, and up to 3000 sessions.

After these errors, the application goes wild. Here is a 10s (ten seconds!)
GC activity:

[ParNew 303700K->287850K(1048512K), 0.0119170 secs]
[ParNew 302774K->294405K(1048512K), 0.0155040 secs]
[ParNew 309868K->303795K(1048512K), 0.0163760 secs]
[ParNew 318870K->311786K(1048512K), 0.0185220 secs]
[ParNew 333775K->321247K(1048512K), 0.0131610 secs]
[ParNew 336638K->330812K(1048512K), 0.0189300 secs]
[GC 334574K(1048512K), 0.0054240 secs]
[ParNew 345925K->338824K(1048512K), 0.0206210 secs]
[ParNew 354860K->348259K(1048512K), 0.0208230 secs]
[ParNew 367204K->354891K(1048512K), 0.0165160 secs]
[ParNew 371126K->361510K(1048512K), 0.0195450 secs]
[ParNew 377575K->369064K(1048512K), 0.0210910 secs]
[ParNew 384706K->375691K(1048512K), 0.0207330 secs]
[ParNew 390910K->382326K(1048512K), 0.0220530 secs]
[ParNew 398898K->387537K(1048512K), 0.0192510 secs]
[ParNew 403202K->397029K(1048512K), 0.0248840 secs]
[ParNew 413248K->403687K(1048512K), 0.0232250 secs]
[ParNew 419890K->410353K(1048512K), 0.0251610 secs]
[ParNew 426571K->415591K(1048512K), 0.0213720 secs]
[ParNew 431847K->425121K(1048512K), 0.0265630 secs]
[ParNew 440990K->432771K(1048512K), 0.0268060 secs]
[ParNew 449594K->439468K(1048512K), 0.0249090 secs]
[ParNew 453366K->447615K(1048512K), 0.0283870 secs]
[ParNew 463476K->455729K(1048512K), 0.0270220 secs]
[ParNew 470179K->463891K(1048512K), 0.0293800 secs]
[ParNew 479288K->473471K(1048512K), 0.0306060 secs]
[ParNew 489550K->481662K(1048512K), 0.0296620 secs]
[ParNew 497119K->491253K(1048512K), 0.0311730 secs]
[ParNew 507243K->500859K(1048512K), 0.0331790 secs]
[ParNew 516452K->510515K(1048512K), 0.0334260 secs]
[ParNew 526102K->520132K(1048512K), 0.0340760 secs]
[ParNew 535196K->526900K(1048512K), 0.0319520 secs]
[ParNew 542902K->532676K(1048512K), 0.0329050 secs]
[ParNew 548181K->540914K(1048512K), 0.0343770 secs]
[ParNew 556897K->550561K(1048512K), 0.0363350 secs]
[ParNew 566628K->561217K(1048512K), 0.0384490 secs]
[ParNew 577272K->567013K(1048512K), 0.0333970 secs]
[ParNew 582211K->573818K(1048512K), 0.0354440 secs]
[ParNew 590032K->580630K(1048512K), 0.0357870 secs]
[ParNew 594320K->588455K(1048512K), 0.0377730 secs]
[ParNew 604091K->598148K(1048512K), 0.0395440 secs]
[ParNew 614317K->607850K(1048512K), 0.0390720 secs]
[ParNew 614317K->607850K(1048512K), 0.0390720 secs]
[ParNew 622154K->616157K(1048512K), 0.0396040 secs]
[ParNew 632373K->624406K(1048512K), 0.0394080 secs]
[ParNew 640147K->631259K(1048512K), 0.0395380 secs]
[ParNew 646991K->640992K(1048512K), 0.0432430 secs]
[ParNew 657221K->649246K(1048512K), 0.0416180 secs]
[ParNew 664557K->657134K(1048512K), 0.0431080 secs]
[ParNew 673390K->666974K(1048512K), 0.0448190 secs]
[ParNew 683230K->673894K(1048512K), 0.0416870 secs]
[ParNew 689214K->682172K(1048512K), 0.0438750 secs]
[ParNew 698170K->690556K(1048512K), 0.0431870 secs]
[ParNew 706236K->695977K(1048512K), 0.0403150 secs]
[ParNew 711914K->705771K(1048512K), 0.0471480 secs]
[ParNew 721819K->714181K(1048512K), 0.0472770 secs]
[ParNew 730118K->723989K(1048512K), 0.0489510 secs]
[ParNew 739856K->728033K(1048512K), 0.0427410 secs]
[ParNew 744981K->737939K(1048512K), 0.0475430 secs]
[ParNew 753970K->744992K(1048512K), 0.0492700 secs]
[ParNew 760265K->751943K(1048512K), 0.0487280 secs]
[ParNew 767510K->760407K(1048512K), 0.0492930 secs]
[ParNew 774921K->767371K(1048512K), 0.0497330 secs]
[ParNew 781767K->777358K(1048512K), 0.0515280 secs]
[ParNew 796187K->785857K(1048512K), 0.0489430 secs]
[ParNew 802012K->791018K(1048512K), 0.0492380 secs]
[ParNew 806717K->798023K(1048512K), 0.0508110 secs]
[ParNew 813756K->805026K(1048512K), 0.0502660 secs]
[ParNew 820300K->813410K(1048512K), 0.0523530 secs]
[ParNew 828669K->820421K(1048512K), 0.0531150 secs]
[ParNew 835723K->828524K(1048512K), 0.0555550 secs]
[ParNew 843832K->837071K(1048512K), 0.0562570 secs]
[ParNew 853327K->847000K(1048512K), 0.0570410 secs]
[ParNew 866156K->856950K(1048512K), 0.0548350 secs]
[ParNew 872116K->864444K(1048512K), 0.0551520 secs]
[ParNew 880555K->874405K(1048512K), 0.0588060 secs]
[ParNew 891761K->884374K(1048512K), 0.0614500 secs]
[ParNew 901149K->892649K(1048512K), 0.0604340 secs]
[ParNew 909993K->899791K(1048512K), 0.0571610 secs]
[ParNew 915463K->905058K(1048512K), 0.0550040 secs]
[ParNew 921114K->915056K(1048512K), 0.0633450 secs]
[ParNew 930810K->928141K(1048512K), 0.0671490 secs]
[ParNew 943358K->932331K(1048512K), 0.0593550 secs]
[ParNew 947614K->936528K(1048512K), 0.0604400 secs]
[ParNew 952766K->946550K(1048512K), 0.0666420 secs]
[ParNew 962141K->955223K(1048512K), 0.0655340 secs]
[ParNew 971115K->962350K(1048512K), 0.0617730 secs]
[ParNew 978155K->969481K(1048512K), 0.0618740 secs]
[ParNew 985539K->979532K(1048512K), 0.0669180 secs]
[ParNew 995150K->983764K(1048512K), 0.0599040 secs]
[ParNew 998815K->990921K(1048512K), 0.0688710 secs]
[ParNew 1004588K->996963K(1048512K), 0.0645710 secs]
[ParNew 1015805K->1008164K(1048512K), 0.0683150 secs]
[ParNew 1023895K->1015340K(1048512K), 0.0679410 secs]
[ParNew 1031291K->1026561K(1048512K), 0.0752900 secs]
[Full GC 1042324K->261308K(1048512K), 1.7138170 secs]

Again and again.

Sometimes, after a while, the cluster settles down. Then I don't touch it anymore, and go to church... The other replicated applications (on the same Tomcat instance) deploy fine (after waiting for this one to fail; deployment is not a parallel process, as far as I can see)...

Well, I worked around the problem by using asynchronous replication for this site. So, what I understand is that there is something wrong in the way the heartbeat is managed. If the active node is busy trying to replicate the sessions (pooled method), it doesn't answer the heartbeat. On the other side, the starting node, which is receiving about 15 connections on its replication port, does not deduce that the other node is alive. Using asynchronous replication lets the active node answer the heartbeat request, and everything goes well.
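
For anyone who wants to put numbers on a GC trace like the one quoted earlier in this message (heap occupancy after each ParNew climbs from 287850K to 1026561K before the Full GC brings it back to 261308K), the entries can be summarised mechanically. What follows is only a minimal sketch, assuming the entries keep the exact `[ParNew before->after(total), pause secs]` shape shown above; the class `GcLogSummary`, its `Summary` fields, and the `summarize` method are hypothetical names invented for this illustration and are not part of Tomcat or the JDK.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: summarises ParNew / Full GC entries from a verbose GC trace. */
public class GcLogSummary {

    // Matches e.g. "[ParNew 303700K->287850K(1048512K), 0.0119170 secs]"
    // and "[Full GC 1042324K->261308K(1048512K), 1.7138170 secs]".
    // Entries of another shape, such as "[GC 334574K(1048512K), ...]", are skipped.
    private static final Pattern ENTRY = Pattern.compile(
            "\\[(ParNew|Full GC) (\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs\\]");

    /** Plain result holder; sizes are in kilobytes, as in the trace. */
    public static final class Summary {
        public int minorCount;            // number of ParNew entries
        public int fullCount;             // number of Full GC entries
        public double totalPauseSecs;     // sum of pauses over matched entries
        public long firstMinorAfterK = -1; // occupancy after the first ParNew
        public long lastMinorAfterK = -1;  // occupancy after the last ParNew
    }

    public static Summary summarize(String log) {
        Summary s = new Summary();
        Matcher m = ENTRY.matcher(log);
        while (m.find()) {
            long afterK = Long.parseLong(m.group(3));
            s.totalPauseSecs += Double.parseDouble(m.group(5));
            if ("ParNew".equals(m.group(1))) {
                s.minorCount++;
                if (s.firstMinorAfterK < 0) {
                    s.firstMinorAfterK = afterK;
                }
                s.lastMinorAfterK = afterK;
            } else {
                s.fullCount++;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Three entries copied from the trace above; paste the full trace to summarise it.
        String sample = "[ParNew 303700K->287850K(1048512K), 0.0119170 secs] "
                + "[ParNew 1031291K->1026561K(1048512K), 0.0752900 secs] "
                + "[Full GC 1042324K->261308K(1048512K), 1.7138170 secs]";
        Summary s = summarize(sample);
        System.out.printf(Locale.ROOT,
                "minor=%d full=%d pause=%.4fs retained growth=%dK%n",
                s.minorCount, s.fullCount, s.totalPauseSecs,
                s.lastMinorAfterK - s.firstMinorAfterK);
    }
}
```

The difference between the last and first post-ParNew occupancy is a rough measure of how much data survives the young generation during the window, which is the part of the trace that looks abnormal here.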
So, I am open to any suggestions (including migrating from 5.0.28 to 5.5.x) that would correct this bug (or misfeature).

Here is the configuration we are using:

Any help appreciated.

François.