From: "Carl"
To: "Tomcat Users List"
Subject: Re: Tomcat dies suddenly
Date: Thu, 4 Feb 2010 08:33:13 -0500

Mark,

This was both helpful and intriguing.

1. I had always used top to see memory used until I saw the system monitor tools in Slackware, and had not compared the two. At this moment, the system monitor is reporting .96GB of memory used while top and vmstat are reporting 3.6GB... quite a difference. From now on, top/vmstat it is. Further, the fact that this machine is running that close to its 4GB of physical memory would seem to make it a candidate for failure under a fair amount of activity. Today could be interesting and revealing.

2. The only reference to 'Runtime' I could find in the code was in a try-catch in the ASTranslatorFactory where it throws a RuntimeException. We use this package in the process for communicating with Flash applications (part of our application uses Flash to provide a richer environment.) The ASTranslator jars are the latest ones and have not been changed since the middle of 2007. I am not certain how the process works inside, but I would have thought the jars would have been updated if there were problems.

3. I am not certain I understood your explanation of potential DNS problems. This server is very simple: it receives requests from the outside, processes them (usually accessing a data server which is in the /etc/hosts file) and sends the response on its way. During the processing, there is no access to the outside world that I know of. I would think that if a request to the outside world were causing a problem, we would see failures in a specific area of the overall system, but we are not seeing anything like that.

I did cut the Xms/Xmx in half in an attempt to force the problem, but nothing happened (the system worked just fine) and I have since moved it back to its old setting (1024m).

Thanks for your ideas and comments,

Carl

----- Original Message -----
From: "Mark Eggers"
To: "Tomcat Users List"
Sent: Wednesday, February 03, 2010 11:46 PM
Subject: Re: Tomcat dies suddenly

Carl,

A couple of random thoughts . . .

I'm not familiar with the Slackware monitoring tools, but I am with the various tools that come with Fedora / Redhat. One of the things that I've noticed with those GUI tools is that they add cache and buffers to the free memory total. Tools like top and vmstat should give a more complete picture of your memory. With vmstat you can watch free, cache, buffers, and swap conveniently. With top, you can actually run a command-line monitor and watch a particular PID.

From the taroon-list: if you're running a 32 bit Linux and run out of low memory, it doesn't matter how much high memory you have, the OOM killer will start killing processes off. Since you're running a 64 bit Linux, this should not be the problem.

A discussion on stackoverflow.com may be more relevant to your situation. It turns out (according to the discussion) that calling Runtime.getRuntime().exec() on a busy system can lead to transient memory shortages which trigger the OOM killer. If Runtime.getRuntime().exec() or similar calls do not exist in your application, then please skip the following speculation. I've made some comments concerning host resolution at the end of this message which might be helpful.

If Runtime.getRuntime().exec() is used, the scenario goes like this (a minimal sketch in Java follows the reference below):

1. Runtime.getRuntime().exec() is called.
2. fork() gets called and makes a copy of the parent process.
3. The system runs a different process. At this point you have two processes with largish memory requirements, and the OOM killer may get triggered.
4. exec() gets called on the child process and memory requirements go back down.

At least that's how I read this reference:

http://stackoverflow.com/questions/209875/from-what-linux-kernel-libc-version-is-java-runtime-exec-safe-with-regards-to-m
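To make that transient spike concrete, here is a minimal sketch. The class name, the /bin/true child command, and the ballast size are illustrative assumptions only, not anything taken from the application under discussion:

import java.io.IOException;

public class ExecSpikeSketch {
    public static void main(String[] args)
            throws IOException, InterruptedException {
        // Simulate a busy JVM with a large committed heap (run with,
        // say, -Xms1024m -Xmx1024m); the bigger the parent process,
        // the bigger the transient fork() cost.
        byte[] ballast = new byte[256 * 1024 * 1024];

        // Historically on Linux, Runtime.exec() forks the JVM and then
        // execs the child. Between those two calls the kernel must
        // account for a second copy of the entire JVM address space;
        // on a loaded machine with little free memory and swap, that
        // momentary doubling is what can invite the OOM killer, even
        // though exec() releases it almost immediately.
        Process child = Runtime.getRuntime().exec("/bin/true");
        int status = child.waitFor();
        System.out.println("child exit status: " + status
                + " (ballast " + ballast.length + " bytes)");
    }
}

How much of that copy actually costs anything depends on copy-on-write and the kernel's memory-overcommit settings, which is one reason the same application can behave differently on different kernels.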
Since processes that fork a lot of child processes are high on the OOM killer's kill list, Tomcat gets killed. See for example:

http://prefetch.net/blog/index.php/2009/09/30/how-the-linux-oom-killer-works/

As to why it would happen on the newer production systems and not the older system, my only idea concerns the version of the kernel you're using. Memory management has been significantly reworked between the 2.4 and 2.6 kernels. If you use a 2.4 kernel on your older system, this could explain some of the differences with memory allocation.

So, if Runtime.getRuntime().exec() is used, what are some possible solutions?

1. Reduce Xms/Xmx while adding physical memory. If you do this, then the fork() call without the exec() being called directly afterwards won't be as expensive. Your application will be able to serve more clients without potentially triggering the OOM killer. Garbage collection may be an issue if this is done, so tuning with JMeter is probably a good idea.

2. Create a lightweight process that forks what Runtime.getRuntime().exec() calls, and communicate with that process over sockets. This is pretty unpleasant, but you might be able to treat this as a remote process server. You could then end up using a custom object, JNDI lookups, and pooling, much like database pooling.

As I've said, this is all based on the assumption that the application is requesting a transiently large amount of memory because of Runtime.getRuntime().exec() or some similar action. If this is not the case, then the above arguments are null and void.

DNS Thoughts

As for the ideas concerning DNS - I've never seen DNS issues actually take down an environment. However, I've seen orders-of-magnitude performance issues caused by poorly configured DNS resolution and missing DNS entries.

One way to test for DNS performance issues is to set up a client with a static IP address, but don't put it in your local DNS. Then run JMeter on this client and stress your server. Finally, add the client into DNS and stress your server with JMeter again. If you notice a difference, then there are some issues with how your server uses host resolution.

Make sure that nonexistent address resolution services (nisplus, nis, hesiod) are not listed as sources on the hosts: line in /etc/nsswitch.conf (or wherever Slackware puts it). At the least, put a [NOTFOUND=return] entry after dns but before all the other services listed on the hosts: line of the nsswitch.conf file (see the example lines below).
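As an illustration, the hosts: line might look something like this. The exact services vary by distribution; these lines are an assumption, not copied from a real Slackware configuration:

# /etc/nsswitch.conf -- illustrative hosts: lines only
#
# preferred: list only services that actually exist on this box
hosts: files dns

# or, if other sources must stay listed, stop after a definitive
# DNS "not found" instead of falling through to slower services:
hosts: files dns [NOTFOUND=return] nis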
So, here's a summary of all of this rambling:

1. Monitor memory with vmstat and top to get a better picture of the system memory.
2. If Runtime.getRuntime().exec() is used, then transient memory allocations could trigger the OOM killer on a busy system.
3. Make sure host resolution works properly, and turn DNS lookups off in server.xml (enableLookups="false" on the Connector).

OK, enough rambling - hope this is useful.

/mde/

--- On Wed, 2/3/10, Carl wrote:

> From: Carl
> Subject: Re: Tomcat dies suddenly
> To: "Tomcat Users List"
> Date: Wednesday, February 3, 2010, 5:07 PM
>
> Chris,
>
> Interesting idea. I tried over the weekend to force that situation
> with JMeter hitting a simple jsp that did some data stuff and created
> a small display. I pushed it to the point that there were entries in
> the log stating it was out of memory (when attempting to GC, I think)
> but it just slowed way down and never crashed. I could see from
> VisualVM that it had used the entire heap but, again, I could never
> get it to crash.
>
> Strange because it doesn't have the classic signs (slowing down or
> throwing out of memory exceptions or freezing), it just disappears
> without any tracks. I am certain there is a reason somewhere, I just
> haven't found it yet.
>
> Thanks for your suggestions,
>
> Carl

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org
---------------------------------------------------------------------