www-infrastructure-issues mailing list archives

From "Ted Husted (JIRA)" <j...@apache.org>
Subject [jira] Created: (INFRA-1334) Intermittent Out Of Memory failures
Date Fri, 24 Aug 2007 12:27:32 GMT
Intermittent Out Of Memory failures
-----------------------------------

                 Key: INFRA-1334
                 URL: https://issues.apache.org/jira/browse/INFRA-1334
             Project: Infrastructure
          Issue Type: Bug
      Security Level: public (Regular issues)
          Components: JIRA
            Reporter: Ted Husted


[Summarized from an email thread]

Occasionally, JIRA on Brutus will fail. Sometimes it seems to "spin out of control", knocking
other services offline, including Bugzilla. At other times, though, the other JIRA instances
have remained responsive even when one fails.

It's possible that we are seeing both ordinary OOMs and PermGen OOMs. The most recent set were
caused by some kind of client tool sending a malformed request that ended up asking for a huge
amount of data and consuming a correspondingly large amount of memory. It's not clear whether
the memory usage is a leak or a code path that isn't written to stream (pipeline) its results.

Client tools account for roughly half of the current JIRA traffic:

 % wc -l issues_weblog 
 738289
 % grep -c soapservice issues_weblog 
 372529

The underlying problem may be that ill-conceived search requests can consume too much memory
and bring the system down. (For other services, like Subversion, we've found ways to keep
the CPUs in check.)

These requests are being made by tools like Mylyn, which some developers find very useful;
such tools increase the integration between development environments and the issue tracker.
A page like this

 * http://velocity.apache.org/engine/releases/velocity-1.5/jira-report.html

is being built straight from the issue tracker.

It is important that we take steps: we can't trust a system that is starting to throw OOMs,
since that could potentially lead to bad data.

Since there seems to be more than one type of error condition, whenever a JIRA instance goes
down the first thing we should check is whether the other instances are still responsive. We
need, for example, to isolate issues with the httpd front end.
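As a rough sketch of that kind of check (the instance URLs below are placeholders, not the
actual Brutus configuration), a small probe could hit each front end and report whether it
answers at all, both through httpd and directly against each JVM where that is possible:

  import java.net.HttpURLConnection;
  import java.net.URL;

  public class InstanceProbe {
      // Placeholder URLs; the real list of Brutus front ends would go here.
      private static final String[] INSTANCES = {
          "https://issues.apache.org/jira/",
          "https://issues.apache.org/bugzilla/"
      };

      public static void main(String[] args) {
          for (String instance : INSTANCES) {
              try {
                  HttpURLConnection conn =
                          (HttpURLConnection) new URL(instance).openConnection();
                  conn.setRequestMethod("HEAD");
                  conn.setConnectTimeout(5000);
                  conn.setReadTimeout(10000);
                  System.out.println(instance + " -> HTTP " + conn.getResponseCode());
                  conn.disconnect();
              } catch (Exception e) {
                  System.out.println(instance + " -> unreachable: " + e);
              }
          }
      }
  }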

We might want to look at setting up the Java Service Wrapper (http://wrapper.tanukisoftware.org)
for our JVM(s); it's available for both Linux and Solaris. One thing the wrapper can do is
monitor the JVM output for things such as OOM messages. Having the wrapper shut the JVM down
and send a mail might help us isolate the problem. If possible, it would also be helpful for
the wrapper to monitor JMX events within the monitored JVM, so that it can take pro-active
action.
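On the JMX side, something along these lines could run inside the JIRA JVM to raise an alert
before a hard OOM; the 90% threshold and the class name are only illustrative, and this is
independent of whatever hooks the Wrapper itself provides:

  import java.lang.management.ManagementFactory;
  import java.lang.management.MemoryMXBean;
  import java.lang.management.MemoryNotificationInfo;
  import java.lang.management.MemoryPoolMXBean;
  import java.lang.management.MemoryType;
  import javax.management.Notification;
  import javax.management.NotificationEmitter;
  import javax.management.NotificationListener;

  public class MemoryWatcher {
      public static void main(String[] args) {
          // Ask each heap pool to notify us once usage crosses 90% of its maximum.
          for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
              if (pool.getType() == MemoryType.HEAP
                      && pool.isUsageThresholdSupported()
                      && pool.getUsage().getMax() > 0) {
                  pool.setUsageThreshold((long) (pool.getUsage().getMax() * 0.9));
              }
          }

          // The MemoryMXBean emits the threshold-exceeded notification via JMX.
          MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
          NotificationEmitter emitter = (NotificationEmitter) memory;
          emitter.addNotificationListener(new NotificationListener() {
              public void handleNotification(Notification n, Object handback) {
                  if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
                      // Here we would log, mail the list, or tell the wrapper
                      // to restart the JVM before it actually throws an OOM.
                      System.err.println("Heap threshold exceeded: " + n.getMessage());
                  }
              }
          }, null, null);
      }
  }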

The JIRA searches are based on Lucene. We might want to consider asking one of our ASF Lucene
experts to review the Atlassian source code. 

Aside from searches, another cause could be something like DNS resolution via InetAddress.
There is a known issue with Java's default DNS cache that can leak memory for each IP that
hits the server. Any server application that does DNS resolution should be using dnsjava,
which has a proper TTL-managed cache. There is also a JVM-specific workaround that can be
used in the meantime.
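For reference, the usual workaround is to cap the JVM's own InetAddress cache via the
networkaddress.cache.ttl security property (or -Dsun.net.inetaddr.ttl on Sun JVMs). A minimal
sketch, with illustrative TTL values:

  import java.security.Security;

  public class DnsCacheConfig {
      public static void main(String[] args) {
          // Must run before the first InetAddress lookup to take effect.
          // The same values can be set in $JAVA_HOME/lib/security/java.security.
          Security.setProperty("networkaddress.cache.ttl", "300");          // positive lookups: 5 minutes
          Security.setProperty("networkaddress.cache.negative.ttl", "10");  // failed lookups: 10 seconds
      }
  }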



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

