tomcat-users mailing list archives

From Christopher Schultz <ch...@christopherschultz.net>
Subject Re: Recovery from OutOfMemoryError?
Date Wed, 01 Aug 2007 13:44:02 GMT

Chuck,

Caldarale, Charles R wrote:
>> From: Christopher Schultz [mailto:chris@christopherschultz.net] 
>> Subject: Re: Recovery from OutOfMemoryError?
> 
> (Sorry for not responding sooner.  Went out to dinner and to see the
> Spider Pig movie :-)

Nice. ;)

>> Actually, my past experience has been that it's the GC
>> thread that OOMEs, not a worker thread.
> 
> Assuming we're talking about a current HotSpot-based JVM, the threads
> doing GCs cannot get OOMEs, since they are dedicated to doing just GC
> operations, and never do any object allocations themselves.  On older
> JVMs (and some from other vendors), the thread that initially encounters
> an allocation failure also does the GC; if the GC fails to recover
> enough memory, it can generate an OOME for itself.

Like I said, it's been a loooong time since I've had to worry about
OOMEs that didn't result from honestly having too small of a heap to
handle the program's needs. It was probably a 1.3 JVM or something like
that.

>> It has always been my understanding that a JVM that suffers an OOME
>> is all but done for.
> 
> The JVM itself doesn't care about any exceptions thrown at the
> application.  There are certainly a ton of applications that handle such
> error conditions very badly, and hang themselves up by doing such things
> as trying to display messages rather than nulling out now useless
> references.  Some of the stress-testing of our JVM involves running apps
> designed to provoke OOMEs; these readily recover and keep on truckin'.
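
To illustrate the quoted point about dropping now-useless references instead of building messages, here is a minimal Java sketch. The class and method names (ReleaseOnOome, buildReport, retainedChunks) are hypothetical, and the deliberately oversized array request is only a stand-in for a real allocation failure; on HotSpot it is expected to be rejected with an OutOfMemoryError.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical module that holds long-lived partial results. */
final class ReleaseOnOome {

    // Long-lived state: it stays reachable after a failure unless released.
    private final List<long[]> partialResults = new ArrayList<long[]>();

    /**
     * Returns true on success. On OutOfMemoryError it discards the work
     * done so far and returns false to the caller.
     */
    boolean buildReport(int chunks, int finalSize) {
        try {
            for (int i = 0; i < chunks; i++) {
                partialResults.add(new long[8192]);   // 64 KB of partial work
            }
            partialResults.add(new long[finalSize]);  // the allocation that may fail
            return true;
        } catch (OutOfMemoryError e) {
            // Release references first. No logging or string building here,
            // since either could need memory that is not available.
            partialResults.clear();
            return false;
        }
    }

    int retainedChunks() {
        return partialResults.size();
    }

    public static void main(String[] args) {
        ReleaseOnOome module = new ReleaseOnOome();
        System.out.println("oversized request ok? " + module.buildReport(4, Integer.MAX_VALUE));
        System.out.println("chunks still held: " + module.retainedChunks());
        System.out.println("normal request ok? " + module.buildReport(4, 1024));
    }
}
```

The design choice is only the ordering in the catch block: release first, report later, once the collector has something to reclaim.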

Right. Which JVM are you working on, though? One of the mainstream ones?
Or something designed to be super high-availability (not that the
mainstream ones aren't...)?

>> The OP would seem to corroborate this claim, since it sounds like his
>> whole app server becomes unresponsive once he gets an OOME (hence the
>> early morning phone calls).
> 
> The supposed timing of the phone calls leaves me somewhat skeptical;
> what are they running where the peak load occurs at 3 AM?

I had thought of that, and it didn't make a whole lot of sense to me.
The only conclusion that I could draw was that some user (or several
users) caused the OOME and permanently disabled the server. At 03:00 (or
so) other users, perhaps in a different timezone, started trying to use
the server and found it unresponsive. Then again, maybe he runs an adult
website that gets most of its traffic at 3 in the morning. If not,
whoever he works for needs to get a more geographically diverse tech
support team ;)

>> If your assertion (OOMEs can be ignored, since only one allocation 
>> fails and the rest of the VM is fine) were true, then the OP would
>> not be getting any calls in the middle of the night: the user would
>> simply re-try the request and (hopefully) get a result the second
>> time.
> 
> That's not what I said at all.

Sorry. I was trying to recap a nuanced position in a single sentence.

> Each logical module should be designed
> to handle such situations, typically by discarding what has been done up
> to the point of failure, and then returning an error to its caller.
> What is likely to have happened instead in the OP's case is that the app
> encountering the OOME had no provision at all for error recovery, and
> simply quit, leaving many now useless objects around with live
> references to them.  It may have even made matters worse by trying to
> generate an error message of some sort.

I'm guessing he's running a webapp, and that one of the request worker
threads got an OOME. Most webapp requests are idempotent (or should be),
and those that aren't are generally wrapped around database or other
transactions. Assuming I'm right (which is frequently dangerous), one
failed request should not affect the rest of the application. Any
locally-instantiated objects should be ripe for collection, including
any of the "big ones" that probably caused the OOME in the first place.
The server should keep going, right?
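
A rough Java sketch of that expectation, under stated assumptions: OomeRecoveryDemo and handle are hypothetical names rather than Tomcat code, the status strings are made up, and a single oversized array request stands in for the "big ones". The large object is referenced only by a local, so nothing is left reachable once the request fails.

```java
/** Hypothetical stand-in for a request worker; not Tomcat code. */
final class OomeRecoveryDemo {

    /** Handles one "request" and reports failure instead of letting the error escape. */
    static String handle(int elements) {
        try {
            long[] big = new long[elements];  // referenced only by this local
            big[0] = 42L;
            return "200 OK (" + big.length + " longs)";
        } catch (OutOfMemoryError e) {
            // Constant string only: the failure path allocates nothing new.
            return "503 out of memory";
        }
    }

    public static void main(String[] args) {
        // An oversized request fails on its own...
        System.out.println(handle(Integer.MAX_VALUE));
        // ...and a later, reasonable request on the same JVM still succeeds.
        System.out.println(handle(1024));
    }
}
```

If a server built this way still hangs after an OOME, the interesting question is which references survived the failed request.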

For some reason, it doesn't. Maybe he's busting his PermGen, but that's
unlikely since he says it only happens under peak load. So, what is the
likely cause of the tech support call? The server must have gone down,
right? If it wasn't the servlet, and it wasn't Tomcat, and it wasn't the
JVM, what caused the outage?
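
One way to narrow down heap versus PermGen is to look at the JVM's memory pools through the standard java.lang.management API. The sketch below is only an assumption about how one might check; MemoryPoolReport and describePools are hypothetical names. Pool names differ by JVM and collector, and PermGen shows up as a non-heap pool only on older HotSpot releases (Java 8 replaced it with Metaspace).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists each memory pool with its current usage. */
final class MemoryPoolReport {

    static List<String> describePools() {
        List<String> lines = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue;  // getUsage() returns null for a pool that is no longer valid
            }
            String kind = (pool.getType() == MemoryType.HEAP) ? "heap" : "non-heap";
            long max = usage.getMax();  // -1 means the maximum is undefined
            lines.add(kind + " | " + pool.getName()
                    + " | used=" + usage.getUsed()
                    + " | max=" + (max < 0 ? "undefined" : String.valueOf(max)));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```

Logging this periodically under load would show whether a heap pool or a non-heap pool is the one approaching its maximum.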

-chris
