couchdb-user mailing list archives

From Joan Touzet <woh...@apache.org>
Subject Re: Respawning server died, can't figure out why
Date Tue, 13 Aug 2013 23:15:31 GMT
On Tue, Aug 13, 2013 at 02:49:28PM -0500, Nathan Vander Wilt wrote:
> I've got 1.7GB disk free and 2GB of memory available at the moment, so it doesn't seem
> to be either of those. (I could not find any out-of-memory process kill logs in /var/log/syslog.)
> The only clue I can find is in couchdb.stderr:
>     heart_beat_kill_pid = 1390
>     heart_beat_timeout = 11
>     heart: Tue Aug 13 18:34:21 2013: heart-beat time-out, no activity for 15 seconds
>     Killed

So 15s of system clock time passed without Erlang's heart receiving a
ping back. There are a number of possibilities; for instance, if this is a
VM and the clock was advanced/changed by 15s to synchronize with the
main system, heart might see that and issue a kill command. Another
could be extremely heavy load on the system forcing the second couch
process to get swapped out.
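
To illustrate the clock-jump case, here is a minimal Python sketch. It is
not how heart itself is implemented; the function and variable names are
hypothetical, and the 11-second timeout is simply the heart_beat_timeout
value from the quoted log. It only shows why a watchdog that measures
silence with the wall clock can fire after a clock step, while the same
check against a monotonic clock would not.

```python
def heartbeat_expired(last_beat: float, now: float, timeout: float) -> bool:
    """Return True when at least `timeout` seconds separate `last_beat` and `now`."""
    return (now - last_beat) >= timeout


TIMEOUT = 11.0        # heart_beat_timeout from the quoted log
REAL_ELAPSED = 2.0    # assumed: only 2 real seconds since the last ping
CLOCK_STEP = 15.0     # assumed: wall clock stepped forward by 15 s (e.g. VM sync)

# Readings taken when the last heartbeat arrived (arbitrary example values).
last_beat_wall = 1000.0
last_beat_mono = 50.0

# Readings taken when the watchdog checks again.
now_wall = last_beat_wall + REAL_ELAPSED + CLOCK_STEP  # wall clock sees the step
now_mono = last_beat_mono + REAL_ELAPSED               # monotonic clock does not

wall_verdict = heartbeat_expired(last_beat_wall, now_wall, TIMEOUT)
mono_verdict = heartbeat_expired(last_beat_mono, now_mono, TIMEOUT)

print("wall clock says expired:", wall_verdict)
print("monotonic clock says expired:", mono_verdict)
```

With these assumed numbers the wall-clock check reports a timeout even
though only 2 seconds really passed, which is the failure mode described
above.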

Three suggestions:

  1. Set RESPAWN_TIMEOUT to a non-zero value to force couch to restart
     after a kill. Because of its crash-only design this is safe, and
     since restarts are rare you're unlikely to run into serious
     issues.
  2. Crank up logging to debug level to see what might be going on
     when the heartbeat fails to respond.
  3. Add some additional system monitoring to ensure that you're not
     overloading your system on CPU, RAM, I/O or network traffic.
     Do you have a lot of views building / heavy system load due to
     couchjs processes?
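
For the third suggestion, a small sampler along these lines could be a
starting point. This is a sketch under assumptions, not a CouchDB tool:
the names parse_meminfo and sample are hypothetical, it uses only the
Python standard library, and the memory figure is read from
/proc/meminfo, so that part is Linux-only. The idea is to record
timestamped load, free disk and available memory so they can be lined up
against the time of the next heart kill in couchdb.stderr.

```python
import os
import shutil
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def sample(path="/"):
    """Take one snapshot of load average, free disk space and available memory."""
    snapshot = {"time": time.strftime("%Y-%m-%d %H:%M:%S")}

    # Free space on the filesystem containing `path`, in MiB.
    snapshot["disk_free_mb"] = shutil.disk_usage(path).free // (1024 * 1024)

    # 1-minute load average; skipped if the platform cannot provide it.
    try:
        snapshot["load1"] = os.getloadavg()[0]
    except (OSError, AttributeError):
        pass

    # Available memory in MiB (Linux only; /proc/meminfo reports kB).
    try:
        with open("/proc/meminfo") as fh:
            mem = parse_meminfo(fh.read())
        available_kb = mem.get("MemAvailable", mem.get("MemFree"))
        if available_kb is not None:
            snapshot["mem_available_mb"] = available_kb // 1024
    except OSError:
        pass

    return snapshot


print(sample())
```

Calling sample() every few seconds from a loop or a cron job and
appending the result to a file gives a record to compare against the
heart time-out timestamps. It does not cover network traffic or per-process
I/O, which would need a dedicated monitoring tool.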

-- 
Joan Touzet | joant@atypical.net | wohali everywhere else
