couchdb-user mailing list archives

From Nathan Vander Wilt <nate-li...@calftrail.com>
Subject Re: Respawning server died, can't figure out why
Date Tue, 20 Aug 2013 14:48:59 GMT
On Aug 13, 2013, at 6:15 PM, Joan Touzet wrote:

> On Tue, Aug 13, 2013 at 02:49:28PM -0500, Nathan Vander Wilt wrote:
>> I've got 1.7GB disk free and 2GB of memory available at the moment, so it doesn't
>> seem to be either of those. (I could not find any out-of-memory process kill logs in /var/log/syslog.)
>> The only clue I can find is in couchdb.stderr:
>>    heart_beat_kill_pid = 1390
>>    heart_beat_timeout = 11
>>    heart: Tue Aug 13 18:34:21 2013: heart-beat time-out, no activity for 15 seconds
>>    Killed
> 
> So 15s of system clock time passed without erlang's heart receiving a
> ping back. There's a number of possibilities; for instance, if this is a
> VM and the clock was advanced/changed by 15s to synchronize with the
> main system, heart might see that and issue a kill command. Another
> could be extremely heavy load on the system forcing the second couch
> process to get swapped out.
> 
> Three suggestions:
> 
>  1. set RESPAWN_TIMEOUT to a non-zero value to force couch to restart
>     after a kill. Because of its crash-only design this is safe, and
>     since restarts are rare you're liable to not really be running
>     into serious issues.
>  2. Crank up logging to debug level to see what might be going on
>     when the heartbeat fails to respond.
>  3. Add some additional system monitoring to ensure that you're not
>     overloading your system on CPU, RAM, I/O or network traffic.
>     Do you have a lot of views building / heavy system load due to
>     couchjs processes?


Thanks for these suggestions, Joan. Unfortunately, the server seems to be in quite the opposite
situation: this is an m1.medium instance used as a dev server that spends most of its
time neglected ("System load:  0.0", "Memory usage: 18%"). The only activity is the two CouchDB
daemons, each getting a half dozen already-caught-up replications triggered every 10 minutes,
and one node.js server sitting waiting for someone to log in.

It looks like I am already using RESPAWN_TIMEOUT via the -r command line option. That's why
I was surprised the server stayed down for the hour or so it took for us to notice.

I'm guessing that to crank up logging I need to set _config/log/level to the string "debug",
and I will try that if this problem keeps recurring. I'm hesitant to simply set it now,
as I already know that running out of disk space has also caused CouchDB to fail to respawn ;-)
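
For reference, a rough sketch of what I have in mind, assuming the CouchDB 1.x config API (PUT a JSON string to /_config/log/level). The helper names, the 127.0.0.1:5984 address, and the 500 MB free-space threshold are just placeholders for illustration, and an admin login may be needed on a server that isn't in admin-party mode. The request is only built here, not sent:

```python
import json
import shutil
import urllib.request


def build_log_level_request(base_url, level="debug"):
    """Build (but do not send) a PUT request for /_config/log/level.

    Assumes the CouchDB 1.x config API, where the request body is the
    new value encoded as a JSON string, e.g. "debug" with the quotes.
    """
    url = base_url.rstrip("/") + "/_config/log/level"
    body = json.dumps(level).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    return req


def enough_disk_free(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least
    `min_free_bytes` free. Debug logs grow quickly, so this is worth
    checking before raising the log level."""
    return shutil.disk_usage(path).free >= min_free_bytes


# Hypothetical usage; the address and threshold are placeholders:
#
#   req = build_log_level_request("http://127.0.0.1:5984")
#   if enough_disk_free("/", 500 * 1024 * 1024):
#       urllib.request.urlopen(req)  # this line would actually send the PUT
```

Checking free space first is only a guard against the disk-full failure mentioned above; it does nothing to explain the heart-beat time-out itself.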

thx,
-nvw