incubator-couchdb-user mailing list archives

From Dave Cottlehuber <...@jsonified.com>
Subject Re: couchdb crashes silently
Date Tue, 12 Nov 2013 07:08:57 GMT
On 11. November 2013 at 23:10:38, Nathan Vander Wilt (nate-lists@calftrail.com) wrote:
>  
> Aaaaand this happened *again* over the weekend. This time I had  
> started CouchDB in a screen session, which was still running.  
> Again, it looked like both the shell script processes and the  
> beam one were both still running, just…no Couch.
>  
> I had debug logs going, the stdout records shows the logger dying  
> again but not with any unicode error type event, just the last  
> log:
> https://gist.github.com/natevw/dcd4a9a973da01270735  
>  
> There is some "heart: Sat Nov 9 08:35:30 2013: heart-beat time-out,  
> no activity for 26 seconds" in the stderr log but I'm not sure it's  
> related or not…there seem to be a few more heart-beat time-outs  
> than actual CouchDB server failures.

When the heartbeat times out, the wrapper script kills and restarts BEAM; this is the Erlang
VM's heart monitoring at work.

> Any concrete suggestions…? This sucks. I'm burnt out poking  
> through debug logs on this, I'm embarrassed and angry every time  
> I discover my sites have been down for another day or two because  
> of this, and adding another layer of twine and baling wire in the  
> form of a _second_ shell watchdog script is not at all exciting  
> >:-(
>  
> regards,
> -natevw

Remotely it’s hard to offer useful help (so many possibilities) but:

heart timeouts:
- long-running NIFs can do this by blocking the scheduler; in particular, do you have any large
JSON docs moving in or out? It’s likely that reverting to R14B01 or B04 would resolve this, though
you’d need to rebuild couchdb.
- possibly HiPE. Assuming it’s Ubuntu (IIRC), you should be able to uninstall erlang-base-hipe
and install erlang-base to resolve that; restarting couch is required of course.
- there has been some mention of a possible resource leak in continuous replication (which
you have) but I’d not expect it to hang the BEAM, `just` crash it.
- are there any erlang turds (erl_crash.dump) lying around? They contain some useful debug
info.
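
To see how the heart time-outs line up with the actual outages, something like the following rough Python sketch could pull them out of the stderr log and look for crash dumps. It is only an illustration: the function names are made up, the regex assumes the line format quoted above ("heart: Sat Nov 9 08:35:30 2013: heart-beat time-out, no activity for 26 seconds"), and the timestamp parsing assumes an English/C locale.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed line format, taken from the stderr excerpt quoted in this thread:
# heart: Sat Nov 9 08:35:30 2013: heart-beat time-out, no activity for 26 seconds
HEART_RE = re.compile(
    r"heart: (?P<when>.+?): heart-beat time-out, "
    r"no activity for (?P<secs>\d+) seconds"
)


def parse_heart_timeouts(lines):
    """Return (timestamp, idle_seconds) for every heart time-out line found."""
    events = []
    for line in lines:
        match = HEART_RE.search(line)
        if match is None:
            continue  # not a heart time-out line
        when = datetime.strptime(match.group("when"), "%a %b %d %H:%M:%S %Y")
        events.append((when, int(match.group("secs"))))
    return events


def find_crash_dumps(root):
    """Return the paths of any erl_crash.dump files found below root."""
    return sorted(Path(root).rglob("erl_crash.dump"))
```

Feed parse_heart_timeouts() the open stderr log file and compare the timestamps with when the sites went down; point find_crash_dumps() at whatever directory couch was started from.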

resources:
You’ve not mentioned what the OS is doing at this time: anything in /var/log/messages
or dmesg or whatever? Is this a lean VPS or a “real box” getting timeouts? Personally
I’d install collectd or ganglia etc., pump general OS metrics out to graphite for comparison,
and also start collecting some erlang vm metrics too, given the possible resource leak.

I’ve not looked at recon (http://ferd.github.io/recon/) for this, but it could be useful;
right now I’d pick https://github.com/jsonified/estatsd and send erlang vm stats to graphite,
same as the OS stuff, and see what comes up when these issues occur.
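
As a stopgap until collectd or estatsd is set up, a few OS numbers can be pushed to graphite by hand. The Python sketch below is a stand-in, not how either of those tools works: it formats the load averages in graphite's plaintext protocol ("path value timestamp", one metric per line) and sends them over TCP. The metric prefix and host name are hypothetical, and port 2003 is assumed to be carbon's default plaintext listener on your install.

```python
import os
import socket
import time


def graphite_line(path, value, timestamp):
    """Format one metric as a newline-terminated graphite plaintext line."""
    return f"{path} {value} {int(timestamp)}\n"


def load_average_lines(prefix, now=None):
    """Build plaintext lines for the 1, 5 and 15 minute load averages (Unix only)."""
    now = time.time() if now is None else now
    one, five, fifteen = os.getloadavg()
    return [
        graphite_line(f"{prefix}.load.1min", one, now),
        graphite_line(f"{prefix}.load.5min", five, now),
        graphite_line(f"{prefix}.load.15min", fifteen, now),
    ]


def send_lines(lines, host, port=2003):
    """Send already formatted lines to a carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical usage, e.g. run once a minute from cron:
# send_lines(load_average_lines("servers.couch1"), "graphite.example.com")
```

Having the load graphed next to the erlang vm stats makes it easier to tell whether the box itself was starved when a heart time-out fired.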

A+
Dave

