incubator-couchdb-user mailing list archives

From Nathan Vander Wilt <nate-li...@calftrail.com>
Subject Re: couchdb crashes silently
Date Mon, 11 Nov 2013 22:10:04 GMT
Aaaaand this happened *again* over the weekend. This time I had started CouchDB in a screen
session, which was still running. Again, it looked like the shell script processes and
the beam one were all still running, just…no Couch.

I had debug logs going; the stdout record shows the logger dying again, but not with any unicode
error type event, just the last log:
https://gist.github.com/natevw/dcd4a9a973da01270735

There is some "heart: Sat Nov  9 08:35:30 2013: heart-beat time-out, no activity for 26 seconds"
in the stderr log, but I'm not sure whether it's related…there seem to be a few more heart-beat
time-outs than actual CouchDB server failures.

Any concrete suggestions…? This sucks. I'm burnt out poking through debug logs on this,
I'm embarrassed and angry every time I discover my sites have been down for another day or
two because of this, and adding another layer of twine and baling wire in the form of a _second_
shell watchdog script is not at all exciting >:-(
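
For reference, the check such a second watchdog would need is one that probes the HTTP port rather than the pid, since the beam process can apparently stay up while Couch stops answering. A minimal Python sketch, assuming CouchDB listens on the default http://127.0.0.1:5984/ (the helper name `couch_is_alive` is hypothetical, not anything CouchDB ships):

```python
import http.client
import json
import urllib.request


def couch_is_alive(url="http://127.0.0.1:5984/", timeout=5.0):
    """Return True only if the server answers with CouchDB's welcome JSON.

    Assumption: the root URL returns a JSON object containing a "couchdb"
    key, as a stock CouchDB does. Adjust the URL for your own setup.
    """
    # An empty ProxyHandler bypasses any proxy set in the environment,
    # so the check always talks to the local server directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # OSError: refused connection, timeout, HTTP error status.
        # ValueError: body was not valid UTF-8 JSON.
        # HTTPException: malformed or truncated HTTP response.
        return False
    return isinstance(body, dict) and "couchdb" in body
```

A cron job or wrapper could call this once a minute and restart Couch only when it returns False, which would catch the "process alive but not accepting connections" case that a pid check misses.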

regards,
-natevw



On Nov 1, 2013, at 9:17 AM, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:

> 
> On Nov 1, 2013, at 12:10 AM, Dave Cottlehuber <dch@jsonified.com> wrote:
> 
>>> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt wrote:
>>> 
>>> Aaaand my Couch committed suicide again today. Unless this is  
>>> something different, I may have finally gotten lucky and had  
>>> CouchDB leave a note [eerily unfinished!] in the logs this time:  
>>> https://gist.github.com/natevw/fd509978516499ba128b  
>>> 
>>> ```
>>> ** Reason == {badarg,
>>> [{io,put_chars,
>>> [<0.93.0>,unicode,
>>> <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
>>> []},
>>> ```
>>> 
>>> So…now what? I have a rebuilt version of CouchDB I'm going to try  
>>> [once I figure out why *it* isn't starting] but this is still really  
>>> upsetting — I'm aware I could add my own cronjob or something to  
>>> check and restart if needed every minute, but a) the shell script  
>>> is SUPPOSED to be keeping CouchDB and b) it's NOT and c) this is  
>>> embarrassing and aggravating.
>>> 
>>> thanks,
>>> -natevw
>> 
>> So there’s 2 things here
>> 
>> - why the couch doesn’t get restarted?
>> 
>> Sounds very much like the aforementioned pid race condition. Wendall, do you know
>> any more about this? I thought you had some ideas about it, IIRC.
>> 
> 
> 
> I think I figured out the answer to this one, at least in the latest crash. The Erlang
> process the shell script watches was still running, just not accepting connections. I didn't
> notice this the previous times, though…I only realized it this time because when I went
> to restart, the shell script acted like it was already running. So maybe there are actually two
> crashes: a silent heartbeat one and this unicode one?
> 
> 
> 
>> - why io:put_chars/2 has trouble writing to a boring log file, which obviously works
>> most of the time.
>> 
>> <0.93.0>,unicode, <<"[Thu, 31 Oct 2013 19:48:48 GMT...">>
>> 
>> io:put_chars(Fd, unicode, <<Binary>>) doesn’t look right; there’s no io:put_chars/3.
>> 
>> This unicode looks weird, and from a quick look I can’t see where it should come from.
>> 
>> Can you get more of the logfile (like hundreds of lines) and stick it somewhere? Email is fine.
>> 
>> I’d like to see what happens to <0.93.0> (the process wrapping the log fd),
>> and also if the unicode atom turns up anywhere else prior.
> 
> 
> You want more of the log *up to* the crash? Because I have nothing *beyond* what is in
> that gist, that's the thing! The end of the log was cut off; I did not snip it. The log as
> it sits now has these exact lines in it:
> 
> ```
>                             {line,173}]},
>                           {gen_event,ser
> Apache CouchDB 1.4.0 (LogLevel=info) is starting.
> ```
> 
> (The subsequent "starting" is due to my intervention.)
> 
> -nvw

