incubator-couchdb-user mailing list archives

From Nathan Vander Wilt <nate-li...@calftrail.com>
Subject Re: couchdb crashes silently
Date Fri, 01 Nov 2013 16:10:52 GMT
Yes, the bootstrap shell script is broken. I filed https://issues.apache.org/jira/browse/COUCHDB-1885
but that has a stupid title and doesn't quite capture how broken it is. Basically, some of
the -k/-s logic got borked a while back and so IIRC you can't request a graceful restart of
CouchDB via the shell script (you have to kill the beam process *yourself* and then the script
will reload it).
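For anyone hitting the same thing, the manual workaround sketches out to a few lines of shell. This is only a sketch, not what the couchdb script itself does; the PIDFILE default and the `read_pid` helper are assumptions (use whatever pidfile path you actually run with):

```shell
#!/bin/sh
# Sketch of the manual workaround, since the script's -k handling is broken.
# PIDFILE is an assumption -- use whatever -p/-pidfile path you run with.
PIDFILE="${PIDFILE:-production_couch/couch.pid}"

read_pid() {
  # Print the pid stored in a pidfile, or nothing if it's missing/empty.
  if [ -s "$1" ]; then tr -d '[:space:]' < "$1"; fi
}

pid=$(read_pid "$PIDFILE")
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
  kill "$pid"    # kill the beam process yourself...
fi
# ...and then the still-running couchdb script (started with -r) reloads it.
```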

That aside, I don't think that is related in this case. At least the last time this instance
went down, the Erlang process _was still running_ just not accepting network connections.
So from the shell script's perspective, it didn't see the need to restart.
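That half-dead state (pid alive, port not answering) is exactly what a watchdog would have to test for directly. A minimal sketch, assuming bash (for its /dev/tcp pseudo-device) and the default 5984 port; neither helper name comes from the couchdb script:

```shell
#!/bin/bash
# Distinguish "the beam process exists" from "CouchDB accepts connections";
# the bootstrap script effectively only checks the former.
# Host/port below are assumptions -- match your bind_address/port settings.

proc_alive() {
  # Signal 0 delivers nothing but reports whether the pid exists.
  kill -0 "$1" 2>/dev/null
}

port_open() {
  # True if something accepts a TCP connection on host:port (bash /dev/tcp).
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# A saner watchdog would restart when the pid is alive but the port is dead:
# if proc_alive "$(cat "$PIDFILE")" && ! port_open 127.0.0.1 5984; then ... fi
```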

hth,
-natevw


On Oct 31, 2013, at 9:30 PM, Jim Klo <jim.klo@sri.com> wrote:

> I noticed this myself (the bootstrap shell script not working). I vaguely recall determining
> that the watchdog process doesn't correctly monitor the pid file. The logic in general was
> off - basically there's an edge condition not accounted for. I don't remember if I fixed
> the script or not - I'd have to hunt through my notes when I get back to a real computer.
> Something tells me I wrapped it in a cron job to clean up and restart, as I was under a
> deadline before the world nearly came to an end earlier this month.
> 
> Jim Klo
> Senior Software Engineer
> SRI International
> t: @nsomnac
> 
> On Oct 31, 2013, at 5:19 PM, "Nathan Vander Wilt" <nate-lists@calftrail.com> wrote:
> 
> Okay, I may have figured out why the shell script isn't restarting Couch. It seems it may
> not actually die all the way. I can't connect to it, but there is a process matching the pidfile:
> 
> 6417 ?        Sl    14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd
> -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu
> -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval
> 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini bc2/build/etc/couchdb/default.ini
> production_couch/local.ini -s couch -pidfile production_couch/couch.pid -heart
> 
> hth,
> -nvw
> 
> 
> 
> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:
> 
> Aaaand my Couch committed suicide again today. Unless this is something different, I may
> have finally gotten lucky and had CouchDB leave a note [eerily unfinished!] in the logs
> this time:
> https://gist.github.com/natevw/fd509978516499ba128b
> 
> ```
> ** Reason == {badarg,
>                [{io,put_chars,
>                     [<0.93.0>,unicode,
>                      <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
>                     []},
> ```
> 
> So…now what? I have a rebuilt version of CouchDB I'm going to try [once I figure out why
> *it* isn't starting], but this is still really upsetting. I'm aware I could add my own
> cronjob or something to check and restart it every minute, but a) the shell script is
> SUPPOSED to be keeping CouchDB up, b) it's NOT, and c) this is embarrassing and aggravating.
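> A stopgap cron check like that can be a one-liner. The following is a sketch only, assuming
> the default 5984 port and the build path that appears in the heart logs in this thread:
> 
> ```
> # hypothetical crontab fragment: if CouchDB stops answering HTTP, kick it
> * * * * * curl -sf -m 5 http://127.0.0.1:5984/ >/dev/null || /home/ubuntu/bc2/build/bin/couchdb -b
> ```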
> 
> thanks,
> -natevw
> 
> 
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:
> 
> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration
> options…]` and keep pulling up my sites only to find them dead. Seems to be about the same
> thing as others were reporting in this old thread… was there any resolution?
> 
> This is not an OOM thing: in dmesg I do see some killed processes (node), but never couchdb/beam,
> and NOTHING has been killed since I added swap several days ago. CouchDB was dead again this morning.
> 
> The only trace of trouble in the logs is in couch.stderr:
> 
> ```
> heart_beat_kill_pid = 32575
> heart_beat_timeout = 11
> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
> Killed
> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 13781
> heart_beat_timeout = 11
> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
> Killed
> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 15292
> heart_beat_timeout = 11
> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
> Killed
> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 29158
> heart_beat_timeout = 11
> ```
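> Reading those heart lines: the heart program appears to report the raw wait(2)-style status
> of the command it ran, so "-> 256" would mean `couchdb -k` actually exited with code 1, and
> heart then terminated instead of restarting anything. A quick sanity check of that arithmetic
> (the interpretation of heart's "-> 256" is an assumption, the bit math is not):
> 
> ```shell
> # If heart logs the raw wait() status, the shell-visible exit code
> # is the high byte of that status.
> status=256
> echo $(( status >> 8 ))   # prints 1
> ```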
> 
> 1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?
> 
> `erlang:display(erlang:system_info(otp_release)).`  says R15B
> 
> thanks,
> -natevw
> 
> 
> 
> On Sep 13, 2013, at 3:20 PM, James Marca <jmarca@translab.its.uci.edu> wrote:
> 
> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
> 
> couchdb version 1.4.0 (gentoo ebuild)
> 
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
> 
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> B.
> 
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
> 
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
> 
> The log looks like this:
> 
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
> 
> No warning that the server died or why, and nothing in /var/log/messages about anything
> untoward happening (no OOM killer invoked or anything like that).
> 
> The restart only happened because I manually did `/etc/init.d/couchdb restart`.
> Usually couchdb restarts itself, but not with this crash.
> 
> 
> 
> I flipped the log to debug level, and still had no warning about the crash:
> 
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
>       {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>       {'Content-Length',"346"},
>       {'Content-Type',"application/json"},
>       {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>       {'User-Agent',"CouchDB/1.4.0"},
>       {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
> 
> And that was it.  CouchDB was down and out.
> 
> I even tried shutting off the data processing on box B (so as to reduce the db load),
> but that didn't help (all the crashing has put it far behind in replicating with boxes A and C).
> 
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
> 
> Any clues or suggestions would be appreciated. I am currently going to try compiling
> from source directly, but I don't have much faith that it will make a difference.
> 
> Thanks,
> James Marca
> 
> 
> 
> 
> 

