Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (athena.apache.org: local policy includes SPF record at
 spf.trusted-forwarder.org)
From: Nathan Vander Wilt <nate-lists@calftrail.com>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_2FC34E02-B70E-4F13-A78A-288448B8F768"
Message-Id: <585606F8-9061-48F8-973D-5F0DA6E8347F@calftrail.com>
Mime-Version: 1.0 (Mac OS X Mail 7.0 \(1816\))
Subject: Re: couchdb crashes silently
Date: Fri, 1 Nov 2013 09:10:52 -0700
References: <20130913222006.GD2125@translab.its.uci.edu>
 <83F1EED8-93FA-43D1-93C6-777F6311F46E@calftrail.com>
 <8F734BE7-7AC1-4E47-88CB-F61AF7A1D06E@calftrail.com>,<16546A8D-5D74-4B4E-BC1C-63ABAFC3E3D7@calftrail.com>
 <8B3AB497-9A05-4934-AFB1-06041591A7F7@sri.com>
To: user@couchdb.apache.org
In-Reply-To: <8B3AB497-9A05-4934-AFB1-06041591A7F7@sri.com>

--Apple-Mail=_2FC34E02-B70E-4F13-A78A-288448B8F768
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=windows-1252

Yes, the bootstrap shell script is broken. I filed
https://issues.apache.org/jira/browse/COUCHDB-1885 but that has a stupid
title and doesn't quite capture how broken it is. Basically, some of the
-k/-s logic got borked a while back and so IIRC you can't request a
graceful restart of CouchDB via the shell script (you have to kill the
beam process *yourself* and then the script will reload it).

That aside, I don't think that is related in this case. At least the
last time this instance went down, the Erlang process _was still
running_, just not accepting network connections. So from the shell
script's perspective, there was nothing to restart.
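
For what it's worth, a check that would have caught this case has to
probe the HTTP port, not just the pidfile. Here is a minimal sketch in
Python. It is not what the bundled script does: the function names are
made up for illustration, the URL and pidfile path are assumptions (5984
is the port from the startup log quoted further down), and it assumes a
POSIX host.

```python
import http.client
import os
import urllib.error
import urllib.request


def pid_is_alive(pidfile):
    """True if the pid stored in `pidfile` names an existing process (POSIX)."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # no readable pidfile, or it does not hold a number
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def http_is_responsive(url, timeout=5.0):
    """True if a GET on `url` answers with HTTP 200 within `timeout` seconds."""
    # Skip any proxy from the environment: the probe targets the local node.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def needs_restart(pidfile, url, timeout=5.0):
    """A live pid alone is not enough; the node must also answer HTTP."""
    return not (pid_is_alive(pidfile) and http_is_responsive(url, timeout))
```

Something like `needs_restart("production_couch/couch.pid",
"http://127.0.0.1:5984/")` could then be run from cron or a supervisor,
which decides how to kill and relaunch the node.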

hth,
-natevw


On Oct 31, 2013, at 9:30 PM, Jim Klo <jim.klo@sri.com> wrote:

> I noticed this myself (the bootstrap shell script not working). I
> vaguely recall determining that the watchdog process doesn't
> correctly monitor the pid file. The logic in general was off - basically
> there's an edge condition not accounted for. I don't remember if I fixed
> the script or not - I'd have to hunt through my notes when I get back to
> a real computer. Something tells me I wrapped it in a cron job to clean
> up and restart, as I was under a deadline before the world nearly came
> to an end earlier this month.
>
> Jim Klo
> Senior Software Engineer
> SRI International
> t: @nsomnac
>
> On Oct 31, 2013, at 5:19 PM, "Nathan Vander Wilt" <nate-lists@calftrail.com> wrote:
>
> Okay, may have figured out why the shell script isn't restarting
> Couch. It seems it may not actually die all the way. I can't connect to
> it, but there is a process matching the pidfile:
>
> 6417 ?        Sl    14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini -s couch -pidfile production_couch/couch.pid -heart
>
> hth,
> -nvw
>
> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:
>
> Aaaand my Couch committed suicide again today. Unless this is something
> different, I may have finally gotten lucky and had CouchDB leave a note
> [eerily unfinished!] in the logs this time:
> https://gist.github.com/natevw/fd509978516499ba128b
>
> ```
> ** Reason == {badarg,
>                [{io,put_chars,
>                     [<0.93.0>,unicode,
>                      <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
>                     []},
> ```
>
> So… now what? I have a rebuilt version of CouchDB I'm going to try
> [once I figure out why *it* isn't starting] but this is still really
> upsetting. I'm aware I could add my own cronjob or something to check
> and restart if needed every minute, but a) the shell script is SUPPOSED
> to be keeping CouchDB running and b) it's NOT and c) this is
> embarrassing and aggravating.
>
> thanks,
> -natevw
>
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <nate-lists@calftrail.com> wrote:
>
> I am starting CouchDB 1.4.0 using
> `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]`
> and keep pulling up my sites and finding them dead too. Seems to be
> about the same thing as others are reporting in this old thread… was
> there any resolution?
>
> This is not an OOM thing: in dmesg I do see some killed processes
> (node) but never couchdb/beam, and NOTHING has been killed since I
> added swap several days ago. CouchDB was dead again this morning.
>
> The only trace of trouble in the logs is in couch.stderr:
>
> ```
> heart_beat_kill_pid = 32575
> heart_beat_timeout = 11
> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
> Killed
> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 13781
> heart_beat_timeout = 11
> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
> Killed
> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 15292
> heart_beat_timeout = 11
> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
> Killed
> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 29158
> heart_beat_timeout = 11
> ```
>
> 1. What are these "heart-beat time-out" logs about? Is that a clue to
> the trouble?
> 2. Regardless, why isn't the shell script restarting CouchDB after 5
> seconds like I told it to?
>
> `erlang:display(erlang:system_info(otp_release)).` says R15B
>
> thanks,
> -natevw
>
> On Sep 13, 2013, at 3:20 PM, James Marca <jmarca@translab.its.uci.edu> wrote:
>
> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
>
> couchdb version 1.4.0 (gentoo ebuild)
>
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
>
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> B.
>
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
>
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
>
> The log looks like this:
>
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
>
> There was no warning that the server died or why, and nothing in
> /var/log/messages about anything untoward happening (no OOM killer
> invoked or anything like that).
>
> The restart only happened because I manually did a
> /etc/init.d/couchdb restart
> Usually couchdb restarts itself, but not with this crash.
>
> I flipped the log to debug level, and still had no warning about the
> crash:
>
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
>       {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>       {'Content-Length',"346"},
>       {'Content-Type',"application/json"},
>       {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>       {'User-Agent',"CouchDB/1.4.0"},
>       {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
>
> And that was it.  CouchDB was down and out.
>
> I even tried shutting off the data processing (so as to reduce the db
> load) on box B, but that didn't help (all the crashing has put it far
> behind in replicating to boxes A and C).
>
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
>
> Any clues or suggestions would be appreciated.  I am currently going
> to try compiling from source directly, but I don't have much faith that
> it will make a difference.
>
> Thanks,
> James Marca


--Apple-Mail=_2FC34E02-B70E-4F13-A78A-288448B8F768--