From: Nathan Vander Wilt
To: user@couchdb.apache.org
Subject: Re: couchdb crashes silently
Date: Fri, 1 Nov 2013 09:10:52 -0700

Yes, the bootstrap shell script is broken. I filed https://issues.apache.org/jira/browse/COUCHDB-1885 but that has a stupid title and doesn't quite capture how broken it is. Basically, some of the -k/-s logic got borked a while back and so IIRC you can't request a graceful restart of CouchDB via the shell script (you have to kill the beam process *yourself* and then the script will reload it).

That aside, I don't think that is related in this case.
At least the last time this instance went down, the Erlang process _was still running_, just not accepting network connections. So from the shell script's perspective, it didn't see the need to restart.

hth,
-natevw

On Oct 31, 2013, at 9:30 PM, Jim Klo wrote:

> I noticed this myself (the bootstrap shell script not working). I vaguely recall determining that the watchdog process doesn't correctly monitor the pid file. The logic in general was off - basically there's an edge condition not accounted for. I don't remember if I fixed the script or not - I'd have to hunt through my notes when I get back to a real computer. Something tells me I wrapped it in a cron job to clean up and restart, as I was under a timeline before the world nearly came to an end earlier this month.
>
> Jim Klo
> Senior Software Engineer
> SRI International
> t: @nsomnac
>
> On Oct 31, 2013, at 5:19 PM, "Nathan Vander Wilt" wrote:
>
> Okay, may have figured out why the shell script isn't restarting Couch. It seems it may not actually die all the way. I can't connect to it, but there is a process matching the pidfile:
>
> 6417 ? Sl 14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini -s couch -pidfile production_couch/couch.pid -heart
>
> hth,
> -nvw
>
> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt wrote:
>
> Aaaand my Couch committed suicide again today. Unless this is something different, I may have finally gotten lucky and had CouchDB leave a note [eerily unfinished!]
> in the logs this time:
> https://gist.github.com/natevw/fd509978516499ba128b
>
> ```
> ** Reason == {badarg,
>     [{io,put_chars,
>       [<0.93.0>,unicode,
>        <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
>       []},
> ```
>
> So…now what? I have a rebuilt version of CouchDB I'm going to try [once I figure out why *it* isn't starting] but this is still really upsetting. I'm aware I could add my own cronjob or something to check and restart if needed every minute, but a) the shell script is SUPPOSED to be keeping CouchDB running and b) it's NOT and c) this is embarrassing and aggravating.
>
> thanks,
> -natevw
>
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt wrote:
>
> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and keep pulling up my sites finding them dead too. Seems to be about the same thing as others are reporting in this old thread…was there any resolution?
>
> This is not an OOM thing; in dmesg I do see some killed processes (node) but never couchdb/beam, and NOTHING killed after I added swap several days ago. CouchDB was dead again this morning.
>
> The only trace of trouble in the logs is in couch.stderr:
>
> ```
> heart_beat_kill_pid = 32575
> heart_beat_timeout = 11
> heart: Sat Oct 5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
> Killed
> heart: Sat Oct 5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 13781
> heart_beat_timeout = 11
> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
> Killed
> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 15292
> heart_beat_timeout = 11
> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
> Killed
> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 29158
> heart_beat_timeout = 11
> ```
>
> 1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?
>
> `erlang:display(erlang:system_info(otp_release)).` says R15B
>
> thanks,
> -natevw
>
> On Sep 13, 2013, at 3:20 PM, James Marca wrote:
>
> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
>
> couchdb version 1.4.0 (gentoo ebuild)
>
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
>
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> B.
>
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
>
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
>
> The log looks like this:
>
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
>
> no warning that the server died or why, and nothing in the
> /var/log/messages about anything untoward happening (no OOM killer
> invoked or anything like that)
>
> The restart only happened because I manually did a
> /etc/init.d/couchdb restart
> Usually couchdb restarts itself, but not with this crash.
>
> I flipped the log to debug level, and still had no warning about the crash:
>
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
> {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
> {'Content-Length',"346"},
> {'Content-Type',"application/json"},
> {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
> {'User-Agent',"CouchDB/1.4.0"},
> {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
>
> And that was it. CouchDB was down and out.
>
> I even tried shutting off the data processing (so as to reduce the db
> load) on box B, but that didn't help (all the crashing has put it far
> behind in replicating box A and C).
>
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
>
> Any clues or suggestions would be appreciated. I am currently going
> to try compiling from source directly, but I don't have much faith that
> it will make a difference.
>
> Thanks,
> James Marca
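
The failure mode described at the top of this thread (the beam process still alive and matching the pidfile, but not accepting network connections) is one that a pidfile-based watchdog cannot see. Below is a minimal sketch of an HTTP-level probe that a cron job could run instead. It assumes CouchDB listens on port 5984 as in the log above; the function names, the default URL, and the restart command are hypothetical and supplied by the caller, not part of CouchDB or its shell script.

```python
import http.client
import subprocess
import urllib.error
import urllib.request


def couch_is_responsive(url="http://127.0.0.1:5984/", timeout=5.0):
    """Return True if an HTTP GET to `url` answers with status 200 in time.

    The default URL is an assumption based on the port shown in the log.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, reset, malformed reply, or an
        # HTTP error status: treat all of these as "not responsive".
        return False


def check_and_restart(url, restart_command, timeout=5.0):
    """Run `restart_command` (an argv list) only when the probe fails.

    Returns True if a restart was attempted, False if the server answered.
    `restart_command` is whatever works on the host; nothing here knows
    how CouchDB was started.
    """
    if couch_is_responsive(url, timeout):
        return False
    subprocess.run(restart_command, check=False)
    return True
```

From cron, a call such as `check_and_restart("http://127.0.0.1:5984/", ["/etc/init.d/couchdb", "restart"])` would restart only when the port stops answering. The init script is the one James mentions; whether it suits any other setup is an assumption, and a hung beam process may need to be killed first, as noted above.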