From: Robert Newson
To: user@couchdb.apache.org
Subject: Re: couchdb crashes silently
Date: Sat, 14 Sep 2013 11:00:49 +0100
In-Reply-To: <20130913222006.GD2125@translab.its.uci.edu>

We should really remove that init.d daemon script and replace it with
runit. That way you a) are guaranteed a restart on crash and b)
stdout/err is automatically captured (and rotated). In my experience the
stdout/err in these events is very useful.

To switch, you need runit (obviously) and then a short stanza that starts
couchdb in the foreground; there's a switch for that. Alternatively, start
it in the foreground in a terminal (as the couchdb user) and pound the
server until it crashes.

I've no operational experience with the R16 series, unfortunately. All I do
know is that, since R15, the new process scheduler can interact poorly with
NIFs that perform work lasting over a millisecond, which I could imagine
happening for JSON encoding/decoding of large documents.

If it were a case of running out of file descriptors or sockets, I would
expect some useful noise in the log, but we can't rule it out yet.

B.

On 13 Sep 2013, at 23:20, James Marca wrote:

> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
>
> couchdb version 1.4.0 (gentoo ebuild)
>
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
>
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> B.
>
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
>
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
>
> The log looks like this:
>
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
>
> no warning that the server died or why, and nothing in the
> /var/log/messages about anything untoward happening (no OOM killer
> invoked or anything like that)
>
> The restart only happened because I manually did a
> /etc/init.d/couchdb restart
> Usually couchdb restarts itself, but not with this crash.
>
> I flipped the log to debug level, and still had no warning about the crash:
>
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
>           {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>           {'Content-Length',"346"},
>           {'Content-Type',"application/json"},
>           {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>           {'User-Agent',"CouchDB/1.4.0"},
>           {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
>
> And that was it. CouchDB was down and out.
>
> I even tried shutting off the data processing (so as to reduce the db
> load) on box B, but that didn't help (all the crashing has put it far
> behind in replicating box A and C).
>
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
>
> Any clues or suggestions would be appreciated. I am currently going
> to try compiling from source directly, but I don't have much faith that
> it will make a difference.
>
> Thanks,
> James Marca
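
Since the server leaves no crash message, one way to at least pin down when
it went away is to look for the silent gap that precedes each "Apache
CouchDB has started" line in the log quoted above. The following is only a
rough sketch, assuming the timestamp layout shown in that excerpt and
English month names; the function name find_restarts and the regular
expression are illustrative and not part of CouchDB.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted log, e.g.
# "[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started ..."
# Assumption: every primary log line follows this layout.
LINE_RE = re.compile(
    r"^\[\w{3}, (\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2}) GMT\] "
    r"\[(\w+)\] \[<[\d.]+>\] (.*)$"
)
STARTUP_MARKER = "Apache CouchDB has started"


def find_restarts(lines):
    """Return (last_seen, started_at, gap_seconds) for every startup line
    that is preceded by earlier timestamped log activity."""
    restarts = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            # Continuation lines (such as the header dump) carry no timestamp.
            continue
        # %b depends on the locale; English month names are assumed here.
        stamp = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
        if STARTUP_MARKER in match.group(3) and previous is not None:
            gap = (stamp - previous).total_seconds()
            restarts.append((previous, stamp, gap))
        previous = stamp
    return restarts


# Two lines taken from the log excerpt in the original message.
sample = [
    "[Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - "
    "GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404",
    "[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] "
    "Apache CouchDB has started on http://0.0.0.0:5984/",
]

for last_seen, started_at, gap in find_restarts(sample):
    print(f"silent gap of {gap:.0f} s before restart at {started_at}")
# prints: silent gap of 295 s before restart at 2013-09-13 15:48:23
```

The last timestamp before each gap is a reasonable starting point for
matching against /var/log/messages or whatever stdout/err capture is in
place. Note that the gap also includes the time until someone restarted the
server by hand, so it is an upper bound on the crash time, not the crash
time itself.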