couchdb-user mailing list archives

From James Marca <jma...@translab.its.uci.edu>
Subject couchdb crashes silently
Date Fri, 13 Sep 2013 22:20:06 GMT
I am seeing a lot of random, silent crashes on just *one* of my
CouchDB servers.

couchdb version 1.4.0 (gentoo ebuild)

erlang also from gentoo ebuild: 
Erlang (BEAM) emulator version 5.10.2
Compiled on Fri Sep 13 08:39:20 2013
Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
[async-threads:10] [kernel-poll:false]

I've got 3 servers running couchdb, A, B, C, and only B is crashing.
All of them are replicating a single db between them, with B acting as
the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
B.
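
For anyone trying to reproduce the setup, the topology above can be sketched as a set of _replicator documents, one per push. This is only an illustration: the host names and document ids are hypothetical (the servers are only called A, B and C here), and the database name is the one that appears in the log excerpt further down.

```python
from urllib.parse import quote

# Hypothetical host names; the post only identifies the servers as A, B and C.
HOSTS = {
    "A": "http://a.example.org:5984",
    "B": "http://b.example.org:5984",
    "C": "http://c.example.org:5984",
}

# Database name as it appears (URL-encoded) in the log excerpt.
DB = "carb/grid/state4k/hpms"

# (source, target) pairs: A pushes to B, B pushes to A and C, C pushes to B.
PUSHES = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")]


def replication_doc(source, target):
    """Build a _replicator document, stored on `source`, that pushes DB to `target`."""
    return {
        "_id": f"push_{source}_to_{target}",  # hypothetical id scheme
        "source": DB,  # local database on the pushing server
        "target": f"{HOSTS[target]}/{quote(DB, safe='')}",  # slashes must be encoded
        "continuous": True,
    }


# Group the documents by the server whose _replicator database would hold them.
docs_by_server = {}
for src, dst in PUSHES:
    docs_by_server.setdefault(src, []).append(replication_doc(src, dst))
```

Grouped this way, B holds two push documents while A and C hold one each, which matches B being the "hub".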

All three servers have data crunching jobs running that are reading
and writing to the database that is being replicated around.

The B server, the one in the middle that is push replicating to both A
and C, is the one that is crashing.

The log looks like this:

[Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
[Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).

There is no warning that the server died or why, and nothing in
/var/log/messages about anything untoward happening (no OOM killer
invoked or anything like that).
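
Since the only trace of a crash is the jump in timestamps before the "has started" line, one way to locate each crash is to scan the log for that line and report the entry just before it. A rough sketch, assuming the log line layout shown above (one entry per line); the function name and the sample lines are only for illustration:

```python
import re
from email.utils import parsedate_to_datetime

# Matches "[timestamp] [level] [pid] message" as seen in the excerpt above.
LINE_RE = re.compile(r"^\[([^\]]+)\] \[(\w+)\] \[([^\]]+)\] (.*)$")
START_MARKER = "Apache CouchDB has started"


def find_restarts(lines):
    """Return (last_line_before_restart, gap_in_seconds) for each server start."""
    restarts = []
    last = None  # (timestamp, raw line) of the most recent parsed entry
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # continuation lines (headers, wrapped text) carry no timestamp
        stamp = parsedate_to_datetime(match.group(1))
        if START_MARKER in match.group(4) and last is not None:
            restarts.append((last[1], (stamp - last[0]).total_seconds()))
        last = (stamp, line)
    return restarts


sample = [
    "[Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404",
    "[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/",
]
restarts = find_restarts(sample)
```

On the two sample lines this reports a single restart with a gap of 295 seconds, i.e. the time between the last logged request and my manual restart, not the exact moment of the crash.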

The restart only happened because I manually did a 
/etc/init.d/couchdb restart
Usually couchdb restarts itself, but not with this crash.



I flipped the log to debug level, and still had no warning about the crash:

[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
Headers: [{'Accept',"application/json"},
          {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
          {'Content-Length',"346"},
          {'Content-Type',"application/json"},
          {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
          {'User-Agent',"CouchDB/1.4.0"},
          {"X-Couch-Full-Commit","false"}]
[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes

And that was it.  CouchDB was down and out.

I even tried shutting off the data processing on box B (so as to
reduce the db load), but that didn't help (all the crashing has put it
far behind in replicating to boxes A and C).

My guess is that the replication load is too big (too many
connections, too much data being pushed in), but I would expect some
sort of warning before the server dies.  
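
If the replication load really is the cause, one experiment would be to throttle the replicator on B through the [replicator] config section. The sketch below only builds the HTTP requests for the 1.x /_config API and does not send them; the base URL is an assumption, and the values are arbitrary illustrations, not recommendations or a known fix.

```python
import json
from urllib.request import Request

BASE = "http://localhost:5984"  # assumption: run on box B itself

# Illustrative values only; whether lowering them helps is untested.
SETTINGS = {
    "worker_processes": "2",
    "worker_batch_size": "100",
    "http_connections": "10",
}


def config_requests(base, settings):
    """Build one PUT per setting against /_config/replicator/<key>."""
    reqs = []
    for key, value in settings.items():
        reqs.append(
            Request(
                f"{base}/_config/replicator/{key}",
                data=json.dumps(value).encode("utf-8"),  # body is a JSON string
                headers={"Content-Type": "application/json"},
                method="PUT",
            )
        )
    return reqs


# Sending these (e.g. with urllib.request.urlopen) would also need admin
# credentials on a server that has admins defined.
requests_to_send = config_requests(BASE, SETTINGS)
```

The same keys could equally be set by hand in local.ini under [replicator].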

Any clues or suggestions would be appreciated.  I am currently going
to try compiling from source directly, but I don't have much faith that
it will make a difference.

Thanks,
James Marca
