incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Teijo Holzer <thol...@wetafx.co.nz>
Subject Recovering from a multi-node cluster failure caused by OOM on repairs
Date Wed, 27 Jul 2011 02:21:32 GMT
Hi,

I thought I share the following with this mailing list as a number of other
users seem to have had similar problems.

We have the following set-up:

OS: CentOS 5.5
RAM: 16GB
JVM heap size: 8GB (also tested with 14GB)
Cassandra version: 0.7.6-2 (also tested with 0.7.7)
Oracle JDK version: 1.6.0_26
Number of nodes: 5
Load per node: ~40GB
Replication factor: 3
Number of requests/day: 2.5 Million (95% inserts)
Total net insert data/day: 1GB
Default TTL for most of the data: 10 days

This set-up has been operating successfully for a few months, however recently
we started seeing multi-node failures, usually triggered by a repair, but 
occasionally also under normal operation. A repair on node 3,4 and 5 would
always cause the cluster as whole to fail, whereas node 1 & 2 completed their
repair cycles successfully.

These failures would usually result in 2 or 3 nodes becoming unresponsive and
dropping out of the cluster, resulting in client failure rates to spike up to
~10%. We normally operate with a failure rate of <0.1%.

The relevant log entries showed a complete heap memory exhaustion within 1
minute (see log lines below where we experimented with a larger heap size of
14GB). Also of interest was a number of huge SliceQueryFilter collections
running concurrently on the nodes in question (see log lines below).

The way we ended recovering from this situation was as follows. Remember these
steps were taken to get an unstable cluster back under control, so you might
want to revert some of the changes once the cluster is stable again.

Set "disk_access_mode: standard" in cassandra.yaml
This allowed us to prevent the JVM blowing out the hard limit of 8GB via large
mmaps. Heap size was set to 8GB (RAM/2). That meant the JVM was never using
more than 8GB total. mlockall didn't seem to make a difference for our
particular problem.

Turn off all row & key caches via cassandra-cli, e.g.
update column family Example with rows_cached=0;
update column family Example with keys_cached=0;
We were seeing compacted row maximum sizes of ~800MB from cfstats, that's why
we turned them all off. Again, we saw a significant drop in the actual memory
used from the available maximum of 8GB. Obviously, this will affect reads, but
as 95% of our requests are inserts, it didn't matter so much for us.

Bootstrap problematic node:
Kill Cassandra
Change "auto_bootstrap: true" in cassandra.yaml, remove own IP address from
list of seeds (important)
Delete all data directories (i.e. commit-log, data, saved-caches)
Start Cassandra
Wait for bootstrap to finish (see log & nodetool)
Change "auto_bootstrap: false"
(Run repair)

The first bootstrap completed very quickly, so we decided to bootstrap every
node in the cluster (not just the problematic ones). This resulted in some data
loss. The next time we will follow the bootstrap by a repair before
bootstrapping & repairing the next node to minimize data loss.

After this procedure, the cluster was operating normally again.

We now run a continuous rolling repair, followed by a (major) compaction and a
manual garbage collection. As the repairs a required anyway, we decided to run
them all the time in a continuous fashion. Therefore, potential problems can be 
identified earlier.

The major compaction followed by a manual GC allows us to keep the disk usage 
low on each node. The manual GC is necessary as the unused files on disk are 
only really deleted when the reference is garbage collected inside the JVM (a 
restart would achieve the same).

We also collected some statistics in regards to the duration of some of the
operations:

cleanup/compact: ~1 min/GB
repair: ~2-3 min/GB
bootstrap: ~1 min/GB

This means that if you have a node with 60GB of data, it will take ~1hr to
compact and ~2-3hrs to repair. Therefore, it is advisable to keep the data per
node below ~120GB. We achieve this by using an aggressive TTL on most of our
writes.

Cheers,

    Teijo

Here are the relevant log entries showing the OOM conditions:


[2011-07-21 11:12:11,059] INFO: GC for ParNew: 1141 ms, 509843976 reclaimed
leaving 1469443752 used; max is 14675869696 (ScheduledTasks:1 GCInspector.java:128)
[2011-07-21 11:12:15,226] INFO: GC for ParNew: 1149 ms, 564409392 reclaimed
leaving 2247228920 used; max is 14675869696 (ScheduledTasks:1 GCInspector.java:128)
...
[2011-07-21 11:12:55,062] INFO: GC for ParNew: 1110 ms, 564365792 reclaimed
leaving 12901974704 used; max is 14675869696 (ScheduledTasks:1
GCInspector.java:128)

[2011-07-21 10:57:23,548] DEBUG: collecting 4354206 of 2147483647:
940657e5b3b0d759eb4a14a7228ae365:false:41@1311102443362542 (ReadStage:27
SliceQueryFilter.java:123)

Mime
View raw message