incubator-cassandra-user mailing list archives

From "C. Scott Andreas" <csco...@urbanairship.com>
Subject Flush / Snapshot Triggering Full GCs, Leaving Ring
Date Wed, 06 Apr 2011 19:07:22 GMT
Hello,

We're running a six-node 0.7.4 ring in EC2 on m1.xlarge instances with a 4GB heap (15GB total
memory, 4 cores, dataset fits in RAM, storage on ephemeral disk). We've noticed a brief flurry
of query failures during the night coinciding with our backup schedule. More specifically,
our logs suggest that calling "nodetool snapshot" on a node triggers 12 to 16 second
CMS GCs and a promotion failure that results in a full stop-the-world collection, during which
the rest of the ring marks the node dead until it rejoins shortly after.

Here's a log from one of the nodes, along with system info and JVM options: https://gist.github.com/e12c6cae500e118676d1

At 13:15:00, our backup cron job runs, calling nodetool flush and then nodetool snapshot.
(After investigating, we noticed that calling both flush and snapshot is unnecessary, since
snapshot flushes memtables itself, and we've since updated the script to call snapshot only;
a sketch of the job follows the log line below.) While the memtables are being written, we'll
generally see a GC logged by Cassandra such as:

"GC for ConcurrentMarkSweep: 16113 ms, 1755422432 reclaimed leaving 1869123536 used; max is
4424663040."
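
For reference, the updated backup job is now essentially just the following (the nodetool
path and host are illustrative, not our exact script):

  #!/bin/sh
  # Nightly Cassandra backup cron job (simplified sketch).
  # `nodetool snapshot` flushes memtables itself, which is why the
  # separate `nodetool flush` we used to run beforehand was redundant.
  /opt/cassandra/bin/nodetool -h localhost snapshot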

In the JVM GC logs, we'll often see a promotion failure into the tenured generation during
this collection, resulting in a full stop-the-world GC like the following (taken from a
different node; note that the old generation is roughly 95% full, 3904635K used of a 4109120K
capacity, when the concurrent mode failure hits):

1180629.380: [CMS1180634.414: [CMS-concurrent-mark: 6.041/6.468 secs] [Times: user=8.00 sys=0.10, real=6.46 secs]
 (concurrent mode failure): 3904635K->1700629K(4109120K), 16.0548910 secs] 3958389K->1700629K(4185792K), [CMS Perm : 19610K->19601K(32796K)], 16.1057040 secs] [Times: user=14.39 sys=0.02, real=16.10 secs]

During the GC, the rest of the ring will shun the node, and when the collection completes,
the node will mark all other hosts in the ring as dead. The node and the ring stabilize
shortly afterward, once they detect each other as up again and hinted handoff completes
(details in the log).

Yesterday we enabled JNA on one of the nodes to prevent Cassandra from forking a subprocess
to call `ln` during a snapshot. We still observed a concurrent mode failure collection
following a flush/snapshot, but the CMS collection was shorter (9 seconds) and did not result
in the node being shunned from the ring.
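
For context, without JNA Cassandra creates each snapshot hard link by fork/exec'ing `ln`
once per SSTable file, roughly like this (paths are purely illustrative); with JNA it can
make the equivalent link(2) call in-process instead of forking the 4GB-heap JVM:

  # What the forked subprocess does for each data file (illustrative paths):
  ln /var/lib/cassandra/data/MyKeyspace/MyCF-f-123-Data.db \
     /var/lib/cassandra/data/MyKeyspace/snapshots/mysnapshot/MyCF-f-123-Data.db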

While the query failures that result from this activity are brief, our retry threshold for
timeout exceptions is set to 6, and we're concerned we're exceeding it. We'd like to figure
out why we see long CMS collections and promotion failures triggering full GCs during a
snapshot.

Has anyone seen this, or does anyone have suggestions on how to prevent full GCs from
occurring during a flush/snapshot?

Thanks,

- Scott

---

C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com